[Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances, Volume 96


Post on 23-Dec-2016












  • Aboul-Ella Hassanien, Ajith Abraham and Janusz Kacprzyk (Eds.)

    Computational Intelligence in Multimedia Processing: Recent Advances

  • Studies in Computational Intelligence, Volume 96

    Editor-in-chief
    Prof. Janusz Kacprzyk
    Systems Research Institute
    Polish Academy of Sciences
    ul. Newelska 6
    01-447 Warsaw
    Poland
    E-mail: kacprzyk@ibspan.waw.pl

    Further volumes of this series can be found on our homepage: springer.com

    Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.), Advanced Intelligent Paradigms in Computer Games, 2007, ISBN 978-3-540-72704-0

    Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.), Computation Intelligence for Agent-based Systems, 2007, ISBN 978-3-540-73175-7

    Vol. 73. Petra Perner (Ed.), Case-Based Reasoning on Images and Signals, 2008, ISBN 978-3-540-73178-8

    Vol. 74. Robert Schaefer, Foundation of Global Genetic Optimization, 2007, ISBN 978-3-540-73191-7

    Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.), Hybrid Evolutionary Algorithms, 2007, ISBN 978-3-540-73296-9

    Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.), Autonomous Robots and Agents, 2007, ISBN 978-3-540-73423-9

    Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.), Perspectives of Neural-Symbolic Integration, 2007, ISBN 978-3-540-73953-1

    Vol. 78. Costin Badica and Marcin Paprzycki (Eds.), Intelligent and Distributed Computing, 2008, ISBN 978-3-540-74929-5

    Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.), Quantitative Information Fusion for Hydrological Sciences, 2008, ISBN 978-3-540-75383-4

    Vol. 80. Joachim Diederich, Rule Extraction from Support Vector Machines, 2008, ISBN 978-3-540-75389-6

    Vol. 81. K. Sridharan, Robotic Exploration and Landmark Determination, 2008, ISBN 978-3-540-75393-3

    Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.), Engineering Evolutionary Intelligent Systems, 2008, ISBN 978-3-540-75395-7

    Vol. 83. Bhanu Prasad and S.R.M. Prasanna (Eds.), Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, 2008, ISBN 978-3-540-75397-1

    Vol. 84. Marek R. Ogiela and Ryszard Tadeusiewicz, Modern Computational Intelligence Methods for the Interpretation of Medical Images, 2008, ISBN 978-3-540-75399-5

    Vol. 85. Arpad Kelemen, Ajith Abraham and Yulan Liang (Eds.), Computational Intelligence in Medical Informatics, 2008, ISBN 978-3-540-75766-5

    Vol. 86. Zbigniew Les and Mogdalena Les, Shape Understanding Systems, 2008, ISBN 978-3-540-75768-9

    Vol. 87. Yuri Avramenko and Andrzej Kraslawski, Case Based Design, 2008, ISBN 978-3-540-75705-4

    Vol. 88. Tina Yu, David Davis, Cem Baydar and Rajkumar Roy (Eds.), Evolutionary Computation in Practice, 2008, ISBN 978-3-540-75770-2

    Vol. 89. Ito Takayuki, Hattori Hiromitsu, Zhang Minjie and Matsuo Tokuro (Eds.), Rational, Robust, Secure, 2008, ISBN 978-3-540-76281-2

    Vol. 90. Simone Marinai and Hiromichi Fujisawa (Eds.), Machine Learning in Document Analysis and Recognition, 2008, ISBN 978-3-540-76279-9

    Vol. 91. Horst Bunke, Kandel Abraham and Last Mark (Eds.), Applied Pattern Recognition, 2008, ISBN 978-3-540-76830-2

    Vol. 92. Ang Yang, Yin Shan and Lam Thu Bui (Eds.), Success in Evolutionary Computation, 2008, ISBN 978-3-540-76285-0

    Vol. 93. Manolis Wallace, Marios Angelides and Phivos Mylonas (Eds.), Advances in Semantic Media Adaptation and Personalization, 2008, ISBN 978-3-540-76359-8

    Vol. 94. Arpad Kelemen, Ajith Abraham and Yuehui Chen (Eds.), Computational Intelligence in Bioinformatics, 2008, ISBN 978-3-540-76802-9

    Vol. 95. Radu Dogaru, Systematic Design for Emergence in Cellular Nonlinear Networks, 2008, ISBN 978-3-540-76800-5

    Vol. 96. Aboul-Ella Hassanien, Ajith Abraham and Janusz Kacprzyk (Eds.), Computational Intelligence in Multimedia Processing: Recent Advances, 2008, ISBN 978-3-540-76826-5

  • Aboul-Ella Hassanien, Ajith Abraham, Janusz Kacprzyk (Eds.)

    Computational Intelligence in Multimedia Processing: Recent Advances

    With 196 Figures and 29 Tables


  • Aboul-Ella Hassanien
    Quantitative and Information System Department
    Kuwait University College of Business Administration
    P.O. Box 5486
    Safat Code No. 13055
    Kuwait

    Ajith Abraham
    Centre for Quantifiable Quality of Service in Communication Systems (Q2S)
    Centre of Excellence
    Norwegian University of Science and Technology
    O.S. Bragstads plass 2E
    7491 Trondheim
    Norway

    Janusz Kacprzyk
    Systems Research Institute
    Polish Academy of Sciences
    Newelska 6
    01-447 Warsaw


    ISBN 978-3-540-76826-5 e-ISBN 978-3-540-76827-2

    Studies in Computational Intelligence ISSN 1860-949X

    Library of Congress Control Number: 2007940846

    © 2008 Springer-Verlag Berlin Heidelberg

    This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

    The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

    Cover design: Deblik, Berlin, Germany

    Printed on acid-free paper

    9 8 7 6 5 4 3 2 1


  • Preface

    Multimedia uses multiple forms of information content and processing, mainly text, audio, graphics, animation, and video, for communication to cater to various user demands. Today, multimedia presentations are used in movies, education, entertainment, marketing, advertising, information services, teleconferencing, publishing, interactive television, product demonstration, and the like. Because of the rapid transfer of information and a growing need to present this information in a powerful way, only individuals who have appropriate skills and knowledge to communicate effectively will succeed in the multimedia industry.

    In the last few years, multimedia processing has emerged as an important technology to generate content based on images, audio, graphics, animation, full-motion video, and text, and it has opened a wide range of applications by combining these different information sources, thus giving insights into the interpretation of multimedia content. Furthermore, recent developments such as high-definition multimedia content and interactive television can lead to the generation of a huge volume of data and imply serious computing problems connected with the creation, processing, and management of multimedia content. Multimedia processing is a challenging domain for several reasons: it demands both high computational power and high memory bandwidth; it is a multi-rate computing problem; and it requires low-cost implementations for high-volume markets.

    Computational intelligence is one of the most exciting and rapidly expanding fields, attracting a large number of scholars, researchers, engineers and practitioners working in such areas as rough sets, neural networks, fuzzy logic, evolutionary computing, artificial immune systems, and swarm intelligence. Computational intelligence has been a tremendously active area of research for the past decade or so. There are many successful applications of computational intelligence in many subfields of multimedia, including image processing or retrieval, audio processing, and text processing. However, there are still numerous open problems in multimedia processing, exemplified by multimedia communication, multimedia computing and computer animation, that urgently need advanced and efficient computational methodologies to deal with the huge volumes of data generated by these problems.

    This volume provides up-to-date, state-of-the-art coverage of diverse aspects related to computational intelligence in multimedia processing. It addresses the use of different computational intelligence-based approaches to various problems in multimedia computing, networking and communications such as video processing, virtual reality, movies, audio processing, information graphics in multimodal documents, multimedia task scheduling, modeling interactive nonlinear stories, video authentication, text localization in images, organizing multimedia information, and visual sensor networks. This volume comprises 19 chapters, including an overview chapter providing an up-to-date, state-of-the-art review of the current literature on computational intelligence-based approaches to various problems in multimedia computing and communication, and some important research challenges.

    The book is divided into five parts devoted to: foundation of computational intelligence in multimedia processing; computational intelligence in 3D multimedia virtual environment and video games; computational intelligence in image/audio processing; computational intelligence in multimedia networks and task scheduling; and computational intelligence in video processing.

    The part on Foundation of computational intelligence in multimedia processing contains two introductory chapters. It presents a broad overview of computational intelligence (CI) techniques including Neural Network (NN), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Fuzzy Set (FS), Reinforcement Learning (RL) and Rough Sets (RS). In addition, it gives a very brief introduction to near sets and near images, which offer a generalization of traditional rough set theory and a new approach to classifying perceptual objects by means of features in solving multimedia problems. A review of the current literature on CI-based approaches to various problems in multimedia computing, networking and communications is presented. Challenges to be addressed and future directions of research are also presented.

    Chapter 1, by Aboul-Ella Hassanien, Ajith Abraham, Janusz Kacprzyk, and James F. Peters, presents a review of the current literature on computational intelligence-based approaches to various problems in multimedia computing such as speech, audio and image processing, video watermarking, content-based multimedia indexing, and retrieval. The chapter also discusses some representative methods to provide inspiring examples to illustrate how CI could be applied to resolve multimedia computing problems and how multimedia could be analyzed, processed, and characterized by computational intelligence.

    Chapter 2, by Parthasarathy Guturu, presents a review of the current literature on computational intelligence-based approaches to various problems in multimedia networking and communications such as call admission control, management of resources and traffic, routing, multicasting, media composition, encoding, media streaming and synchronization, and on-demand servers and services.

    The part on Computational intelligence in 3D multimedia virtual environment and video games contains four chapters. It discusses the application of computational intelligence techniques in the area of virtual environments (in which humans can interact with a virtual 3D scene and navigate through a virtual environment) and music information retrieval approaches. Dynamic models are also employed to obtain a more formal design process for (story-driven) games and to improve the current approaches to interactive storytelling.

    In Chap. 3, Ronald Genswaider, Helmut Berger, Michael Dittenbach, Andreas Pesenhofer, Dieter Merkl, Andreas Rauber, and Thomas Lidy introduce the MediaSquare, a synthetic 3D multimedia environment that allows multiple users to collectively explore multimedia data and interact with each other. The data is organized within the 3D virtual world either based on content similarity, or by mapping a given structure (e.g. a branch of a file system hierarchy) into a room structure. With this system it is possible to take advantage of spatial metaphors such as relations between items in space, proximity and action, common reference and orientation, as well as reciprocity.

    In Chap. 4, Tauseef Gulrez, Manolya Kavakli, and Alessandro Tognetti develop a testbed for robot-mediated neurorehabilitation therapy that combines the use of robotics, computationally intelligent virtual reality, and haptic interfaces. They employ the theories of neuroscience and rehabilitation to develop methods for the treatment of neurological injuries such as stroke, spinal cord injury, and traumatic brain injury. As sensor input they use two state-of-the-art technologies, representing two different approaches to solving the mobility loss problem. In their experiment, a shirt laden with 52 piezo-resistive sensors was used as an input device to capture the residual signals arising from the patient's body.

    In Chap. 5, Fabio Zambetta builds the case for a story-driven approach to the design of a computer role-playing game using a mathematical model of political balance and conflict and scripting based on fuzzy logic. The model introduced differs from a standard HCP (hybrid control process) by the use of fuzzy logic (or fuzzy-state machines) to handle events, while an ordinary differential equation is needed to generate a continuous level of conflict over time. By using this approach, not only can game designers express game play properties formally using a quasi-natural language, but they can also offer a diverse role-playing experience to their players. The interactive game stories designed with this methodology can change under the pressure of a variable political balance, and offer a different and innovative game play style.

    Time flow is the distinctive structure of various kinds of data, such as multimedia movies, electrocardiograms, and stock price quotes. To make good use of these data, locating a desired instant or interval along the time axis is indispensable. In addition to domain-specific methods like automatic TV program segmentation, there should be a common means to search these data according to the changes along the time flow. Chapter 6, by Ken Nakayama et al., presents the I-string and I-regular expression framework with some examples and a matching algorithm. I-string is a symbolic string-like annotation model for continuous media which has a virtual continuous branchless time flow. I-regular expression is a pattern language over I-string, which is an extension of conventional regular expressions for text search. Although continuous media are often treated as a sequence of time-sliced data in practice, the framework adopts continuous time flow. This abstraction allows the annotation and search query to be independent of low-level implementation details such as frame rate.

    Computational intelligence in image/audio processing is the third part of the book. It contains six chapters discussing the application of computational intelligence techniques in image and audio processing.

    In Chap. 7, J.C. Barca, G. Rumantir, and R. Li present a set of illuminated contour-based markers for optical motion capture, along with a modified K-means algorithm that can be used for removing inter-frame noise. The new markers appear to have features that solve and/or reduce several of the drawbacks associated with other marker systems currently available for optical motion capture. They provide solutions to central problems with the current standard spherical flashing LED-based markers. The modified K-means algorithm is guided by constraints on the compactness and number of data points per cluster. Experiments on the presented algorithm and findings in the literature indicate that this noise-removing algorithm outperforms standard filtering algorithms such as the mean and median because it is capable of completely removing noise with both spike and Gaussian characteristics.
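    For readers unfamiliar with the baseline filters named above, the following minimal Python sketch shows how sliding mean and median filters treat a single spike. It illustrates only those standard baselines, not the modified K-means algorithm of Chap. 7; the function name `sliding_filter`, the window width, and the sample track are assumptions made for illustration.

```python
from statistics import mean, median

def sliding_filter(signal, width, reducer):
    """Apply `reducer` (e.g. mean or median) over a centred sliding window.

    Windows are truncated at the ends of the signal rather than padded.
    """
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(reducer(window))
    return out

# Hypothetical 1-D marker coordinate with one spike at index 3.
track = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]

mean_filtered = sliding_filter(track, 3, mean)      # spike is smeared over neighbours
median_filtered = sliding_filter(track, 3, median)  # isolated spike is discarded
```

    In this toy case the median filter removes the isolated spike while the mean filter spreads it over neighbouring samples; neither is designed to handle spike and Gaussian noise together, which is the gap the chapter's algorithm is said to address.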

    In Chap. 8, Sandra Carberry and Stephanie Elzer present a corpus study that shows the importance of taking information graphics into account when processing a multimodal document. It then presents a Bayesian network approach to identifying the message conveyed by one kind of information graphic, simple bar charts, along with an evaluation of the graph understanding system.

    In Chap. 9, Klaas Bosteels and Etienne E. Kerre present a recently introduced triparametric family of fuzzy similarity measures, together with several constraints on its parameters that warrant certain potentially desirable or useful properties. In particular, they present constraints for several forms of restrictability, which allow reducing the computation time in practical applications. They use some members of this family to construct various audio similarity measures based on spectrum histograms and fluctuation patterns.

    Chapter 10, by Przemyslaw Gorecki, Laura Caponetti, and Ciro Castiello, deals with the particular problem of text localization, which aims at determining the exact location where the text is situated inside a document image. The strict connection between text localization and image segmentation is highlighted in the chapter and a review of methods for image segmentation is provided. In particular, the benefits of employing fuzzy and neuro-fuzzy techniques in this field are assessed, thus indicating a way to combine computational intelligence methods and document image analysis. Three specific methods based on image segmentation are presented to show different applications of fuzzy and neuro-fuzzy techniques in the context of text localization.

    In Chap. 11, Kui Wu and Kim-Hui Yap present a soft-labeling framework that addresses the small sample problem in interactive CBIR systems. The technique incorporates soft-labeled images into the fuzzy support vector machine (FSVM) for effective learning along with labeled images for effective retrieval. By exploiting the characteristics of the labeled images, soft-labeled images are selected through an unsupervised clustering algorithm. Further, the relevance of the soft-labeled images is estimated using the fuzzy membership function. The FSVM-based active learning is then performed based on the hybrid of soft-labeled and explicitly labeled images. Experimental results based on a database of 10,000 images demonstrate the effectiveness of the proposed method.

    Temporal textures are textures with motion, like real-world image sequences of sea waves, smoke, etc., that possess some stationary properties over space and time. The motion of a flock of flying birds, water streams, fluttering leaves, and waving flags also serves to illustrate such motion. The characterization of temporal textures is of vital importance to computer vision, electronic entertainment, and content-based video coding research, with a number of potential applications in areas including recognition (automated surveillance and industrial monitoring), synthesis (animation and computer games), and segmentation (robot navigation and MPEG-4). Chapter 12, by Ashfaqur Rahman and Manzur Murshed, provides a comprehensive literature survey of the existing temporal texture characterization techniques.

    The fourth part, Computational intelligence in multimedia networks and task scheduling, contains four chapters that describe several approaches to developing video analysis and segmentation systems based on visual sensor networks using computational intelligence, as well as a discussion about detecting hotspots in cockpits in view of the Swissair 111 and ValuJet 592 flight disasters, and address the question of how distributed sensor networks could help in near real-time event detection, disambiguating faults and events by using artificial intelligence techniques. In addition, it contains a chapter reviewing the current literature on computational intelligence-based approaches to various problems in multimedia networking and communications.

    In Chap. 13, Mitsuo Gen and Myungryun Yoo discuss a task scheduling problem by introducing several scheduling algorithms for soft real-time tasks using a genetic algorithm (GA). They propose reasonable solutions for the NP-hard scheduling problem with much less difficulty than traditional mathematical methods. In addition, continuous task scheduling, real-time task scheduling on homogeneous systems and real-time task scheduling on heterogeneous systems are discussed in this chapter.

    Chapter 14, by Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. Garcia, and Jose M. Molina, describes several approaches to developing video analysis and segmentation systems based on visual sensor networks using computational intelligence. They discuss how computational intelligence paradigms can help obtain competitive solutions. The knowledge about the domain is used in the form of fuzzy rules for data association and heuristic evaluation functions to optimize the design and guide the search for appropriate decisions.

    In Chap. 15, Slawomir T. Wierzchon, Krzysztof Ciesielski, and Mieczyslaw A. Klopotek focus on some problems concerning the application of an immune-based algorithm to the extraction and visualization of cluster structure. The chapter presents a novel approach, based on artificial immune systems, within a broad stream of map-type clustering methods. Such an approach leads to many interesting research issues, such as context-dependent dictionary reduction and keyword identification, topic-sensitive document summarization, subjective model visualization based on a particular user's information requirements, dynamic adaptation of the document representation and local similarity measure computation.

    In Chap. 16, S. Srivathsan, N. Balakrishnan, and S.S. Iyengar discuss some safety issues in commercial planes, particularly focusing on hazards in the cockpit area. The chapter discusses a few methodologies to detect critical features and provides unambiguous information about the possible sources of hazards to the end user in near real time. They explore the application of Bayesian probability, the Iyengar-Krishnamachari method, probabilistic reasoning, reasoning under uncertainty, and the Dempster-Shafer theory, and analyze how these theories could help in the analysis of data gathered from wireless sensor networks deployed in the cockpit area.

    The final part of the book deals with the use of computational intelligence in video processing. It contains three chapters which discuss the use of computational intelligence techniques in video processing.

    In Chap. 17, Nicholas Vretos, Vassilios Solachidis, and Ioannis Pitas provide a uniform framework by which media analysis can be rendered more useful for retrieval applications as well as for human-computer interaction-based applications. All the algorithms presented in this chapter are focused on humans and thus provide interesting features for an anthropocentric analysis of a movie.

    In Chap. 18, Thomas Barecke, Ewa Kijak, Marcin Detyniecki, and Andreas Nurnberger present an innovative way of automatically organizing multimedia information to facilitate content-based browsing. It is based on self-organizing maps. The visualization capabilities of the self-organizing map provide an intuitive way of representing the distribution of data as well as the object similarities. The main idea is to visualize similar documents spatially close to each other, while the distance between different documents is larger. They introduce a novel time bar visualization that re-projects the temporal information.

    In Chap. 19, Mayank Vatsa, Richa Singh, Sanjay K. Singh, and Saurabh Upadhyay present an efficient intelligent video authentication algorithm using a support vector machine. The proposed algorithm can detect multiple video tampering attacks. It computes the local relative correlation information and classifies the video as tampered or nontampered. The proposed algorithm computes the relative correlation information between all the adjacent frames of a video and projects them into a nonlinear SVM hyperplane to determine if the video is tampered or not. The algorithm is validated on an extensive video database containing 795 tampered and nontampered videos. The results show that the proposed algorithm yields a classification accuracy of 99.2%.
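    To make the idea of correlation between adjacent frames concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the authors' method: it uses the plain Pearson correlation coefficient over flattened pixel lists, whereas the chapter's "relative correlation information" may be defined differently, and the SVM classification step is omitted. The names `correlation` and `adjacent_frame_correlations` and the tiny frames are hypothetical.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation of two equal-length pixel lists (0.0 if either is constant)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def adjacent_frame_correlations(frames):
    """Return one correlation score for each pair of consecutive frames."""
    return [correlation(f, g) for f, g in zip(frames, frames[1:])]

# Hypothetical 4-pixel frames; the third frame breaks the visual continuity.
frames = [
    [10, 20, 30, 40],
    [11, 21, 31, 41],
    [40, 30, 20, 10],
    [41, 31, 21, 11],
]
scores = adjacent_frame_correlations(frames)
```

    A sharp drop in a score between two otherwise similar frames is the kind of feature a classifier could be trained on; how the chapter builds and labels such features is not specified in this preface.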


    We are very grateful to the authors of this volume and to the reviewers for their extraordinary service in critically reviewing the chapters. Most of the authors of the chapters included in this book also served as referees for the chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews. The editors thank Dr. Thomas Ditzinger, Springer-Verlag, Germany, for the editorial assistance and excellent collaboration in producing this important scientific work. We hope that the reader will share our excitement in presenting this volume on Computational Intelligence in Multimedia Processing: Recent Advances and will find it useful.

    Aboul-Ella Hassanien
    Janusz Kacprzyk
    Ajith Abraham

  • Contents

    Part I Foundation

    Computational Intelligence in Multimedia Processing: Foundation and Trends
    Aboul-Ella Hassanien, Ajith Abraham, Janusz Kacprzyk, and James F. Peters

    Computational Intelligence in Multimedia Networking and Communications: Trends and Future Directions
    Parthasarathy Guturu

    Part II Computational Intelligence in 3D Multimedia Virtual Environment and Video Games

    A Synthetic 3D Multimedia Environment
    Ronald Genswaider, Helmut Berger, Michael Dittenbach, Andreas Pesenhofer, Dieter Merkl, Andreas Rauber, and Thomas Lidy

    Robotics and Virtual Reality: A Marriage of Two Diverse Streams of Science
    Tauseef Gulrez, Manolya Kavakli, and Alessandro Tognetti

    Modelling Interactive Non-Linear Stories
    Fabio Zambetta

    A Time Interval String Model for Annotating and Searching Linear Continuous Media
    Ken Nakayama, Kazunori Yamaguchi, Theodorus Eric Setiadi, Yoshitake Kobayashi, Mamoru Maekawa, Yoshihisa Nitta, and Akihiko Ohsuga

    Part III Computational Intelligence in Image/Audio Processing

    Noise Filtering of New Motion Capture Markers Using Modified K-Means
    J.C. Barca, G. Rumantir, and R. Li

    Toward Effective Processing of Information Graphics in Multimodal Documents: A Bayesian Network Approach
    Sandra Carberry and Stephanie Elzer

    Fuzzy Audio Similarity Measures Based on Spectrum Histograms and Fluctuation Patterns
    Klaas Bosteels and Etienne E. Kerre

    Fuzzy Techniques for Text Localisation in Images
    Przemyslaw Gorecki, Laura Caponetti, and Ciro Castiello

    Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine
    Kui Wu and Kim-Hui Yap

    Temporal Texture Characterization: A Review
    Ashfaqur Rahman and Manzur Murshed

    Part IV Computational Intelligence in Multimedia Networks and Task Scheduling

    Real Time Tasks Scheduling Using Hybrid Genetic Algorithm
    Mitsuo Gen and Myungryun Yoo

    Computational Intelligence in Visual Sensor Networks: Improving Video Processing Systems
    Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. Garcia, and Jose M. Molina

    Scalability and Evaluation of Contextual Immune Model for Web Mining
    Slawomir T. Wierzchon, Krzysztof Ciesielski, and Mieczyslaw A. Klopotek

    Critical Feature Detection in Cockpits - Application of AI in Sensor Networks
    S. Srivathsan, N. Balakrishnan, and S.S. Iyengar

    Part V Computational Intelligence in Video Processing

    Anthropocentric Semantic Information Extraction from Movies
    Nicholas Vretos, Vassilios Solachidis, and Ioannis Pitas

    Organizing Multimedia Information with Maps
    Thomas Barecke, Ewa Kijak, Marcin Detyniecki, and Andreas Nurnberger

    Video Authentication Using Relative Correlation Information and SVM
    Mayank Vatsa, Richa Singh, Sanjay K. Singh, and Saurabh Upadhyay

    Index

  • Part I


  • Computational Intelligence in Multimedia Processing: Foundation and Trends

    Aboul-Ella Hassanien1,2, Ajith Abraham3, Janusz Kacprzyk4,and James F. Peters5

    1 Information Technology Department, FCI, Cairo University, 5 Ahamed Zewal Street, Orman, Giza, Egypt, a.hassanien@fci-cu.edu.eg

    2 Information System Department, CBA, Kuwait University, Kuwait, abo@cba.edu.kw

    3 Center for Quantifiable Quality of Service in Communication Systems, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, N-7491 Trondheim, Norway, ajith.abraham@ieee.org, abraham.ajith@acm.org

    4 Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland, kacprzyk@ibspan.waw.pl

    5 Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada, jfpeters@ee.umanitoba.ca

    Summary. This chapter presents a broad overview of Computational Intelligence (CI) techniques including Neural Network (NN), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Fuzzy Set (FS), and Rough Sets (RS). In addition, a very brief introduction to near sets and near images, which offer a generalization of traditional rough set theory and a new approach to classifying perceptual objects by means of features in solving multimedia problems, is presented. A review of the current literature on CI-based approaches to various problems in multimedia computing such as speech, audio and image processing, video watermarking, content-based multimedia indexing and retrieval is presented. We discuss some representative methods to provide inspiring examples to illustrate how CI could be applied to resolve multimedia computing problems and how multimedia could be analyzed, processed, and characterized by computational intelligence. Challenges to be addressed and future directions of research are also presented.

    A.-E. Hassanien et al.: Computational Intelligence in Multimedia Processing: Foundation and Trends, Studies in Computational Intelligence (SCI) 96, 3-49 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    1 Introduction

    The last few decades have seen a new era of artificial intelligence focusing on the principles, theoretical aspects, and design methodology of algorithms gleaned from nature. Examples are artificial neural networks inspired by mammalian neural systems, evolutionary computation inspired by natural selection in biology, simulated annealing inspired by thermodynamics principles and swarm intelligence inspired by the collective behavior of insects or micro-organisms, etc., interacting locally with their environment and causing coherent functional global patterns to emerge. Computational intelligence is a well-established paradigm, where new theories with a sound biological understanding have been evolving. The current experimental systems have many of the characteristics of biological computers (brains, in other words) and are beginning to be built to perform a variety of tasks that are difficult or impossible to do with conventional computers. Defining computational intelligence is not an easy task [95]. In a nutshell, as becomes quite apparent in light of current research pursuits, the area is heterogeneous, drawing on such technologies as neural networks, fuzzy systems, rough sets, evolutionary computation, swarm intelligence, probabilistic reasoning [13] and multi-agent systems. The recent trend is to integrate different components to take advantage of complementary features and to develop a synergistic system. Hybrid architectures like neuro-fuzzy systems, evolutionary-fuzzy systems, evolutionary-neural networks, evolutionary neuro-fuzzy systems, rough-neural, rough-fuzzy, etc., are widely applied for real-world problem solving.

    Multimedia is any combination of multiple forms of media integrated together at one time. In modern times, the advent of musical accompaniment to silent films was an early form of multimedia. Even the simplest ancient dance forms use multiple media types, in the form of sound and vision, to convey additional meaning. The currently accepted understanding of multimedia generally involves a variety of media, such as still images, video, sound, music and text, presented using a computer as the storage device, delivery controller and delivery medium. The various media types are usually stored as digital assets, and their delivery to the viewer is facilitated by some sort of authoring language. Multimedia technology is a rapidly developing branch of information technology; it has changed the way computers are used and has brought about a profound revolution. Multimedia techniques will continue to accelerate developments in everyday life. Even nowadays, most media types are designed to be perceived by only two senses, vision and hearing. Still, incredibly powerful messages can be communicated using just these two senses. A subset of multimedia is interactive multimedia. In this definition, the delivery of the assets depends on decisions made by the viewer at the time of viewing. Some subject areas lend themselves to interactivity, such as self-paced learning and game play. Other areas are mostly not enhanced by interactivity: here we find the traditional film and storytelling genres, where we are expected to travel in a prescribed direction to perceive the message in a sequential fashion.

  • Computational Intelligence in Multimedia Processing 5

    Current research in multimedia processing is shifting from coding (MPEG-1, 2, 4) to automatic recognition (MPEG-7). Its research domain covers techniques for object-based representation and coding; segmentation and tracking; pattern detection and recognition; multimodal signal fusion, conversion and synchronization; as well as content-based indexing and subject-based retrieval and browsing.

    Multimedia processing is a very important scientific research domain with a broad range of applications. The development of new insights and applications results from both fundamental scientific research and the development of new technologies. One of these emerging technologies is computational intelligence, a generic term for a specific collection of tools to model uncertainty, imprecision, evolutionary behavior and complex models. This chapter gives a comprehensive view of modern computational intelligence theory in the field of multimedia processing.

    The objective of this chapter is to present to the computational intelligence and multimedia processing research communities the state of the art in computational intelligence applications to multimedia processing, and to motivate research in new trend-setting directions. Hence, in the following sections we review and discuss some representative methods to provide inspiring examples that illustrate how CI techniques can be applied to resolve multimedia problems and how multimedia can be analyzed, processed, and characterized by computational intelligence. These representative examples include (1) computational intelligence for speech, audio, image and video processing, (2) CI in audiovisual recognition systems, (3) computational intelligence in multimedia watermarking, and (4) CI in multimedia content-based indexing and retrieval.

    To provide useful insights for CI applications in multimedia processing, we structure the rest of this chapter into six sections. Section 2 introduces the fundamental aspects of the key components of modern computational intelligence, including neural networks, rough sets, fuzzy sets, the particle swarm optimization algorithm, evolutionary algorithms and near sets. Section 3 reviews some past literature on the use of computational intelligence in speech, audio, and image processing, as well as in speech emotion recognition and audiovisual recognition systems. Section 4 reviews the current literature on computational intelligence based approaches to video processing problems such as video segmentation, as well as the adaptation of the c-means clustering algorithm to rough set theory for solving multimedia segmentation and clustering problems. Section 5 reviews and discusses some successful work to illustrate how CI can be applied to multimedia watermarking problems. Computational intelligence in content-based multimedia indexing and retrieval is reviewed in Sect. 6. Challenges and future trends are addressed in Sect. 7.


    2 Computational Intelligence: Foundations

    In the following subsections, we present an overview of modern computational intelligence techniques and their advantages, including neural networks, fuzzy sets, particle swarm optimization, genetic algorithms, rough sets and near sets.

    2.1 Artificial Neural Networks

    Artificial neural networks have been developed as generalizations of mathematical models of biological nervous systems. In a simplified mathematical model of the neuron, the effects of the synapses are represented by connection weights that modulate the effect of the associated input signals, and the nonlinear characteristic exhibited by neurons is represented by a transfer function. A range of transfer functions has been developed to process the weighted and biased inputs; four basic transfer functions widely adopted for multimedia processing are illustrated in Fig. 1.

    The neuron impulse is then computed as the weighted sum of the input signals, transformed by the transfer function. The learning capability of an artificial neuron is achieved by adjusting the weights in accordance with the chosen learning algorithm. Most applications of neural networks fall into the following categories:

    - Prediction: Use input values to predict some output
    - Classification: Use input values to determine the classification
    - Data association: Like classification, but it also recognizes data that contains errors
    - Data conceptualization: Analyze the inputs so that grouping relationships can be inferred

    Mathematical Modeling and Learning in Neural Networks

    A typical multilayered neural network and an artificial neuron are illustrated in Fig. 2.

    Fig. 1. Basic transfer functions


    Fig. 2. Typical multilayered neural network

    Each neuron is characterized by an activity level (representing the state of polarization of a neuron), an output value (representing the firing rate of the neuron), a set of input connections (representing synapses on the cell and its dendrite), a bias value (representing an internal resting level of the neuron), and a set of output connections (representing a neuron's axonal projections). Each of these aspects of the unit is represented mathematically by real numbers. Thus each connection has an associated weight (synaptic strength), which determines the effect of the incoming input on the activation level of the unit. The weights may be positive or negative. Referring to Fig. 2, the signal flow from inputs $\{x_1, \ldots, x_n\}$ is considered to be unidirectional, as indicated by arrows, as is a neuron's output signal flow ($O$). The neuron output signal $O$ is given by the following relationship:

    $$O = f(\mathrm{net}) = f\left(\sum_{j=1}^{n} w_j x_j\right), \tag{1}$$

    where $w_j$ is the $j$th component of the weight vector $w$ and the function $f(\mathrm{net})$ is referred to as an activation (transfer) function. The variable net is defined as the scalar product of the weight and input vectors

    $$\mathrm{net} = w^T x = w_1 x_1 + \cdots + w_n x_n, \tag{2}$$

    where $T$ denotes the transpose. Typical Gaussian and logistic activation functions are plotted in Fig. 3.
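    As a minimal sketch of Eqs. (1) and (2), the following Python fragment computes the output of a single neuron. The function names and the particular forms of the logistic and Gaussian transfer functions are illustrative assumptions, not definitions taken from the chapter.

```python
import math

def logistic(net):
    """Logistic (sigmoid) transfer function; one common choice for f."""
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net):
    """Gaussian transfer function centred at zero (assumed unit width)."""
    return math.exp(-net * net)

def neuron_output(weights, inputs, transfer=logistic):
    """Return O = f(net), where net = w^T x is the weighted sum of the inputs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))
    return transfer(net)
```

    Passing a different `transfer` argument swaps the activation function without changing how net is computed, which mirrors the separation between Eq. (2) and Eq. (1).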

    Neural Network Architecture

    The behavior of the neural network depends largely on the interaction between the different neurons. The basic architecture consists of three types of neuron layers: input, hidden and output layers. In feed-forward networks, the signal flow is from input to output units, strictly in a feed-forward direction. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers. Recurrent networks contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behavior constitutes the output of the network. There are several other neural network architectures (Elman network, adaptive resonance theory maps, competitive networks, etc.), depending on the properties and requirements of the application. The reader may refer to [2] for an extensive overview of the different neural network architectures and learning algorithms.

    Fig. 3. Typical Gaussian and logistic activation function

    A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to train the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. The learning situations in neural networks may be classified into three distinct sorts: supervised learning, unsupervised learning and reinforcement learning.

    In supervised learning, an input vector is presented at the inputs together with a set of desired responses, one for each node, at the output layer. A forward pass is done, and the errors or discrepancies between the desired and actual response for each node in the output layer are found. These are then used to determine weight changes in the net according to the prevailing learning rule. The term supervised originates from the fact that the desired signals on individual output nodes are provided by an external teacher. The best-known examples of this technique are the backpropagation algorithm, the delta rule and the perceptron rule.

    In unsupervised learning (or self-organization), an output unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.

    Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

    Major Neural Network Architecture and Learning Models

    Via selection of the transfer function and the connections between neurons, various neural networks can be constructed and trained to produce the specified outputs. The major neural networks commonly used for multimedia applications are classified as feed-forward neural networks, feedback (recurrent) networks, self-organizing maps and Adaptive Resonance Theory (ART) networks. The learning paradigms for neural networks in multimedia processing generally include supervised and unsupervised learning. In supervised training, the training data set consists of many pairs of source and target patterns. The network processes the source inputs, compares the resulting outputs against the target outputs, and adjusts its weights to improve the rate of correct outputs. In unsupervised networks, the training data set does not include any target information.

    Feed-Forward Neural Network

    A general feed-forward network often consists of multiple layers, typically including one input layer, a number of hidden layers, and an output layer. In feed-forward neural networks, the neurons in each layer are fully interconnected only with the neurons in the next layer, which means that the signals or information being processed travel in a single direction. The back-propagation (BP) network is a supervised feed-forward neural network trained by a simple stochastic gradient descent method that minimizes the total squared error of the output computed by the network. Its errors propagate backwards from the output neurons to the inner neurons. The processes of adjusting the set of weights between the layers and recalculating the output continue until a stopping criterion is satisfied. The radial basis function (RBF) network is a three-layer supervised feed-forward network that uses a nonlinear transfer function (normally the Gaussian) for the hidden neurons and a linear transfer function for the output neurons. The Gaussian function is usually applied to the net input to produce a radial function of the distance between each pattern vector and each hidden unit weight vector.
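    The forward pass of an RBF network as described above can be sketched as follows. This is only an illustration under stated assumptions: a single output neuron, one shared Gaussian width, and hypothetical parameter names (`centers`, `width`, `out_weights`, `out_bias`); training of the centers and weights is not shown.

```python
import math

def rbf_forward(x, centers, width, out_weights, out_bias=0.0):
    """Forward pass of a three-layer RBF network with one linear output neuron.

    Each hidden unit applies a Gaussian to the distance between the pattern
    vector x and that unit's weight (centre) vector.
    """
    hidden = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        hidden.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    # Linear transfer function at the output: weighted sum of hidden activations.
    return out_bias + sum(w * h for w, h in zip(out_weights, hidden))
```

    A hidden unit responds most strongly (activation 1) when the input coincides with its centre, and its response decays with distance, which is what makes the function radial.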

    Recurrent Networks

    Recurrent networks are the state-of-the-art in nonlinear time series prediction, system identification, and temporal pattern classification. As the output of the network at time t is used along with a new input to compute the output of the network at time t + 1, the response of the network is dynamic. Time-Lag Recurrent Networks (TLRN) are multi-layered perceptrons extended with short-term memory structures that have local recurrent connections. The TLRN is a very appropriate model for processing temporal (time-varying) information. Examples of temporal problems include time series prediction, system identification and temporal pattern recognition. The training algorithm used with TLRNs (backpropagation through time) is more advanced than the standard backpropagation algorithm. The main advantage of TLRNs is the smaller network size required to learn temporal problems when compared to MLPs that use extra inputs to represent the past samples (equivalent to time delay neural networks). An added advantage of TLRNs is their low sensitivity to noise.

    Self Organizing Feature Maps

    Self Organizing Feature Maps (SOFM) are a data visualization technique proposed by Kohonen [3], which reduces the dimensions of data through the use of self-organizing neural networks. A SOFM learns the categorization, topology and distribution of input vectors. SOFMs allocate more neurons to recognize parts of the input space where many input vectors occur and allocate fewer neurons to parts of the input space where few input vectors occur. Neurons next to each other in the network learn to respond to similar vectors. SOFMs can learn to detect regularities and correlations in their input and adapt their future responses to that input accordingly. An important feature of the SOFM learning algorithm is that it allows neurons that are neighbors of the winning neuron to output values. Thus the transition of output vectors is much smoother than that obtained with competitive layers, where only one neuron has an output at a time. The problem that data visualization attempts to solve is that humans simply cannot visualize high-dimensional data. SOFMs reduce dimensions by producing a map of usually 1 or 2 dimensions, which plots the similarities of the data by grouping similar data items together (data clustering). In this process, SOFMs accomplish two things: they reduce dimensions and display similarities. It is important to note that while a self-organizing map does not take long to organize itself so that neighboring neurons recognize similar inputs, it can take a long time for the map to finally arrange itself according to the distribution of input vectors.
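    To make the role of the winning neuron and its neighbors concrete, here is a hedged sketch of one training step for a 1-dimensional map. The Gaussian neighborhood function and the parameters `learning_rate` and `radius` are common choices assumed for illustration; the chapter does not prescribe a specific update rule.

```python
import math

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to the input x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))

def som_step(weights, x, learning_rate=0.5, radius=1.0):
    """Update a 1-D map in place for one input vector; return the winner index."""
    winner = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        # Neighbourhood strength decays with grid distance from the winner,
        # so neighbours of the winner also move towards x, only less strongly.
        h = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
        for k in range(len(w)):
            w[k] += learning_rate * h * (x[k] - w[k])
    return winner
```

    In practice both `learning_rate` and `radius` are usually decreased over time, which is one reason the final arrangement of the map can take long to settle.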

    Adaptive Resonance Theory

    Adaptive Resonance Theory (ART) was initially introduced by Grossberg [5] as a theory of human information processing. ART neural networks are extensively used for supervised and unsupervised classification tasks and function approximation. There are many different variations of ART networks available today [4]. For example, ART1 performs unsupervised learning for binary input patterns, ART2 is modified to handle both analog and binary input patterns, and ART3 performs parallel searches of distributed recognition codes in a multilevel network hierarchy. ARTMAP combines two ART modules to perform supervised learning, while fuzzy ARTMAP represents a synthesis of elements from neural networks, expert systems, and fuzzy logic.

    2.2 Rough Sets

    Rough set theory [75–77, 87] is a fairly new intelligent technique for managing uncertainty that has been applied to the medical domain. It is used to discover data dependencies, evaluate the importance of attributes, discover patterns in data, reduce redundant objects and attributes, seek the minimum subset of attributes, and recognize and classify objects in medical imaging. Moreover, it is being used for the extraction of rules from databases. Rough sets have proven useful for the representation of vague regions in spatial data. One advantage of rough sets is the creation of readable if–then rules. Such rules have the potential to reveal new patterns in the data material; furthermore, they collectively function as a classifier for unseen data sets. Unlike other computational intelligence techniques, rough set analysis requires no external parameters and uses only the information presented in the given data. One of the nice features of rough set theory is that it can tell whether the data is complete or not based on the data itself. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, rough sets can determine whether there is redundant information in the data and find the minimum data needed for the classification model. This property of rough sets is very important for applications where domain knowledge is very limited or data collection is very expensive or laborious, because it ensures that the data collected is just sufficient to build a good classification model without sacrificing the accuracy of the model or wasting time and effort gathering extra information about the objects [75–77, 87].


    In rough set theory, the data is collected in a table, called a decision table. Rows of the decision table correspond to objects, and columns correspond to attributes. In the data set, we assume that we are given a set of examples, each with a class label indicating the class to which the example belongs. We call the class label the decision attribute and the rest of the attributes the condition attributes. Rough set theory defines three regions based on the equivalence classes induced by the attribute values: lower approximation, upper approximation and boundary. The lower approximation contains all the objects that can be classified with certainty based on the data collected, the upper approximation contains all the objects that can possibly be classified, and the boundary is the difference between the upper approximation and the lower approximation. So, we can define a rough set as any set defined through its lower and upper approximations. The notion of indiscernibility is fundamental to rough set theory. Informally, two objects in a decision table are indiscernible if one cannot distinguish between them on the basis of a given set of attributes. Hence, indiscernibility is a function of the set of attributes under consideration. For each set of attributes we can thus define a binary indiscernibility relation, which is a collection of pairs of objects that are indiscernible from each other. An indiscernibility relation partitions the set of cases or objects into a number of equivalence classes. An equivalence class of a particular object is simply the collection of objects that are indiscernible from the object in question. Here we provide an explanation of the basic framework of rough set theory, along with some of the key definitions. A review of this basic material can be found in sources such as [74–77, 87] and many others.
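    The lower approximation, upper approximation and boundary described above can be computed directly from the equivalence classes of the indiscernibility relation. The sketch below assumes a decision table stored as a Python dictionary; the table contents, attribute names and function names are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict

def equivalence_classes(table, attributes):
    """Partition object ids by their values on the chosen condition attributes."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())

def approximations(table, attributes, target):
    """Return (lower, upper, boundary) approximations of the object set `target`."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attributes):
        if cls <= target:   # whole class inside the target: certainly classified
            lower |= cls
        if cls & target:    # class overlaps the target: possibly classified
            upper |= cls
    return lower, upper, upper - lower

# Hypothetical toy decision table: "lesion" is the decision attribute.
table = {
    "o1": {"texture": "smooth", "bright": "high", "lesion": "no"},
    "o2": {"texture": "smooth", "bright": "high", "lesion": "yes"},
    "o3": {"texture": "rough", "bright": "low", "lesion": "yes"},
    "o4": {"texture": "rough", "bright": "high", "lesion": "no"},
}
lesion = {o for o, row in table.items() if row["lesion"] == "yes"}
lower, upper, boundary = approximations(table, ["texture", "bright"], lesion)
```

    Here objects o1 and o2 are indiscernible on the condition attributes but carry different labels, so they end up in the boundary region, while o3 is the only object in the lower approximation.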

    2.3 Near Sets: Generalization of the Rough Set in Multimedia Processing

    Near sets [67, 78–81, 83] offer a generalization of traditional rough set theory [84–88] and a new approach to classifying perceptual objects by means of features [89–94]. The near set approach can be used to classify images that are qualitatively but not necessarily quantitatively close to each other. This is essentially the idea expressed in classifying images in [67, 81]. If one adopts the near set approach in image processing, a byproduct of the approach is the separation of images into non-overlapping sets of images that are similar (descriptively near) to each other. This has recently led to an application of the near set approach in 2D and 3D interactive gaming with a vision system that learns and serves as the backbone for an adaptive telerehabilitation system for patients with finger, hand, arm and balance disabilities (see, e.g., [100, 101]). Each remote node in the telerehabilitation system includes a vision system that learns to track the behavior of a patient. Images deemed to be interesting (e.g., images representing erratic behavior) are stored as well as forwarded to a rehabilitation center for follow-up. In such a system, there is a need to identify images that are in some sense near images representing some standard or norm. This research has led to a study of methods of automating image segmentation as a first step in near set-based image processing. This section is limited to a very brief introduction to near sets and near images useful in image pattern recognition.

    Object Description

    Perceptual objects that have the same appearance, i.e., objects with matching descriptions, are considered qualitatively near each other. A description is a tuple of values of functions representing features of an object [79]. For simplicity, assume the description of an object consists of one function value. For example, let $w \subseteq I$, $w' \subseteq I'$ be $n \times m$ pixel windows contained in two images $I, I'$ and let $\phi(w)$ = information content of pixel window $w$, where information content is a feature of a pixel window and $\phi$ is a sample function representing information content defined in the usual way [99]. Then pixel window $w$ is near pixel window $w'$ if $\phi(w) = \phi(w')$.

    Near Objects

    Objects are known by their descriptions. An object description is defined by means of a tuple of function values $\phi(x)$ associated with an object $x \in X$. Assume that $B \subseteq F$ is a given set of functions representing features of sample objects $X \subseteq O$. Let $\phi_i \in B$, where $\phi_i : O \to \Re$. In combination, the functions representing object features provide a basis for an object description $\phi : O \to \Re^L$, a vector containing measurements (returned values) associated with each functional value $\phi_i(x)$ in (3), where the description length $|\phi| = L$.

    Object Description:

    $$\phi(x) = (\phi_1(x), \phi_2(x), \ldots, \phi_i(x), \ldots, \phi_L(x)). \tag{3}$$

    The intuition underlying a description $\phi(x)$ is a recording of measurements from sensors, where each sensor is modeled by a function $\phi_i$. Then let $\Delta\phi_i$ denote

    $$\Delta\phi_i = \phi_i(x') - \phi_i(x),$$

    where $x, x' \in O$. The difference $\Delta\phi$ leads to a definition of the indiscernibility relation $\sim_B$ introduced by Pawlak [86] (see Definition 1).

    Definition 1. Indiscernibility Relation. Let $x, x' \in O$, $B \subseteq F$.

    $$\sim_B = \{(x, x') \in O \times O \mid \forall \phi_i \in B,\ \Delta\phi_i = 0\},$$

    where $i \le |\phi|$ (description length).

    Near Sets

    The basic idea in the near set approach to object recognition is to compare object descriptions. Sets of objects $X, X'$ are considered near each other if the sets contain objects with at least partially matching descriptions.


    Definition 2. Near Sets. Let $X, X' \subseteq O$, $B \subseteq F$. Set $X$ is near $X'$ if, and only if, there exist $x \in X$, $x' \in X'$, $\phi_i \in B$ such that $x \sim_{\{\phi_i\}} x'$.

    For example, assume a pair of images $I, I'$, where a pixel window in image $I$ has a description that matches the description of a pixel window in image $I'$. The objects in this case are pixel windows. By definition, $I, I'$ are near sets and, from an image classification perspective, $I, I'$ are near images. Object recognition problems, especially in images [67], and the problem of the nearness of objects have motivated the introduction of near sets (see, e.g., [81, 83]).
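    Definitions 1 and 2 can be illustrated with a short sketch in which objects are pixel windows and each feature is a probe function. The two probe functions used here (mean and maximum gray level) are hypothetical stand-ins chosen for simplicity; the chapter's own example uses information content as the feature.

```python
def indiscernible(x, y, probes):
    """Definition 1: x and y match on every probe function in `probes`."""
    return all(phi(x) == phi(y) for phi in probes)

def are_near(X, Y, probes):
    """Definition 2: some x in X and y in Y match on at least one probe function."""
    return any(phi(x) == phi(y) for x in X for y in Y for phi in probes)

# Hypothetical probe functions on pixel windows given as tuples of gray levels.
def mean_gray(window):
    return sum(window) / len(window)

def max_gray(window):
    return max(window)

w1 = (10, 20, 30, 40)   # mean 25, max 40
w2 = (25, 25, 25, 25)   # mean 25, max 25
w3 = (0, 0, 0, 1)       # mean 0.25, max 1
probes = [mean_gray, max_gray]
```

    With these values, the windows w1 and w2 are not indiscernible with respect to both probes, yet the sets containing them are near because they match on the mean gray level, which is the sense of "at least partially matching descriptions".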

    Near Images

    In the context of image processing, the relation $\sim_B$ in Definition 1 is important because it suggests a way to classify images by a number of straightforward steps: (1) identify an image object, e.g., a pixel window, (2) select a set $B$ containing functions representing features of an image object such as a pixel window, (3) partition each image using $\sim_B$ and then compare a representative object from a class in each partition. In the case where one discovers that the objects in the selected classes have matching descriptions, the images are near each other at the class level. In effect, if near images are discovered, this means a pair of sample images has been effectively classified. This is important because it leads to an effective image segmentation method.

    2.4 Fuzzy Sets

    Zadeh [115] introduced the concept of fuzzy logic to represent vagueness in linguistics, and further to implement and express human knowledge and inference capability in a natural way. Fuzzy logic starts with the concept of a fuzzy set. A fuzzy set is a set without a crisp, clearly defined boundary. It can contain elements with only a partial degree of membership. A Membership Function (MF) is a curve that defines how each point in the input space is mapped to a membership value (or degree of membership) between 0 and 1. The input space is sometimes referred to as the universe of discourse. Let $X$ be the universe of discourse and $x$ be a generic element of $X$. A classical set $A$ is defined as a collection of elements or objects $x \in X$, such that each $x$ can either belong to or not belong to the set $A$, $A \subseteq X$. By defining a characteristic function (or membership function) on each element $x$ in $X$, a classical set $A$ can be represented by a set of ordered pairs $(x, 0)$ or $(x, 1)$, where 1 indicates membership and 0 non-membership. Unlike the conventional set mentioned above, a fuzzy set expresses the degree to which an element belongs to a set. Hence the characteristic function of a fuzzy set is allowed to have values between 0 and 1, denoting the degree of membership of an element in a given set. If $X$ is a collection of objects denoted generically by $x$, then a fuzzy set $A$ in $X$ is defined as a set of ordered pairs:

    $$A = \{(x, \mu_A(x)) \mid x \in X\}, \tag{4}$$

    where $\mu_A(x)$ is called the membership function of linguistic variable $x$ in $A$, which maps $X$ to the membership space $M$, $M = [0, 1]$. When $M$ contains only the two points 0 and 1, $A$ is crisp and $\mu_A(x)$ is identical to the characteristic function of a crisp set. Triangular and trapezoidal membership functions are the simplest membership functions, formed using straight lines. Some of the other shapes are Gaussian, generalized bell, sigmoidal and polynomial-based curves.

    Fig. 4. Shapes of two commonly used MFs
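    The two straight-line membership functions mentioned above can be written down in a few lines. The parameterization by breakpoints (`a`, `b`, `c`, `d`) is a common convention assumed here for illustration rather than notation taken from the chapter.

```python
def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (assumes a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and plateau on [b, c] (a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)
```

    Both functions return a membership degree in [0, 1], so each can play the role of $\mu_A$ in Eq. (4) for a fuzzy set over the real line.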

    Figure 4 illustrates the shapes of two commonly used MFs. The most important thing to realize about fuzzy logical reasoning is the fact that it is a superset of standard Boolean logic.

    Fuzzy Logic Operators

    It is interesting to note the correspondence between two-valued and multi-valued logic operations for AND, OR, and NOT. It is possible to resolve the statement A AND B, where A and B are limited to the range (0, 1), by using the operator minimum(A, B). Using the same reasoning, we can replace the OR operation with the maximum operator, so that A OR B becomes equivalent to maximum(A, B). Finally, the operation NOT A becomes equivalent to the operation 1 − A. In fuzzy logic terms these are popularly known as fuzzy intersection or conjunction (AND), fuzzy union or disjunction (OR), and fuzzy complement (NOT). The intersection of two fuzzy sets $A$ and $B$ is specified in general by a binary mapping $T$, which aggregates two membership functions as follows.

    $$\mu_{A \cap B}(x) = T(\mu_A(x), \mu_B(x)) \tag{5}$$

    The fuzzy intersection operator is usually referred to as the T-norm (triangular norm) operator. The fuzzy union operator is specified in general by a binary mapping $S$.

    $$\mu_{A \cup B}(x) = S(\mu_A(x), \mu_B(x)) \tag{6}$$

    This class of fuzzy union operators is often referred to as T-conorm (or S-norm) operators.
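    A brief sketch of these operators follows. The minimum, maximum and complement operators are the ones named in the text; the product T-norm and probabilistic sum are included only as further well-known examples of the general mappings $T$ and $S$ and are not discussed in the chapter.

```python
def fuzzy_and(a, b):
    """Fuzzy intersection using the minimum T-norm."""
    return min(a, b)

def fuzzy_or(a, b):
    """Fuzzy union using the maximum T-conorm."""
    return max(a, b)

def fuzzy_not(a):
    """Fuzzy complement."""
    return 1.0 - a

def product_tnorm(a, b):
    """An alternative T-norm: the algebraic product."""
    return a * b

def probabilistic_sum(a, b):
    """The T-conorm dual to the product T-norm."""
    return a + b - a * b
```

    When the membership degrees are restricted to 0 and 1, all of these reduce to the ordinary Boolean AND, OR and NOT, which illustrates the sense in which fuzzy logic contains Boolean logic as a special case.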


    If–Then Rules and Fuzzy Inference Systems

    The fuzzy rule base is characterized in the form of if–then rules in which preconditions and consequents involve linguistic variables. The collection of these fuzzy rules forms the rule base for the fuzzy logic system. Due to their concise form, fuzzy if–then rules are often employed to capture the imprecise modes of reasoning that play an essential role in the human ability to make decisions in an environment of uncertainty and imprecision. A single fuzzy if–then rule assumes the form:

    if x is A then y is B,

    where A and B are linguistic values defined by fuzzy sets on the ranges (universes of discourse) X and Y, respectively. The if-part of the rule, "x is A", is called the antecedent (pre-condition) or premise, while the then-part of the rule, "y is B", is called the consequent or conclusion. Interpreting an if–then rule involves evaluating the antecedent (fuzzification of the input and applying any necessary fuzzy operators) and then applying that result to the consequent (known as implication). For rules with multiple antecedents, all parts of the antecedent are calculated simultaneously and resolved to a single value using the logical operators. Similarly, for rules with multiple consequents, all the consequents are affected equally by the result of the antecedent. The consequent specifies a fuzzy set to be assigned to the output. The implication function then modifies that fuzzy set to the degree specified by the antecedent. For multiple rules, the output of each rule is a fuzzy set. The output fuzzy sets for each rule are then aggregated into a single output fuzzy set. Finally, the resulting set is defuzzified, or resolved to a single number. The defuzzification interface is a mapping from a space of fuzzy actions defined over an output universe of discourse into a space of non-fuzzy actions, because the output from the inference engine is usually a fuzzy set, while for most practical applications crisp values are required. The three commonly applied defuzzification techniques are the max-criterion, center-of-gravity and mean-of-maxima methods. The max-criterion is the simplest of these three to implement. It produces the point at which the possibility distribution of the action reaches a maximum value. The reader may refer to [7] for more information related to fuzzy systems. It is typically advantageous if the fuzzy rule base is adaptive to a certain application. The fuzzy rule base is usually constructed manually or by automatic adaptation through learning techniques using evolutionary algorithms and/or neural network learning methods [6].
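    The implication, aggregation and defuzzification steps described above can be sketched on a sampled output universe. The sketch assumes clipping (minimum) as the implication function and maximum as the aggregation operator, and the max-criterion variant shown returns the first maximizing point; these are illustrative choices, and the sample values and names (`ys`, `low`, `high`) are hypothetical.

```python
def clip(mus, strength):
    """Implication by clipping: limit a consequent fuzzy set to the rule strength."""
    return [min(m, strength) for m in mus]

def aggregate(*fuzzy_sets):
    """Aggregate the output fuzzy sets of several rules with the maximum operator."""
    return [max(values) for values in zip(*fuzzy_sets)]

def center_of_gravity(ys, mus):
    """Membership-weighted average of the sampled output points."""
    total = sum(mus)
    if total == 0:
        raise ValueError("cannot defuzzify an empty fuzzy set")
    return sum(y * m for y, m in zip(ys, mus)) / total

def mean_of_maxima(ys, mus):
    """Average of all points at which the membership reaches its maximum."""
    peak = max(mus)
    maxima = [y for y, m in zip(ys, mus) if m == peak]
    return sum(maxima) / len(maxima)

def max_criterion(ys, mus):
    """First point at which the membership reaches its maximum."""
    return ys[mus.index(max(mus))]

# Hypothetical sampled output universe and two consequent fuzzy sets.
ys = [0, 1, 2, 3, 4]
low = [1.0, 0.5, 0.0, 0.0, 0.0]
high = [0.0, 0.0, 0.0, 0.5, 1.0]
# Two rules fire with strengths 0.5 and 1.0, respectively.
output = aggregate(clip(low, 0.5), clip(high, 1.0))
crisp = center_of_gravity(ys, output)
```

    The three defuzzifiers can disagree on the same aggregated set: center-of-gravity takes the whole shape into account, while the other two look only at where the membership peaks.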

    Fuzzy Image Processing

    The adoption of the fuzzy paradigm is desirable in image processing becauseof the uncertainty and imprecision present in images, due to noise, image sam-pling, lightning variations and so on. Fuzzy theory provides a mathematical

  • Computational Intelligence in Multimedia Processing 17

    tool to deal with the imprecision and ambiguity in an elegant and ecientway. Fuzzy techniques can be applied to dierent phases of the segmentationprocess; additionally, fuzzy logic allows to represent the knowledge about thegiven problem in terms of linguistic rules with meaningful variables, whichis the most natural way to express and interpret information. Fuzzy imageprocessing [10, 68, 73, 102, 112] is the collection of all approaches that under-stand, represent and process the images, their segments and features as fuzzysets. An image I of size MN and L gray levels can be considered as an arrayof fuzzy singletons, each having a value of membership denoting its degree ofbrightness relative to some brightness levels. For an image I, we can write inthe notation of fuzzy sets:

    $$I = \bigcup_{m=1}^{M} \bigcup_{n=1}^{N} \frac{\mu_{mn}}{g_{mn}}, \qquad (7)$$

    where $g_{mn}$ is the intensity of the $(m,n)$th pixel and $\mu_{mn}$ its membership value. The membership function characterizes a suitable property of the image (e.g., edginess, darkness, textural property) and can be defined globally for the whole image or locally for its segments. In recent years, some researchers have applied the concept of fuzziness to develop new algorithms for image processing tasks such as image enhancement and segmentation. A fuzzy image processing system is a rule-based system that uses fuzzy logic to reason about image data. Its basic structure consists of four main components, as depicted in Fig. 5.

    Fig. 5. Fuzzy image processing system [10]


    (1) The coding of image data (fuzzifier), which translates the gray-level plane to the membership plane;

    (2) An inference engine, which applies a fuzzy reasoning mechanism to obtain a fuzzy output;

    (3) The decoding of the result (defuzzifier), which translates this latter output back into a gray-level plane; and

    (4) A knowledge base, which contains both an ensemble of fuzzy rules, known as the rule base, and an ensemble of membership functions, known as the database.

    The decision-making process is performed by the inference engine using the rules contained in the rule base. These fuzzy rules define the connection between input and output fuzzy variables. The inference engine evaluates all the rules in the rule base and combines the weighted consequents of all relevant rules into a single output fuzzy set.
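    The path from gray-level plane to membership plane and back can be sketched as follows. This is only an illustration under stated assumptions: brightness membership is taken as the normalised gray level, L = 256 is assumed, and the rule-based inference engine is replaced by a simple contrast intensification operator (`intensify`, a stand-in chosen for brevity, not the rule base described above).

```python
L = 256  # assumed number of gray levels

def fuzzify(image):
    """Gray-level plane -> membership plane (degree of brightness in [0, 1])."""
    return [[g / (L - 1) for g in row] for row in image]

def intensify(mu):
    """Stand-in for the inference step: push memberships away from 0.5."""
    return 2 * mu * mu if mu <= 0.5 else 1 - 2 * (1 - mu) ** 2

def defuzzify(plane):
    """Membership plane -> gray-level plane."""
    return [[round(mu * (L - 1)) for mu in row] for row in plane]

def enhance(image):
    """Fuzzify, modify the memberships, and map back to gray levels."""
    plane = fuzzify(image)
    plane = [[intensify(mu) for mu in row] for row in plane]
    return defuzzify(plane)
```

    Calling `enhance` on a small image such as `[[64, 192]]` makes dark pixels darker and bright pixels brighter, while pure black and pure white are left unchanged.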

    2.5 Evolutionary Algorithms

    Evolutionary algorithms (EA) are adaptive methods, which may be used to solve search and optimization problems, based on the genetic processes of biological organisms. Over many generations, natural populations evolve according to the principles of natural selection and survival of the fittest, first clearly stated by Charles Darwin in The Origin of Species. By mimicking this process, evolutionary algorithms are able to evolve solutions to real-world problems, if they have been suitably encoded [12]. Usually grouped under the term evolutionary algorithms or evolutionary computation, we find the domains of genetic algorithms [15, 16], evolution strategies [21], evolutionary programming [11], genetic programming [18] and learning classifier systems. They all share a common conceptual base of simulating the evolution of individual structures via processes of selection, mutation, and reproduction. The processes depend on the perceived performance of the individual structures as defined by the environment (problem).

    EAs deal with parameters of finite length, which are coded using a finite alphabet, rather than directly manipulating the parameters themselves. This means that the search is constrained neither by the continuity of the function under investigation nor by the existence of a derivative function.

    Figure 6 depicts the functional block diagram of a Genetic Algorithm (GA); its various aspects are discussed below. It is assumed that a potential solution to a problem may be represented as a set of parameters. These parameters (known as genes) are joined together to form a string of values (known as a chromosome). A gene (also referred to as a feature, character or detector) refers to a specific attribute that is encoded in the chromosome. The particular values a gene can take are called its alleles. The position of the gene in the chromosome is its locus. Encoding issues deal with representing a solution in a chromosome and, unfortunately, no one technique works best for all problems.


    Fig. 6. The functional block diagram of a genetic algorithm

    A fitness function must be devised for each problem to be solved. Given a particular chromosome, the fitness function returns a single numerical fitness, or figure of merit, which reflects the ability of the individual that the chromosome represents. Reproduction is the second critical attribute of GAs, where two individuals selected from the population are allowed to mate to produce offspring, which will comprise the next generation. Having selected two parents, their chromosomes are recombined, typically using the mechanisms of crossover and mutation.

    There are many ways in which crossover can be implemented. In a single-point crossover, two chromosome strings are cut at some randomly chosen position to produce two head segments and two tail segments. The tail segments are then swapped over to produce two new full-length chromosomes. Crossover is not usually applied to all pairs of individuals selected for mating. Another genetic operation is mutation, which is an asexual operation that operates on only one individual. It randomly alters each gene with a small probability. The traditional view is that crossover is the more important of the two techniques for rapidly exploring a search space. Mutation provides a small amount of random search, and helps ensure that no point in the search space has a zero probability of being examined. If the GA has been correctly implemented, the population will evolve over successive generations so that the fitness of the best and the average individual in each generation increases towards the global optimum.

    Selection is the survival of the fittest within GAs. It determines which individuals are to survive to the next generation. The selection phase consists of three parts. The first part involves determination of the individual's fitness by the fitness function. A fitness function must be devised for each problem; given a particular chromosome, the fitness function returns a single numerical fitness value, which is proportional to the ability, or utility, of the individual represented by that chromosome. For many problems, deciding upon the fitness function is very straightforward; for example, for a function optimization search, the fitness is simply the value of the function. Ideally, the fitness function should be smooth and regular, so that chromosomes with reasonable fitness are close in the search space to chromosomes with slightly better fitness. However, it is not always possible to construct such ideal fitness functions. The second part involves converting the fitness value into an expected value, followed by the last part, where the expected value is converted to a discrete number of offspring. Some of the commonly used selection techniques are roulette wheel and stochastic universal sampling.

    Genetic programming applies the GA concept to the generation of computer programs. Evolutionary programming uses mutations to evolve populations. Evolution strategies incorporate many features of the GA but use real-valued parameters in place of binary-valued parameters. Learning classifier systems use GAs in machine learning to evolve populations of condition/action rules.
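    The operators discussed in this section can be put together in a minimal sketch of a binary-coded GA. The toy problem (maximising the number of 1-alleles), all parameter values, the function names and the use of elitism are assumptions made for illustration; they are not taken from the text.

```python
import random

def fitness(chromosome):
    """Figure of merit for the toy problem: the number of 1-alleles."""
    return sum(chromosome)

def roulette_select(population, rng):
    """Fitness-proportionate (roulette wheel) selection of one parent."""
    weights = [fitness(c) + 1e-9 for c in population]  # avoid an all-zero wheel
    return rng.choices(population, weights=weights, k=1)[0]

def single_point_crossover(a, b, rng):
    """Cut both parents at one random locus and swap the tail segments."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chromosome, rate, rng):
    """Flip each gene independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def evolve(n_genes=20, pop_size=30, generations=60,
           crossover_rate=0.7, mutation_rate=0.01, seed=0):
    """Run the GA and return the fittest chromosome of the last generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)
        offspring = [best[:]]  # elitism: an added assumption, not from the text
        while len(offspring) < pop_size:
            p1 = roulette_select(population, rng)
            p2 = roulette_select(population, rng)
            if rng.random() < crossover_rate:  # crossover is not always applied
                c1, c2 = single_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            offspring.append(mutate(c1, mutation_rate, rng))
            offspring.append(mutate(c2, mutation_rate, rng))
        population = offspring[:pop_size]
    return max(population, key=fitness)
```

    Roulette wheel selection is used here because it is the simpler of the two named techniques; stochastic universal sampling would replace the repeated independent spins with a single spin of equally spaced pointers.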

    2.6 Intelligent Paradigms: Probabilistic Computing and Swarm Intelligence

    Probabilistic models can be viewed as similar to a game, in which actions are based on expected outcomes. The center of interest moves from deterministic to probabilistic models, using statistical estimations and predictions. In the probabilistic modeling process, risk means uncertainty for which the probability distribution is known. Risk assessment therefore means a study to determine the outcomes of decisions along with their probabilities. Decision-makers often face a severe lack of information. Probability assessment quantifies the information gap between what is known and what needs to be known for an optimal decision. Probabilistic models are used for protection against adverse uncertainty and for exploitation of propitious uncertainty.

    Swarm intelligence is concerned with the collective behavior of intelligent agents in decentralized systems. Although there is typically no centralized control dictating the behavior of the agents, local interactions among the agents often cause a global pattern to emerge. Most of the basic ideas are derived from real swarms in nature, which include ant colonies, bird flocks, honeybees, bacteria and other microorganisms. Swarm-based methods such as Ant Colony Optimization (ACO) have already been applied successfully to solve several engineering optimization problems. Swarm models are population-based, and the population is initialised with a set of potential solutions. These individuals are then manipulated (optimised) over many iterations using several heuristics inspired by the social behavior of insects, in an effort to find the optimal solution. Ant colony algorithms are inspired by the behavior of natural ant colonies, in the sense that they solve their problems by multi-agent cooperation using indirect communication through modifications of the environment. Ants release a certain amount of pheromone while walking, and each ant prefers (probabilistically) to follow a direction that is rich in pheromone. This simple behavior explains why ants are able to adjust to changes in the environment, such as finding the shortest path to a food source or a nest. In ACO, ants use information collected during past simulations to direct their search, and this information is made available and modified through the environment. Recently, ACO algorithms have also been used for clustering data sets.
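    The pheromone mechanism can be illustrated with a toy simulation of ants choosing between two paths to a food source. This is a sketch under assumptions that are not in the text: the deposit is taken to be inversely proportional to path length, pheromone evaporates at a fixed rate per iteration, and all parameter values are hypothetical.

```python
import random

def simulate_two_paths(lengths=(1.0, 2.0), n_ants=100, n_iterations=100,
                       evaporation=0.1, seed=0):
    """Return the final share of pheromone on each of two alternative paths."""
    rng = random.Random(seed)
    pheromone = [1.0, 1.0]  # both paths start out equally attractive
    for _ in range(n_iterations):
        deposits = [0.0, 0.0]
        for _ in range(n_ants):
            total = pheromone[0] + pheromone[1]
            # Probabilistic preference for the pheromone-rich direction.
            path = 0 if rng.random() < pheromone[0] / total else 1
            # Assumed deposit rule: shorter paths receive more pheromone.
            deposits[path] += 1.0 / lengths[path]
        # Evaporation followed by reinforcement (indirect communication).
        pheromone = [(1 - evaporation) * p + d
                     for p, d in zip(pheromone, deposits)]
    total = sum(pheromone)
    return [p / total for p in pheromone]
```

    No ant knows the path lengths; the preference for the shorter path emerges only from the positive feedback between deposits and choices.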


    3 Computational Intelligence on Speech, Audio and Image Processing

    Computational intelligence techniques have been used for processing speech, audio and images for several years [59, 64, 98]. Applications in speech processing where computational intelligence is extensively used include speech recognition, speaker recognition, speech enhancement, speech coding and speech synthesis; in audio processing, computational intelligence is used for speech/music classification, audio classification and audio indexing and retrieval; in image processing, applications include image enhancement, segmentation, classification, registration, motion detection, etc. For example, Vladimir et al. [17] proposed a fuzzy logic recursive scheme for motion detection and spatiotemporal filtering that can deal with Gaussian noise and unsteady illumination conditions in both the temporal and spatial directions. The focus is on applications concerning tracking and de-noising of image sequences. An input noisy sequence is processed with fuzzy logic motion detection to determine the degree of motion confidence. The proposed motion detector combines the memberships of the temporal intensity changes, appropriately using fuzzy rules, where the membership degree of motion for each pixel in a 2D sliding window is determined by a proposed membership function. Both the fuzzy membership function and the fuzzy rules are defined in such a way that the performance of the motion detector is optimized in terms of its robustness to noise and unsteady lighting conditions. Tracking and recursive adaptive temporal filtering are performed simultaneously, where the amount of filtering is inversely proportional to the confidence in the existence of motion. Finally, the temporally filtered frames are further processed by a proposed spatial filter to obtain a de-noised image sequence. The proposed motion detection algorithm has been evaluated using two criteria: (1) robustness to noise and to changing illumination conditions and (2) motion blur in temporal recursive de-noising.
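    The idea that the amount of temporal filtering decreases as the confidence in motion increases can be sketched as follows. This is a deliberately simplified illustration and not the scheme of [17]: the membership function, its thresholds `low` and `high`, the per-pixel (rather than sliding-window) decision and the blending constants are all assumptions.

```python
def motion_membership(diff, low=5.0, high=25.0):
    """Degree of 'motion' in [0, 1] for an absolute temporal intensity change."""
    if diff <= low:
        return 0.0
    if diff >= high:
        return 1.0
    return (diff - low) / (high - low)

def temporal_filter(frames):
    """Recursively filter a sequence of frames (lists of rows of gray levels)."""
    previous = [row[:] for row in frames[0]]
    output = [previous]
    for frame in frames[1:]:
        current = []
        for row, prev_row in zip(frame, previous):
            new_row = []
            for g, p in zip(row, prev_row):
                m = motion_membership(abs(g - p))
                # High motion confidence: keep the new pixel (little filtering).
                # Low confidence: average strongly with the filtered past.
                alpha = 0.2 + 0.8 * m
                new_row.append(alpha * g + (1 - alpha) * p)
            current.append(new_row)
        output.append(current)
        previous = current
    return output
```

    A small intensity change is treated as noise and smoothed towards the previous filtered value, while a large change is passed through unfiltered, which limits motion blur.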

    Speech and Audio Processing

    Speech processing is the study of speech signals and of the methods used to process these signals. The signals are usually processed in a digital representation, whereby speech processing can be seen as the intersection of digital signal processing and natural language processing. It can be divided into the following categories: (1) Speech recognition, which deals with analysis of the linguistic content of a speech signal; (2) Speaker recognition, where the aim is to recognize the identity of the speaker; (3) Speech coding, a specialized form of data compression that is important in the telecommunication area; (4) Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords; (5) Speech synthesis (i.e., the artificial synthesis of speech), which usually means computer-generated speech; and (6) Speech enhancement (e.g., audio noise reduction), which deals with enhancing the perceptual quality of the speech signal by removing the destructive effects of noise, limited-capacity recording equipment, impairments, etc. The reader may refer to [64] for an extensive overview of the advances in pattern recognition for speech and audio processing.

    The feasibility of converting text into speech using an inexpensive computer with minimal memory is of great interest. Speech synthesizers have been developed for many popular languages (e.g., English, Chinese, Spanish, French, etc.), but designing a speech synthesizer for a language is largely dependent on the language structure. Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. Karaali et al. [56] described a system that uses a Time-Delay Neural Network (TDNN) to perform this phonetic-to-acoustic mapping, with another neural network to control the timing of the generated speech. The neural network system requires less memory than a concatenation system, and performed well in tests comparing it to commercial systems using other technologies. It is reported that the neural network approach to speech synthesis offers the benefits of language portability, natural-sounding speech and low storage requirements, as well as better voice quality than traditional approaches.

    Hendessi et al. [55] developed a Persian synthesizer that includes an innovative text analyzer module. In the synthesizer, the text is segmented into words and, after preprocessing, a neural network is passed over each word. In addition to preprocessing, a new model (SEHMM) is used as a post-processor to compensate for errors generated by the neural network. The performance of the proposed model is verified and the intelligibility of the synthetic speech is assessed via listening tests.

    Neural networks can be used to synthesize speech from a phonetic representation by generating frames of input to a vocoder. This requires the neural network to compute one output for each frame of speech passed to the vocoder, which can be computationally expensive. Corrigan et al. [57] introduced an alternative implementation that models the speech as a series of gestures, and lets the neural network generate parameters describing the transitions of the vocoder parameters during these gestures. Their experiments have shown that acceptable speech quality is produced when each gesture is half of a phonetic segment and the transition model is a set of cubic polynomials describing the variation of each vocoder parameter during the gesture. Empirical results reveal a significant reduction in the computational cost.
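    The saving comes from evaluating the network once per gesture and then expanding its output cheaply into frames. The sketch below shows only that expansion step for a cubic transition model; the parameter name `f0`, the normalised time axis and the coefficient layout are hypothetical choices and are not taken from [57].

```python
def cubic(coeffs, t):
    """Evaluate c0 + c1*t + c2*t**2 + c3*t**3 with Horner's rule."""
    c0, c1, c2, c3 = coeffs
    return ((c3 * t + c2) * t + c1) * t + c0

def render_gesture(gesture, n_frames):
    """Expand one gesture into per-frame vocoder parameter values.

    `gesture` maps a (hypothetical) vocoder parameter name to its four cubic
    coefficients; normalised time t runs from 0 to 1 across the gesture.
    """
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1) if n_frames > 1 else 0.0
        frames.append({name: cubic(c, t) for name, c in gesture.items()})
    return frames
```

    For example, `render_gesture({"f0": (100.0, 0.0, 60.0, -40.0)}, 3)` yields a smooth transition of the assumed parameter from 100 through 110 to 120 over three frames, at the cost of a few multiplications per frame.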

    Frankel et al. [60] described a speech recognition system which uses articulatory parameters as basic features and phone-dependent linear dynamic models. The system first estimates articulatory trajectories from the speech signal. Estimates of the x and y coordinates of seven actual articulator positions in the midsagittal plane are produced every 2 ms by a recurrent neural network trained on real articulatory data. The output of this network is then passed to a set of linear dynamic models, which perform phone recognition.


    In recent years, the features derived from posteriors of a Multilayer Perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. Cetin et al. [105] illustrated on a relatively small data set that MLPs trained for articulatory feature classification can be equally effective. They provided a similar comparison using MLPs trained on a much larger data set: 2,000 h of English conversational telephone speech. The authors also explored how portable phone-based and articulatory-feature-based tandem features are in an entirely different language, Mandarin, without any retraining. It is reported that while phone-based features perform slightly better in the matched-language condition, they perform significantly better in the cross-language condition. Yet, in the cross-language condition, neither approach is as effective as the tandem features extracted from an MLP trained on a relatively small amount of in-domain data. Beyond feature concatenation, Cetin et al. explored novel observation modeling schemes that allow for greater flexibility in combining the tandem and standard features at hidden Markov model (HMM) outputs.

    Halavati et al. [42] presented a novel approach to speech recognition using fuzzy modeling. The task begins with the conversion of the speech spectrogram into a linguistic description based on arbitrary colors and lengths. Phonemes are also described using these fuzzy measures, and recognition is done by normal fuzzy reasoning, while a genetic algorithm optimizes the phoneme definitions so as to classify samples into the correct phonemes. The method is tested on a standard speech database and the results are presented.

    One of the factors complicating work with speech signals is their large degree of acoustic variability. To decrease the influence of this acoustic variability, the use of genetic algorithms in speech processing systems has been proposed. Bovbel and Tsishkoual [43] constructed a model which implements speech recognition using genetic algorithms. They carried out experiments on their model with a database of isolated Belarussian words and reported optimal results.

    Ding [49] presented a fuzzy control mechanism for conventional Maximum Likelihood Linear Regression (MLLR) speaker adaptation, called FLC-MLLR, by which the effect of MLLR adaptation is regulated according to the availability of adaptation data in such a way that the advantage of MLLR adaptation can be fully exploited when the training data are sufficient, and the consequences of poor MLLR adaptation are restrained otherwise. The robustness of MLLR adaptation against data scarcity is thus ensured. It is reported that the proposed mechanism is conceptually simple, computationally inexpensive and effective; the experiments on recognition rate show that FLC-MLLR outperforms standard MLLR, especially when encountering data insufficiency, and performs better than MAPLR at much lower computing cost.

    Kostek and Andrzej [47] discussed some limitations of the hearing-aid fitting process. In the fitting process, an audiologist performs tests on the wearer of the hearing aid, which is then adjusted based on the results of the tests, with the goal of making the device work as well as it can for that individual. Traditional fitting procedures employ specialized testing devices which use artificial test signals. Ideally, however, the fitting of hearing aids should also simulate real-world conditions, such as listening to speech in the presence of background noise. Therefore, more satisfying and reliable fitting tests may be achieved through the use of multimedia computers equipped with a properly calibrated sound system. Kostek and Andrzej developed a new automatic system for fitting hearing aids. It employs fuzzy logic: a computer makes choices for adjusting the hearing aid's settings by analyzing the patient's responses to questions, where the replies can lie somewhere between a simple yes and a simple no.

    With the increase in access to multimedia computers, speech training can be made available to patients with no continuous assistance required from speech therapists. Another function such a system can easily perform is screening tests of speech fluency, providing directed information to patients who have various speech disorders and problems with understanding speech. Andrzej et al. [51] programmed a speech therapy training algorithm consisting of diagnostic tools and rehabilitation devices connected with it. The first function the system has to perform is data acquisition, where information about the patient's medical history is collected. This is done through electronic questionnaires. The next function is analysis of the speech signal articulated by the patient when prompted by the computer, followed by some multimedia tests carried out in order to assess the subject's ability to understand speech. Next, the results of the electronic questionnaire, the patient's voice and the patient's reactions are automatically analyzed. Based on that, the system automatically diagnoses possible speech disorders and their severity. Tests on a large number of school children were reported.

    The process of counting stuttering events could be carried out more objectively through the automatic detection of stop-gaps, syllable repetitions and vowel prolongations. The alternative would be based on subjective evaluations of speech fluency and may depend on the subjective evaluation method used. Meanwhile, the automatic detection of intervocalic intervals, stop-gaps, voice onset time and vowel durations may depend on the speaker, and rules derived for a single speaker might be unreliable when treated as universal ones. This implies that learning algorithms with strong generalization capabilities could be applied to solve the problem. Nevertheless, such a system requires vectors of parameters which characterize the distinctive features in a subject's speech patterns. In addition, an appropriate selection of the parameters and feature vectors during learning may augment the performance of an automatic detection system. Andrzej et al. [52] reported on the automatic recognition of stuttered speech in normal and frequency-altered feedback speech. The work presents several methods of analyzing stuttered speech and describes attempts to establish those parameters that represent a stuttering event. It also reports the results of some experiments on the automatic detection of speech disorder events that were based on both rough sets and artificial neural networks.

    Andrzej and Marek [54] presented a method for pitch estimation enhancement. Pitch estimation methods are widely used for extracting musical data from digital signals, and a brief review of these methods is included in the paper. However, since the processed signal may contain noise and distortions, the estimation results can be erroneous. The proposed method was developed in order to overcome the disadvantages of standard pitch estimation algorithms. The introduced approach is based on both pitch estimation in terms of signal processing and pitch prediction based on musical knowledge modeling. First, the signal is partitioned into segments roughly analogous to consecutive notes. Thereafter, an autocorrelation function is calculated for each segment. The autocorrelation function values are then altered using the pitch predictor output. A music predictor based on artificial neural networks was introduced for this task. A description of the proposed pitch estimation enhancement method is included and some details concerning music prediction are discussed.
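    Autocorrelation-based pitch estimation for a single segment, with the autocorrelation values re-weighted by a predictor, can be sketched as follows. This is an illustration only: the search range, the multiplicative weighting and the `prior` callback (a stand-in for the output of a pitch predictor) are assumptions, and the neural music predictor of [54] is not reproduced.

```python
import math

def autocorrelation(signal, lag):
    """Unnormalised autocorrelation of a segment at a given lag (in samples)."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))

def estimate_pitch(signal, sample_rate, f_min=80.0, f_max=1000.0, prior=None):
    """Return the pitch (Hz) whose lag maximises the (weighted) autocorrelation.

    `prior`, if given, maps a candidate frequency in Hz to a weight; it plays
    the role of a pitch predictor that alters the autocorrelation values.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = None, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = autocorrelation(signal, lag)
        if prior is not None:
            score *= prior(sample_rate / lag)
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag
```

    Without a prior, a clean 200 Hz sine sampled at 8 kHz is estimated at 200 Hz; a prior that strongly favours 100 Hz can move the estimate to the octave below, which shows both the power and the risk of letting musical expectations alter the signal evidence.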

    Liu et al. [48] proposed an improved hybrid support vector machine and duration distribution based hidden Markov (SVM/DDBHMM) decision fusion model for robust continuous digital speech recognition. They investigated the combination of the probability outputs of a Support Vector Machine and a Gaussian mixture model in pattern recognition (called FSVM), and the embedding of the fused probability as a similarity measure in the phone-state-level decision space of the Duration Distribution Based Hidden Markov Model (DDBHMM) speech recognition system (named FSVM/DDBHMM). The performance of FSVM and FSVM/DDBHMM was demonstrated on the Iris database and on a continuous Mandarin digital speech corpus in four noise environments (white, volvo, babble and destroyer-engine) from NOISEX-92. The experimental results show the effectiveness of FSVM on the Iris data, and an average word error rate reduction for FSVM/DDBHMM of 6% to 20% compared with the DDBHMM baseline at signal-to-noise ratios (SNRs) from 5 dB to 30 dB in steps of 5 dB.

    Andrzej [50] investigated methods for identifying the direction of an incoming acoustical signal in the presence of noise and reverberation. Since the problem is a non-deterministic one, two learning algorithms, namely neural networks and rough sets, were applied to solve it. Accordingly, two sets of parameters were formulated in order to discern the target source from unwanted sound source positions, and were then processed by the learning algorithms. The applied feature extraction methods are discussed, the training processes are described, and the sound source localization results obtained are demonstrated and compared.

    Kostek et al. [53] presented automatic singing voice recognition using neural networks and rough sets. For this purpose, a database containing singers' sample recordings was constructed, and parameters were extracted from recorded voices of trained and untrained singers of various voice types. Parameters that are specially designed for the analysis of the singing voice are described and their physical interpretation is given. Decision systems based on artificial neural networks and rough sets are used for automatic voice type/voice quality classification.

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. Selouani and Shaughnessy [25] proposed a hybrid enhancement noise reduction approach in the cepstral domain in order to obtain less-variant parameters. It is based on the Karhunen-Loeve Transform (KLT) in the mel-frequency domain combined with a Genetic Algorithm (GA). The enhanced parameters increased the recognition rate in highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperformed the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to 4 dB. They also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

    CI in Speech Emotion Recognition

    Speech emotion recognition is becoming more and more important in computer application fields such as health care and children's education. Only a few studies on speech emotion recognition using methods such as ANN and SVM have been carried out in recent years. Feature sets are broadly discussed within speech emotion recognition by acoustic analysis. While popular filter- and wrapper-based searches help to retrieve relevant features, automatic generation of such features allows for more flexibility throughout the search. The basis is formed by dynamic Low-Level Descriptors considering intonation, intensity, formants, spectral information and others. Next, systematic derivation of prosodic, articulatory and voice quality high-level functionals is performed by descriptive statistical analysis. From there on, feature alterations are carried out automatically to find an optimal representation within the feature space in view of a target classifier. In addition, the traditional feature selection methods used in speech emotion recognition are computationally too expensive to determine an optimum or suboptimum feature subset. Several studies have addressed these problems. For example, Zhou et al. [40] presented a novel approach based on rough set theory and SVM for speech emotion recognition. The experimental results illustrated that the introduced approach can reduce the calculation cost while keeping a high recognition rate. Also, Schuller et al. [61] suggested the use of evolutionary programming to avoid an NP-hard exhaustive search.

    Fellenz et al. [44] proposed a framework for the processing of face image sequences and speech, using different dynamic techniques to extract appropriate features for emotion recognition. The features were used by a hybrid classification procedure, employing neural network techniques and fuzzy logic, to accumulate the evidence for the presence of an emotional expression in the face and the speaker's voice.

    Buscicchio et al. [19] proposed a biologically plausible methodology for the problem of emotion recognition, based on the extraction of vowel information from an input speech signal and on the classification of the extracted information by a spiking neural network. Initially, a speech signal is segmented into vowel parts, which are represented with a set of salient features related to the Mel-frequency cepstrum. The vowel parts are then classified by a spiking neural network into five different emotion classes.

    Audio-Visual Speech Recognition

    Audio-Visual Speech Recognition (AVSR) [63] is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing non-deterministic phones or in giving preponderance among decisions of near-equal probability. The great interest in research on AVSR systems is driven by the increase in the number of multimedia applications that require robust speech recognition systems. The use of visual features in AVSR is justified both by the audio and visual modality of speech generation and by the need for features that are invariant to acoustic noise perturbation. The performance of an AVSR system relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region. Mouth tracking therefore plays a major role in AVSR systems. Moreover, a human listener can use visual cues, such as lip and tongue movements, to enhance the level of speech understanding, especially in a noisy environment. The process of combining the audio modality and the visual modality is referred to as speech reading, or lip reading. There are many applications in which it is desired to recognize speech in extremely adverse acoustic environments. Detecting a person's speech from a distance or through a glass window, understanding a person speaking among a very noisy crowd of people, and monitoring speech over a TV broadcast when the audio link is weak or corrupted are some examples. Computational intelligence techniques play an important role in this research direction, and a number of CI-based AVSR methods have been proposed in the literature. For example, Lim et al. [39] presented an improved version of a mouth tracking technique using a radial basis function neural network (RBF NN), with applications to AVSR systems. A modified extended Kalman filter (EKF) was used to adjust the parameters of the RBF NN. Simulation results revealed good performance of the proposed method.

    Automatic Speech Recognition (ASR) performs well under restricted conditions, but performance degrades in noisy environments. AVSR combats this by incorporating a visual signal into the recognition. Lewis and Powers [62] discussed how to improve the performance of a standard speech recognition system by using information from the traditional auditory signals as well as from visual signals. Using knowledge from psycholinguistics, a late integration network was developed that fused the auditory and visual sources. An important first step in AVSR is feature extraction from the mouth region, and a technique developed by the authors is briefly presented. The authors examined how useful this extraction technique, in combination with several integration architectures, is at the given task, demonstrated that vision does in fact assist speech recognition when used in a linguistically guided fashion, and gave insight into remaining issues.
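    Late integration means that each modality is recognised separately and only the resulting class scores are fused. A generic sketch of such decision-level fusion is given below; the weighted sum of log-likelihoods, the `audio_weight` parameter and the example labels are assumptions for illustration and do not describe the network of [62].

```python
def late_fusion(audio_scores, visual_scores, audio_weight):
    """Fuse per-class log-likelihoods from two independent recognisers.

    `audio_weight` in [0, 1] expresses how much the audio stream is trusted;
    in practice it could be lowered as the acoustic noise level rises.
    """
    combined = {
        label: audio_weight * audio_scores[label]
        + (1.0 - audio_weight) * visual_scores[label]
        for label in audio_scores
    }
    best = max(combined, key=combined.get)
    return best, combined
```

    With audio scores `{"ba": -1.0, "da": -2.0}` and visual scores `{"ba": -3.0, "da": -0.5}`, a weight of 0.9 (clean audio) selects "ba", while a weight of 0.3 (noisy audio) lets the visual evidence win and selects "da".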

    Alessandro et al. [38] focused on the problem of classifying audio into speech and music for multimedia applications. In particular, they presented a comparison between two different techniques for speech/music discrimination. The first method is based on the zero crossing rate and Bayesian classification. It is very simple from a computational point of view and gives good results for pure music or pure speech. The simulation results show, however, that performance degrades when the music segment also contains speech superimposed on the music, or strong rhythmic components. To overcome these problems, they proposed a second method that uses more features and is based on neural networks (specifically a multi-layer perceptron). It is reported that this algorithm obtains better performance at the expense of a limited growth in computational complexity. In practice, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even on low-cost embedded systems.
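    The zero crossing rate used as a feature by the first method can be illustrated with a minimal sketch. The function names, the frame length and the toy signals below are assumptions made for illustration; the Bayesian classifier that [38] applies to this feature is not reproduced here.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_zcrs(signal, frame_len):
    """ZCR of each non-overlapping frame of `frame_len` samples."""
    return [
        zero_crossing_rate(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


# Toy signals (assumed): a rapidly alternating one and a slowly varying one.
noisy = [1, -1, 1, -1, 1, -1, 1, -1]
smooth = [1, 2, 3, 2, 1, -1, -2, -1]
print(frame_zcrs(noisy, 4))
print(frame_zcrs(smooth, 4))
```

    A classifier would then operate on statistics of these per-frame values, for instance their mean and variance over a segment; that choice is also an assumption rather than a detail taken from [38].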

    Speech recognition techniques have developed dramatically in recent years. Nevertheless, errors caused by environmental noise are still a serious problem in recognition. Algorithms that detect and follow the motion of the lips have been widely used to improve the performance of speech recognition algorithms. Vahideh and Yaghmaie [65] presented a simple and efficient method for extracting visual features of the lips to recognize vowels using neural networks. Its accuracy was verified by using it to recognize the six main Farsi vowels.

    Faraj and Bigun [41] described a new identity authentication technique based on the synergetic use of lip motion and speech. Lip motion is defined as the distribution of apparent velocities in the movement of brightness patterns in an image and is estimated by computing the velocity components of the structure tensor by 1D processing in 2D manifolds. Since the velocities are computed without extracting the speaker's lip contours, more robust visual features can be obtained than with motion features extracted from lip contours. The motion estimations are performed in a rectangular lip region, which affords increased computational efficiency. A person authentication implementation based on lip movements and speech is presented, along with experiments exhibiting a recognition rate of 98%. Besides its value in authentication, the technique can be used naturally to evaluate the liveness of someone speaking, as it can be used in text-prompted dialogue. The XM2VTS database was used for performance quantification, as it is currently the largest publicly available database (300 persons) containing both lip motion and speech. Comparisons with other techniques are presented.

    Shan Meng and Youwei Zhang [58] described a method of visual speech feature area localization. First, they proposed a simplified human skin color model to segment input images and estimate the location of the human face. The authors then proposed a new localization method that combines SVM with the Distance of Likelihood in Feature Space (DLFS) derived from Kernel Principal Component Analysis (KPCA). Results show that the introduced method outperformed traditional linear ones. All experiments were based on the Chinese Audio-Visual Speech Database (CAVSD).

    4 Computational Intelligence in Video Processing

    Edge extraction, texture classification, face recognition, character recognition, fingerprint identification, image/video enhancement, image/video segmentation and clustering, and image/video coding are some of the applications of computational intelligence in image processing. Here we present some reported examples of using CI techniques in multimedia processing and, in particular, in image/video processing. There has been much recent research interest in this area, and many successful approaches have been reported. We review some of this work to illustrate how CI can be applied to the video segmentation problem.

    Computational Intelligence in Video Segmentation

    Successful video segmentation is necessary for most multimedia applications. In order to analyze a video sequence, it is necessary to break it down into meaningful units that are of smaller length and have some semantic coherence. Video segmentation is the process of dividing a sequence of frames into smaller meaningful units that represent information at the scene level. This process serves as a fundamental step towards any further content analysis of video frames. In the past, several statistical methods that compare frame differences have been published in the literature, and a range of similarity measures between frames based on gray-scale intensity, color and texture have been proposed. Here we review some successful work on using CI techniques in video segmentation.
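    As a point of reference for the CI-based methods reviewed next, the classical frame-difference idea can be sketched as a thresholded gray-level histogram comparison. The bin count, the threshold value and the toy frames are assumptions chosen for illustration and do not come from any of the cited works.

```python
def gray_histogram(frame, bins=4, max_value=256):
    """Normalized histogram of a frame given as a flat list of gray levels."""
    hist = [0] * bins
    for pixel in frame:
        hist[pixel * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in hist]


def histogram_difference(h1, h2):
    """Sum of absolute bin differences: 0 for identical histograms, at most 2."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Indices i where frame i differs strongly from frame i - 1."""
    hists = [gray_histogram(f) for f in frames]
    return [
        i for i in range(1, len(hists))
        if histogram_difference(hists[i - 1], hists[i]) > threshold
    ]


# Toy video (assumed): three dark frames followed by two bright frames.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
video = [dark, dark, dark, bright, bright]
print(detect_cuts(video))  # [3]
```

    A single fixed threshold is the weak point of this baseline: gradual transitions such as fades and dissolves produce small differences spread over many frames, which is one motivation for the learning-based and fuzzy approaches below.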

    The organization of video information in video databases requires automatic temporal segmentation with minimal user interaction. Neural networks are capable of learning the characteristics of various video segments and clustering them accordingly. Cao and Suganthan [27] developed a neural network based technique to segment a video sequence into shots automatically and with a minimum number of user-defined parameters. They proposed to employ Growing Neural Gas (GNG) networks and to integrate multiple frame difference features to efficiently detect shot boundaries in the video. Experimental results were presented to illustrate the good performance of the proposed scheme on real video sequences.

    Lo and Wang [26] proposed a video segmentation method using a Histogram-Based Fuzzy C-Means (HBFCM) clustering algorithm. This algorithm is a hybrid of two approaches and is composed of three phases: the feature extraction phase, the clustering phase, and the key-frame selection phase. In the first phase, differences between color histograms are extracted as features. In the second phase, Fuzzy C-Means (FCM) is used to group the features into three clusters: the Shot Change (SC) cluster, the Suspected Shot Change (SSC) cluster, and the No Shot Change (NSC) cluster. In the last phase, shot change frames are identified from the SC and SSC clusters and then used to segment video sequences into shots. Finally, key frames are selected from each shot. The authors' simulation results indicate that the HBFCM clustering algorithm is robust and applicable to various types of video sequences.
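    The clustering phase can be illustrated with a minimal sketch of standard fuzzy c-means applied to scalar histogram differences. This is plain FCM under assumed settings (fuzzifier m = 2, a fixed number of iterations, centers initialized at the minimum, midpoint and maximum of the data); it is not the HBFCM implementation of [26], and the input values are invented toy data.

```python
def fuzzy_c_means_1d(values, centers, m=2.0, iterations=50):
    """Plain fuzzy c-means on scalar features; returns (centers, memberships)."""
    centers = list(centers)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            # A small floor avoids division by zero when x sits on a center.
            dists = [max(abs(x - c), 1e-12) for c in centers]
            row = [
                1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                for d_i in dists
            ]
            memberships.append(row)
        # Center update: membership-weighted mean of the features.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            centers[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships


# Assumed toy features: histogram differences between consecutive frames.
diffs = [0.01, 0.02, 0.03, 0.45, 0.50, 0.55, 0.90, 0.95, 1.00]
init = [min(diffs), (min(diffs) + max(diffs)) / 2, max(diffs)]
centers, memberships = fuzzy_c_means_1d(diffs, init)
labels = ["NSC", "SSC", "SC"]  # ordered by increasing center value
for x, row in zip(diffs, memberships):
    print(x, labels[row.index(max(row))])
```

    Each feature receives a graded membership in all three clusters, so frames in the SSC cluster can be passed to a further check instead of being forced into a hard yes/no decision.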

    Ford [20] presented a fuzzy logic system for the detection and classification of shot boundaries in uncompressed video sequences. It integrates multiple sources of information and knowledge of editing procedures to detect shot boundaries. Furthermore, the system classifies the editing process employed to create the shot boundary into one of the following categories: abrupt cut, fade-in, fade-out, or dissolve. This system was tested on a database containing a wide variety of video classes. It achieved combined recall and precision rates that significantly exceed those of existing threshold-based techniques, and it correctly classified a high percentage of the detected boundaries.

    Video temporal segmentation is normally the first, and an important, step for content-based video applications. Many features, including pixel differences, color histograms, motion, and edge information, have been widely used and reported in the literature to detect shot cuts inside videos. Although existing research on shot cut detection is active and extensive, it remains a challenge to achieve accurate detection of all types of shot boundaries with one single algorithm. Hui Fang et al. [24] proposed a fuzzy logic approach to integrate hybrid features for detecting shot boundaries inside general videos. The fuzzy logic approach contains two processing modes: one is dedicated to the detection of abrupt shot cuts, including short dissolved shots, and the other to the detection of gradual shot cuts. These two modes are unified by a mode selector that decides which mode the scheme should work in to achieve the best possible detection performance. Using the publicly available test data set from Carleton University, extensive experiments were carried out, and the test results illustrate that the proposed algorithm outperforms representative existing algorithms in terms of precision and recall rates.
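    The precision and recall rates used to compare shot boundary detectors can be computed as in the following sketch. The matching rule (each true boundary may be matched by at most one detection within a tolerance of a few frames) is an assumption made for illustration; the cited studies may use a different matching convention.

```python
def precision_recall(detected, ground_truth, tolerance=0):
    """Match each true boundary to at most one detection within `tolerance` frames."""
    unmatched = sorted(detected)
    true_positives = 0
    for boundary in sorted(ground_truth):
        for candidate in unmatched:
            if abs(candidate - boundary) <= tolerance:
                unmatched.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Assumed toy data: frame indices of true and detected shot boundaries.
truth = [30, 120, 250, 400]
found = [31, 120, 300]
print(precision_recall(found, truth, tolerance=2))
```

    Precision penalizes false alarms (the detection at frame 300), while recall penalizes missed boundaries (frames 250 and 400), so the two rates are normally reported together.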

    Mitra [71] proposed an evolutionary rough c-means clustering algorithm. Genetic algorithms are employed to tune the threshold and the relative importance of the upper and lower approximations of the rough sets modeling the clusters. The Davies-Bouldin clustering validity index is used as the fitness function, which is minimized while arriving at an optimal partitioning. A comparative study of its performance is made with related partitive algorithms. The effectiveness of the algorithm is demonstrated on real and synthetic data sets, including microarray gene expression data from bioinformatics. In the same study, the author noted that the parameter threshold measures the relative distance of an object $X_k$ from a pair of clusters having centroids $\mathrm{cen}_i$ and $\mathrm{cen}_j$. The smaller the value of threshold, the more likely is $X_k$ to lie within the rough boundary (between upper and lower approximations) of a cluster. This implies that only those points which definitely belong to a cluster (lie close to the centroid) occur within the lower approximation. A large value of threshold implies a relaxation of this criterion, such that more patterns are allowed to belong to any of the lower approximations. The parameter $w_{low}$ controls the importance of the objects lying within the lower approximation of a cluster in determining its centroid. A lower $w_{low}$ implies a higher $w_{up}$, and hence an increased importance of patterns located in the rough boundary of a cluster towards the positioning of its centroid.

    Das et al. [103] presented a framework to hybridize rough set theory with the particle swarm optimization (PSO) algorithm. The hybrid rough-PSO technique has been used for grouping the pixels of an image in its intensity space. Medical images very often become corrupted with noise. Fast and efficient segmentation of such noisy images (which is essential for their further interpretation in many cases) has remained a challenging problem for years. In their work, they treat image segmentation as a clustering problem. Each cluster is modeled with a rough set. PSO is employed to tune the threshold and the relative importance of the upper and lower approximations of the rough sets. The Davies-Bouldin clustering validity index is used as the fitness function, which is minimized while arriving at an optimal partitioning.
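    The Davies-Bouldin index that serves as the fitness function in both of the above approaches can be sketched as follows. This is the standard definition with Euclidean distance applied to a crisp partition; how [71] and [103] evaluate it on rough clusters is not detailed here, so the code should be read as an illustration of the index only.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        )
    return total / len(clusters)


# Assumed toy partitions of the same four 1D points.
compact = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
mixed = [[(0.0,), (10.0,)], [(2.0,), (12.0,)]]
print(davies_bouldin(compact))
print(davies_bouldin(mixed))
```

    Lower values indicate compact, well-separated clusters, which is why an optimizer such as a genetic algorithm or PSO minimizes the index when tuning the threshold and $w_{low}$.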

    Raducanu et al. [106] proposed Morphological Neural Networks (MNN) as associative memories (in two cases: autoassociative and heteroassociative). They propose their use as a preprocessing step for human shape detection in a vision-based navigation problem for mobile robots. It is reported that MNNs can be trained in a single computing step, possess unlimited storage capacity, and have perfect recall of the patterns. Recall is also very fast, because MNN recall does not involve the search for an energy minimum.

    Adaptation of C-Means to Rough Set Theory

    C-means clustering is an iterative technique that is used to partition an image into C clusters. Fuzzy C-Means (FCM) is one of the most commonly used fuzzy clustering techniques for different degree estimation problems, especially in medical image processing [104, 107, 116]. Lingras [70] described modifications of clustering based on Genetic Algorithms, the K-means algorithm, and Kohonen Self-Organizing Maps (SOM). These modifications make it possible to represent clusters as rough sets [97]. In his work, Lingras established a rough k-means framework and extended the concept of c-means by viewing each cluster as an interval or rough set [69]. Here is a brief summary of his pioneering clustering work.

    K-means clustering is one of the most popular statistical clustering techniques used in segmentation of medical images [66, 72, 94, 108-110]. The name K-means originates from the means of the k clusters that are created from n objects. Let us assume that the objects are represented by m-dimensional vectors. The objective is to assign these n objects to k clusters. Each of the clusters is also represented by an m-dimensional vector, which is the centroid or mean vector for that cluster. The process begins by randomly choosing k objects as the centroids of the k clusters. The objects are assigned to one of the k clusters based on the minimum value of the distance $d(v, x)$ between the object vector $v = (v_1, \ldots, v_j, \ldots, v_m)$ and the cluster vector $x = (x_1, \ldots, x_j, \ldots, x_m)$. After the assignment of all the objects to various clusters, the new centroid vectors of the clusters are calculated as

    \[
    x_j = \frac{\sum_{v \in x} v_j}{SOC}, \qquad 1 \le j \le m, \tag{8}
    \]

    where SOC is the size of cluster x.
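    One assignment step followed by the centroid update of Eq. (8) can be sketched as below. Euclidean distance is assumed for $d(v, x)$, the initial centroids are fixed instead of randomly chosen so that the result is reproducible, and the sketch assumes that no cluster becomes empty.

```python
import math


def assign(objects, centroids):
    """Index of the nearest centroid for each object (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))
        for v in objects
    ]


def update_centroids(objects, labels, k):
    """Eq. (8): each centroid component is the cluster's component-wise mean."""
    centroids = []
    for i in range(k):
        members = [v for v, label in zip(objects, labels) if label == i]
        soc = len(members)  # SOC, the size of the cluster
        centroids.append(tuple(sum(col) / soc for col in zip(*members)))
    return centroids


# Assumed toy data: four 2D objects and a fixed initial choice of centroids.
objects = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [objects[0], objects[2]]
labels = assign(objects, centroids)
print(labels)
print(update_centroids(objects, labels, k=2))
```

    Repeating the two steps until the labels stop changing gives the conventional K-means procedure that the rough variant modifies.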

    Lingras [70] mentioned that incorporating rough sets into K-means clustering requires the addition of the concept of lower and upper bounds. The calculation of the centroids of clusters from conventional K-means needs to be modified to include the effects of lower as well as upper bounds. The modified centroid calculations for rough sets are then given by:

    \[
    \mathrm{cen}_j = w_{low} \times \frac{\sum_{v \in \underline{R}(x)} v_j}{|\underline{R}(x)|} + w_{up} \times \frac{\sum_{v \in BN_R(x)} v_j}{|BN_R(x)|}, \tag{9}
    \]

    where $1 \le j \le m$, $\underline{R}(x)$ is the lower bound of cluster x, and $BN_R(x)$ is its boundary region. The parameters $w_{low}$ and $w_{up}$ correspond to the relative importance of the lower and upper bounds, and $w_{low} + w_{up} = 1$. If the upper bound of each cluster were equal to its lower bound, the clusters would be conventional clusters. In that case the boundary region $BN_R(x)$ is empty, the second term in the equation is ignored, and the equation reduces to the conventional centroid calculation. The next step in the modification of the K-means algorithm for rough sets is to design criteria to determine whether an object belongs to the upper or lower bound of a cluster. The main steps of the algorithm are provided in Algorithm 1.
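    The rough assignment rule and the centroid update of Eq. (9) can be put together in a short sketch. Several details are assumptions made for illustration: Euclidean distance is used; an object is placed in the boundary when the gap between its distance to another centroid and its distance to the nearest centroid is below threshold; a plain mean is used when the lower bound or the boundary of a cluster is empty; and iteration stops when the centroids no longer change. The evolutionary or PSO tuning of threshold and $w_{low}$ is not shown, and no cluster is assumed to become completely empty.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def rough_assign(objects, centroids, threshold):
    """Return (lower, boundary): per-cluster lists of objects."""
    k = len(centroids)
    lower = [[] for _ in range(k)]
    boundary = [[] for _ in range(k)]
    for x in objects:
        dists = [math.dist(x, c) for c in centroids]
        nearest = min(range(k), key=dists.__getitem__)
        # Clusters almost as close as the nearest one (gap below threshold).
        close = [j for j in range(k)
                 if j != nearest and dists[j] - dists[nearest] < threshold]
        if close:
            for j in [nearest] + close:
                boundary[j].append(x)
        else:
            lower[nearest].append(x)
    return lower, boundary


def rough_centroids(lower, boundary, w_low):
    """Eq. (9), falling back to a plain mean when one region is empty."""
    w_up = 1.0 - w_low
    centroids = []
    for low, bnd in zip(lower, boundary):
        if low and bnd:
            m_low, m_bnd = mean(low), mean(bnd)
            centroids.append(tuple(w_low * a + w_up * b
                                   for a, b in zip(m_low, m_bnd)))
        else:
            centroids.append(mean(low or bnd))
    return centroids


def rough_c_means(objects, centroids, threshold, w_low, max_iter=100):
    """Alternate rough assignment and centroid update until stable."""
    for _ in range(max_iter):
        lower, boundary = rough_assign(objects, centroids, threshold)
        updated = rough_centroids(lower, boundary, w_low)
        if updated == centroids:
            break
        centroids = updated
    return centroids, lower, boundary


# Assumed toy data: five 1D objects, one of them halfway between the clusters.
data = [(0.0,), (1.0,), (5.0,), (9.0,), (10.0,)]
cens, lower, boundary = rough_c_means(
    data, [(0.0,), (10.0,)], threshold=2.0, w_low=0.7)
print(cens)
print(lower)
print(boundary)
```

    In this toy run the ambiguous object at 5.0 ends up in the boundary of both clusters and pulls each centroid towards it with weight $w_{up} = 0.3$, while the remaining objects stay in the lower bounds.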

    Algorithm 1 Rough C-means Algorithm
     1: Set $x_i$ as the initial means for the c clusters.
     2: Initialize the population of particles encoding the parameters threshold and $w_{low}$.
     3: Assign each data object $x_k$ to the lower approximation or upper approximation of clusters $c_i$ by computing the difference in its distances:
            $d_i = d(x_k, \mathrm{cen}_i) - d(x_k, \mathrm{cen}_j)$,   (10)
        where $\mathrm{cen}_i$ and $\mathrm{cen}_j$ are the cluster centroid pairs.
     4: if $d_i <$ threshold then
     5:     $x_k \in$ the upper approximations of the $\mathrm{cen}_i$ and $\mathrm{cen}_j$ clusters and cannot be in any lower approximation.
     6: else
     7:     $x_k \in$ the lower approximation of the cluster $c_i$ such that the distance $d(x_k, \mathrm{cen}_i)$ is minimum over the c clusters.
     8: end if
     9: Compute a new mean using Eq. (9).
    10: repeat steps 3-9
    11: until convergence, i.e., there are no more new assignments.

    5 Computational Intelligence in Multimedia Watermarking

    Multimedia watermarking technology has evolved very quickly during the last few years. A digital watermark is information that is imperceptibly and robustly embedded in the host data such that it cannot be removed. A watermark typically contains information about the origin, status, or recipient of the host data. A digital watermarking system essentially consists of a watermark encoder and a watermark decoder. The watermark encoder inserts a watermark into the host signal, and the watermark decoder detects the presence of the watermark signal. An entity called the watermark key is used during the process of embedding and detecting watermarks. The watermark key has a one-to-one correspondence with the watermark signal (i.e., a unique watermark key exists for every watermark signal). The watermark key is private and known only to authorized parties, which ensures that only authorized parties can detect the watermark. Further, the communication channel can be noisy and hostile (i.e., prone to security attacks), and hence digital watermarking techniques should be resilient to both noise and security attacks. Figure 7 illustrates the general digital watermarking methodology.

    Fig. 7. General digital watermarking architecture [9]

    The development of watermarking methods involves several design trade-offs: (1) robustness, which deals with the ability of the watermark to resist attempts by an attacker to destroy it by modifying the size, rotation, quality, or other visual aspects of the video; (2) security, which deals with the ability of the watermark to resist attempts by a sophisticated attacker to remove it or destroy it via cryptanalysis, without modifying the media itself; and (3) perceptual fidelity, the perceived visual quality of the marked media compared to the original, unmarked video. Copyright protection is the most prominent application of watermarking techniques, but others exist, including data authentication by means of fragile watermarks, which are impaired or destroyed by manipulations; embedded transmission of value-added services within multimedia data; and embedded data labeling for purposes other than copyright protection, such as data monitoring and tracking. An example of a data-monitoring system is the automatic registration and monitoring of broadcast radio programs such that royalties are automatically paid to the IPR owners of the broadcast data. Many approaches addressing these problems have been reported. For example, Lou et al. [32] proposed a copyright protection scheme based on chaos and secret sharing techniques. Instead of modifying the original image to embed a watermark in it, the proposed scheme first extracts a feature from the image. Then, the extracted feature and the watermark are scrambled by a chaos technique. Finally, the secret sharing technique is used to construct a shadow image. The watermark can be retrieved by performing an XOR operation between the shadow images. It is reported that, compared with other works, the introduced scheme is secure and robust in resisting various attacks.
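    The XOR-based retrieval step can be illustrated with a deliberately simplified two-share sketch. It only shows why XOR-ing two shares returns the watermark while the host image itself is left unmodified; the feature extraction and chaotic scrambling of [32] are not reproduced, and the bit patterns and function names are assumptions.

```python
def xor_bits(a, b):
    """Element-wise XOR of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def make_shadow(feature_bits, watermark_bits):
    """Build the second share so that feature XOR shadow gives the watermark."""
    return xor_bits(feature_bits, watermark_bits)


def recover_watermark(feature_bits, shadow_bits):
    """XOR the two shares to obtain the watermark again."""
    return xor_bits(feature_bits, shadow_bits)


# Assumed toy data: a binary feature derived from the host image and a logo.
feature = [1, 0, 1, 1, 0, 0, 1, 0]
watermark = [0, 1, 1, 0, 1, 0, 0, 1]
shadow = make_shadow(feature, watermark)
print(recover_watermark(feature, shadow) == watermark)  # True
```

    Because the feature is recomputed from the (possibly attacked) image at verification time, the robustness of such a scheme depends on how stable the extracted feature is, which this sketch does not model.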

    Cao et al. [37] proposed a novel audio watermarking algorithm based on neural networks. By transforming the original audio sequence into the 1D wavelet domain and selecting proper positions, several watermark bits are embedded. Before transmission, the scheme uses neural networks to learn the relation between the original audio and the watermarked audio. Owing to the learning and adaptive capabilities of neural networks, the trained networks can almost exactly extract the watermark from the watermarked audio after audio processing attacks. Extensive experimental results showed that the proposed method is significantly robust. It is reported to be immune to attacks such as low-pass filtering, addition of noise, resampling, and median filtering.


    Wei Lu et al. [33] presented a robust digital image watermarking scheme using a neural network detector. First, the original image is divided into four subimages by subsampling. Then, a random binary watermark sequence is embedded into the DCT domain of these subimages. A fixed binary sequence is added to the head of the payload watermark to provide the samples for training the neural network detector. Because of its good adaptive and learning abilities, the neural network detector can extract the payload watermark nearly exactly. Experimental results illustrated good performance of the proposed scheme in resisting common signal processing attacks.

    Lou and Yin [30] proposed an adaptive digital watermarking approach based on a human visual system model and a fuzzy clustering technique. The human visual system model is used to guarantee that the watermark is imperceptible in the watermarked image, while the fuzzy clustering approach is employed to adapt the strength of the watermark to the local characteristics of the image. In their experiments, the scheme provides a more robust and transparent watermark.

    Cheng-Ri Piao et al. [34] proposed a new watermarking scheme in which a logo watermark is embedded into the Discrete Wavelet Transform (DWT) domain of a color image using Back-Propagation Neural networks (BPN). In order to strengthen imperceptibility and robustness, the original image is transformed from the RGB color space to a brightness and chroma space (YCrCb). After the transformation, the watermark is embedded into the DWT coefficients of the chroma components (CrCb). A secret key determines the locations in the image where the watermark is embedded. This prevents possible pirates from removing the watermark easily. The BPN learns the characteristics of the color image, and the watermark is then embedded and extracted using the trained neural network. Experimental results showed that the proposed method has good imperceptibility and high robustness to common image processing attacks.
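    The color space transformation that precedes embedding can be sketched with the widely used full-range BT.601 (JFIF-style) formulas. Which exact variant of the brightness/chroma transform is used in [34] is not stated here, so the constants below are an assumption; the DWT embedding and the neural network are not shown.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF-style) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse conversion, applied after the chroma channels are modified."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


# Assumed example pixel.
y, cb, cr = rgb_to_ycbcr(200, 100, 50)
print(y, cb, cr)
print(ycbcr_to_rgb(y, cb, cr))
```

    Separating brightness from chroma lets a scheme place the watermark in the chroma channels, where moderate changes tend to be less noticeable than changes in luminance.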

    Zheng Liu et al. [35] introduced sensor-based authentication watermarking with the concept of authentication on demand, in which user requirements are adopted as parameters for authentication. In addition, fuzzy identification of the outputs of multiple authentication sensors introduces the ability to finely tune the authentication type and degree. With this approach, authentication sensitivity to malicious attacks is enhanced. It is reported that the introduced approach is more robust against allowed modifications. In addition, the authors' algorithm provides a new function: detecting the attack method.

    Maher et al. [31] proposed a novel digital video watermarking scheme based on multi-resolution motion estimation and an artificial neural network. A multi-resolution motion estimation algorithm is adopted to preferentially allocate the watermark to coefficients containing motion. In addition, embedding and extraction of the watermark are based on the relationship between a wavelet coefficient and its neighbors. A neural network is used to memorize the relationships between coefficients in a 3 × 3 block of the image. Experimental results illustrated that a watermark embedded where the picture content is moving is less perceptible. Further, empirical results demonstrated that the proposed scheme is robust against common video processing attacks.

    Several discrete wavelet transform based techniques are used for watermarking digital images. Although these techniques are robust to some attacks, none of them is robust when a different set of parameters is used or when some other attacks (such as low-pass filtering) are applied. In order to make the watermark stronger and less susceptible to different types of attacks, it is essential to find the maximum amount of watermark that can be embedded before the watermark becomes visible. Davis and Najarian [111] used neural networks to implement an automated system for creating maximum-strength watermarks.

    Diego and Manuel [29] proposed an evolutionary algorithm for the enhancement of digital semi-fragile watermarking based on the manipulation of the image Discrete Cosine Transform (DCT). The algorithm searches for the optimal locations in the DCT of an image at which to place the DCT coefficients of the mark image. The problem is stated as a multi-objective optimization problem (MOP) that involves the simultaneous minimization of distortion and robustness criteria.

    Chang et al. [28] developed a novel transform domain digital watermarking scheme that uses a visually meaningful binary image as the watermark. The scheme embeds the watermark information adaptively, with localized embedding strength according to the noise sensitivity level of the host image. Fuzzy adaptive resonance theory (Fuzzy-ART) classification is used to identify appropriate locations for watermark insertion, and its control parameters add agility to the clustering results to thwart counterfeiting attacks. The scalability of the visually recognizable watermark is exploited to devise a robust weighted recovery method with a composite watermark. The proposed watermarking schemes can also be employed for oblivious detection. Unlike most oblivious watermarking schemes, their methods allow the use of a visually meaningful image as the watermark. For automation-friendly verification, a normalized correlation metric that suits the statistical properties of their methods is used. The experimental results demonstrated that the proposed techniques can survive several kinds of image processing attacks as well as JPEG lossy compression.
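    A normalized correlation between an original and an extracted binary watermark can be computed as in the sketch below. The bipolar mapping of bits to -1/+1 and the particular normalization are one common choice and are assumptions here; the exact metric used in [28] may differ.

```python
import math


def normalized_correlation(original, extracted):
    """Correlation of two equal-length watermarks after mapping bits to -1/+1."""
    a = [2 * bit - 1 for bit in original]
    b = [2 * bit - 1 for bit in extracted]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# Assumed toy watermark and a copy with one flipped bit.
mark = [1, 0, 1, 1, 0, 0, 1, 0]
damaged = [1, 0, 1, 1, 0, 0, 1, 1]
print(normalized_correlation(mark, mark))     # 1.0
print(normalized_correlation(mark, damaged))  # 0.75
```

    A verifier can compare this value against a decision threshold to accept or reject the presence of the watermark automatically, without a human inspecting the recovered image.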

    Tsai et al. [36] proposed a new intelligent audio watermarking method based on the characteristics of the human auditory system (HAS) and neural network techniques in the DCT domain. The method makes the watermark imperceptible by using the audio masking characteristics of the HAS. Moreover, the method exploits a neural network to memorize the relationships between the original audio signals and the watermarked audio signals. Therefore, the method is capable of extracting watermarks without the original audio signals. Finally, experimental results are included to illustrate that the method is significantly robust against common attacks for the copyright protection of digital audio.


    6 Computational Intelligence in Content-Based Multimedia Indexing and Retrieval

    There is a growing number of applications that extensively use visual media. A key requirement in those applications is efficient access to the stored multimedia information for the purposes of indexing, fast retrieval, and scene analysis. The amount of multimedia content available to the public and to researchers has been growing rapidly in the last decades and is expected to increase exponentially in the years to come. This development puts great emphasis on automated content-based retrieval methods, which retrieve and index multimedia based on its content. Such methods, however, suffer from a serious problem: the semantic gap, i.e., the wide gulf between the low-level features used by computer systems and the high-level concepts understood by human beings.

    Mats et al. [46] proposed a method of content-based multimedia retrieval of objects with visual, aural and textual properties. In their method, training examples of objects belonging to a specific semantic class are associated with their low-level visual descriptors (such as MPEG-7) and textual features such as frequencies of significant keywords. A fuzzy mapping of a semantic class in the training set to a class of similar objects in the test set is created by using Self-Organizing Maps (SOMs) trained from automatically extracted low-level descriptors. The authors performed several experiments with different textual features to evaluate the potential of their approach in bridging the gap from visual features to semantic concepts by the use of textual presentations. Their initial results show a promising increase in retrieval performance. The PicSOM [45] content-based information retrieval (CBIR) system was used with video data and semantic classes from the NIST TRECVID 2005 evaluation set. The TRECVID set contains TV broadcasts in different languages and textual data acquired by using automatic speech recognition software and machine translation where appropriate. Both the training and evaluation sets were accompanied by verified semantic ground truth sets, such as videos depicting explosions or fire.

    In the PicSOM system, the videos and the parts extracted from them were arranged as hierarchical trees, as shown in Fig. 8, with the main video as the parent object and the different extracted media types as child objects. In this way, the relevance assessments can be transferred between related objects in the PicSOM algorithm. From each media type different features were extracted, and Self-Organizing Maps were trained from these, as shown with some examples in Fig. 8.

    Fig. 8. The hierarchy of videos and examples of multi-modal SOMs [46]

    Ming Li and Tong Wang [22] presented a new image retrieval technique based on concept lattices, named Concept Lattices-Based Image Retrieval, in which lattice browsing allows one to reach a group of images via one path. Because constructing the concept lattices by a general method can produce many redundant attributes, the authors also proposed a method of attribute reduction of concept lattices based on the discernibility matrix and boolean calculation to reduce the context of the concept lattices. The scale of the problem is reduced by using this method. At the same time, the efficiency of image retrieval is improved, which is reflected in the experiments.

    Fuzzy set methods have already been applied to the representation of flexible queries and to the modeling of uncertain pieces of information in database systems, as well as in information retrieval. This methodology seems to be even more promising in multimedia databases, which have a complex structure and from which documents have to be retrieved and selected not only from their contents, but also from the idea the user has of their appearance, through queries specified in terms of the user's criteria. Dubois et al. [14] provided a preliminary investigation of the potential applications of fuzzy logic in multimedia databases. The problem of comparing semi-structured documents is first discussed. Querying issues are then emphasized more particularly. They distinguish two types of request, namely, those which can be handled within some extended version of an SQL-like language and those for which one has to elicit the user's preferences through examples.

    Hassanien and Jafar [8] presented an application of rough sets to feature reduction, classification and retrieval for image databases in the framework of content-based image retrieval systems. The presented description of rough set theory emphasizes the role of reducts in statistical feature selection, data reduction and rule generation in image databases. A key feature of the introduced approach is that segmentation and detailed object representation are not required. In order to obtain better retrieval results, the image texture features can be combined with the color features to form a powerful discriminating feature vector for each image. Texture features are extracted from the co-occurrence matrix, represented, and normalized in an attribute vector; the rough set dependency rules are then generated directly from the real-valued attribute vector. The rough set reduction technique is then applied to find all reducts of the data, which contain the minimal subsets of attributes that are associated with a class label for classification. A new similarity distance measure based on rough sets was presented. The classification and retrieval performance is measured using the recall-precision measure, as is standard in content-based image retrieval systems. Figure 9 illustrates the image classification and retrieval scheme based on the rough set theory framework (see also [114]).
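    Texture features derived from a co-occurrence matrix can be illustrated with the following sketch. The pixel offset, the number of gray levels, and the two descriptors computed (contrast and energy) are assumptions chosen for illustration; the specific features used in [8] are not listed here.

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def texture_features(glcm):
    """Two common descriptors: contrast and energy."""
    levels = len(glcm)
    contrast = sum((i - j) ** 2 * glcm[i][j]
                   for i in range(levels) for j in range(levels))
    energy = sum(p * p for row in glcm for p in row)
    return contrast, energy


# Assumed toy image with two gray levels.
image = [[0, 0, 1],
         [1, 1, 0]]
glcm = cooccurrence_matrix(image, levels=2)
print(glcm)
print(texture_features(glcm))
```

    Such descriptors, computed for several offsets and normalized, would form the real-valued attribute vector from which the rough set rules are generated.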

    Fig. 9. CBIR in rough sets frameworks [8]

    Chen and Wang [113] proposed a fuzzy logic approach, UFM (Unified Feature Matching), for region-based image retrieval. In their retrieval system, an image is represented by a set of segmented regions, each of which is characterized by a fuzzy feature (fuzzy set) reflecting color, texture, and shape properties. As a result, an image is associated with a family of fuzzy features corresponding to regions. Fuzzy features naturally characterize the gradual transition between regions (blurry boundaries) within an image and incorporate the segmentation-related uncertainties into the retrieval algorithm. The resemblance of two images is then defined as the overall similarity between two families of fuzzy features and quantified by a similarity measure, the UFM measure, which integrates properties of all the regions in the images. Compared with similarity measures based on individual regions and on all regions with crisp-valued feature representations, the UFM measure greatly reduces the influence of inaccurate segmentation and provides a very intuitive quantification. The UFM has been implemented as a part of the authors' experimental image retrieval system. The performance of the system was illustrated using examples from an image database of about 60,000 general-purpose images.

    As digital video databases become more and more pervasive, finding video in large databases becomes a major problem. Because of the nature of video (streamed objects), accessing the content of such databases is inherently a time-consuming operation. Kulkarni [23] proposed a neural-fuzzy approach for retrieving a specific video clip from a video database. Fuzzy logic was used for expressing queries in terms of natural language, and a neural network was designed to learn the meaning of these queries. The queries were designed based on features such as color and texture of shots, scenes and objects in video clips. An error backpropagation algorithm was proposed to learn the meaning of queries in fuzzy terms such as very similar, similar and somewhat similar. Preliminary experiments were conducted on a small video database using different combinations of queries based on color and texture features along with a visual video clip; very promising results were achieved.
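As a rough illustration of how fuzzy query terms such as very similar, similar and somewhat similar might be expressed, the sketch below maps a normalized similarity score to membership degrees. The membership shapes and breakpoints are assumptions for illustration only; in the approach described above the meaning of the terms is learned by a neural network, which is not shown here.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def ramp_up(x, start, end):
    """Membership that rises linearly from 0 at start to 1 at end."""
    return min(1.0, max(0.0, (x - start) / (end - start)))


# Hypothetical breakpoints on a similarity score in [0, 1].
TERMS = {
    "somewhat similar": lambda s: triangular(s, 0.2, 0.4, 0.6),
    "similar": lambda s: triangular(s, 0.4, 0.6, 0.8),
    "very similar": lambda s: ramp_up(s, 0.6, 0.9),
}


def fuzzify(score):
    """Return the degree to which a score matches each linguistic term."""
    return {term: membership(score) for term, membership in TERMS.items()}


print(fuzzify(0.7))
```

A query such as "color very similar and texture similar" could then be evaluated by combining the corresponding degrees, for instance with a minimum operator.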

    7 Conclusions, Challenges and Future Directions

    During the last decades, multimedia processing has emerged as an important technology to generate content based on images, video, audio, graphics, and text. Furthermore, recent developments such as High Definition Multimedia content and interactive television will generate a huge volume of data and important computing problems connected with the creation, processing and management of multimedia content. Multimedia processing is a challenging domain for several reasons: it requires both high computation rates and memory bandwidth; it is a multirate computing problem; and it requires low-cost implementations for high-volume markets. The past years have witnessed a large number of interesting applications of various computational intelligence techniques, such as neural networks, fuzzy logic, evolutionary computing, swarm intelligence, reinforcement learning, rough sets, and a generalization of rough sets called near sets, to intelligent multimedia processing. Therefore, multimedia computing and communication is another challenging and fruitful area for CI to play crucial roles in resolving problems and providing solutions for multimedia image/audio/video processing that understand, represent and process the media, their segments, indexing and retrieval.

    Another challenge is to develop near sets-based methods, which offer a generalization of traditional rough set theory and an approach to classifying perceptual objects by means of features. Such methods could lead to new results useful in object recognition, particularly in multimedia problems such as classification and segmentation, as well as to an application of the near set approach in 2D and 3D interactive gaming with a vision system that learns and serves as the backbone for an adaptive telerehabilitation system for patients with finger, hand, arm and balance disabilities. Each remote node in the telerehabilitation system includes a vision system that learns to track the behavior of a patient. Images deemed to be interesting (e.g., images representing erratic behavior) are stored as well as forwarded to a rehabilitation center for follow-up. In such a system, there is a need to identify images that are in some sense near images representing some standard or norm. This research has led to a study of methods of automating image segmentation as a first step in near set-based image processing.

    In recent years, a rapidly increasing demand for advanced interactive multimedia applications, such as video telephony, video games and TV broadcasting, has resulted in spectacular strides in the progress of wireless communication systems. However, these applications are always stringently constrained by current wireless system architectures because of the high data rates required for video transmission. To better serve this need, 4G broadband mobile systems are being developed and are expected to increase mobile data transmission rates and bring higher spectral efficiency, lower cost per transmitted bit, and increased flexibility of mobile terminals and networks. The new technology strives to eliminate the distinction between video over wireless and video over wireline networks. In the meantime, great opportunities are provided for proposing novel wireless video protocols and applications, and for developing advanced video coding and communications systems and algorithms for next-generation video applications that can take maximum advantage of 4G wireless systems. New video applications over 4G wireless systems are a challenge for CI researchers.

    The current third generation (3G) wireless systems and the next generation (4G) wireless systems in the development stages support higher bit rates. However, the high error rates and stringent delay constraints in wireless systems are still significant obstacles for these applications and services. On the other hand, the development of more advanced wireless systems provides opportunities for proposing novel wireless multimedia protocols and new applications and services that can take maximum advantage of the systems.

    In mobile ad hoc networks, specific intrusion detection systems are needed to safeguard them, since traditional intrusion prevention techniques are not sufficient for the protection of mobile ad hoc networks [1]. Therefore, intrusion detection is another challenging and fruitful area for CI to play crucial roles in resolving problems and providing solutions for intrusion detection systems, and in authenticating the maps produced by the application of intelligent techniques using watermarking, biometric and cryptology technologies.

    Combining different kinds of computational intelligence techniques in the application area of multimedia processing has become one of the most important directions of research in intelligent information processing. Neural networks show a strong ability to solve complex problems in many multimedia processing tasks. From the perspective of the specific rough sets approaches that need to be applied, explorations into possible applications of hybridizing rough sets with other intelligent systems such as neural networks [96], genetic algorithms, and fuzzy approaches to multimedia processing and pattern recognition, in particular to multimedia computing problems, could lead to new and interesting avenues of research and remain a challenge for CI researchers.

    In conclusion, many successful algorithms applied in multimedia processing have been reported in the literature, and the applications of rough sets in multimedia processing have to be analyzed individually. Rough set theory is a new approach to dealing with issues that cannot be addressed by traditional image processing algorithms or by other classification techniques. By introducing rough sets, algorithms developed for multimedia processing and pattern recognition often become more intelligent and robust, providing a human-interpretable, low-cost, sufficiently exact solution compared to other intelligence techniques.

    Finally, the main purpose of this article is to present to the CI and multimedia research communities the state of the art in CI applications to multimedia computing, and to inspire further research and development on new applications and new concepts in new trend-setting directions and in exploiting computational intelligence.


    References

    1. Abraham A., Jain R., Thomas J., and Han S.Y. (2007) D-SCIDS: Distributed soft computing intrusion detection systems. Journal of Network and Computer Applications, vol. 30, no. 1, pp. 81–98.

    2. Bishop C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

    3. Kohonen T. (1988) Self-Organization and Associative Memory. Springer, Berlin Heidelberg New York.

    4. Carpenter G. and Grossberg S. (1995) Adaptive Resonance Theory (ART). In: Arbib M.A. (ed.), The Handbook of Brain Theory and Neural Networks. MIT, Cambridge, pp. 79–82.

    5. Grossberg S. (1976) Adaptive pattern classification and universal recoding: Parallel development and coding of neural feature detectors. Biological Cybernetics, vol. 23, pp. 121–134.


    6. Abraham A. (2001) Neuro-fuzzy systems: State-of-the-art modeling techniques, connectionist models of neurons, learning processes, and artificial intelligence. In: Jose Mira and Alberto Prieto (eds.), Lecture Notes in Computer Science, vol. 2084, Springer, Berlin Heidelberg New York, pp. 269–276.

    7. Nguyen H.T. and Walker E.A. (1999) A First Course in Fuzzy Logic. CRC, Boca Raton.

    8. Hassanien A.E. and Jafar Ali (2003) Image classification and retrieval algorithm based on rough set theory. South African Computer Journal (SACJ), vol. 30, pp. 9–16.

    9. Hassanien A.E. (2006) Hiding iris data for authentication of digital images using wavelet theory. International Journal of Pattern Recognition and Image Analysis, vol. 16, no. 4, pp. 637–643.

    10. Hassanien A.E., Ali J.M., and Hajime N. (2004) Detection of spiculated masses in Mammograms based on fuzzy image processing. In: 7th International Conference on Artificial Intelligence and Soft Computing, ICAISC2004, Zakopane, Poland, 7–11 June. Lecture Notes in Artificial Intelligence, vol. 3070. Springer, Berlin Heidelberg New York, pp. 1002–1007.

    11. Fogel L.J., Owens A.J., and Walsh M.J. (1967) Artificial Intelligence Through Simulated Evolution. Wiley, New York.

    12. Fogel D.B. (1999) Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 2nd edition. IEEE, Piscataway, NJ.

    13. Pearl J. (1997) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco.

    14. Dubois D., Prade H., and Sèdes F. (2001) Fuzzy logic techniques in multimedia database querying: A preliminary investigation of the potentials. IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 3, pp. 383–392.

    15. Holland J. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.

    16. Goldberg D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading.

    17. Zlokolica V., Pižurica A., Philips W., Schulte S., and Kerre E. (2006) Fuzzy logic recursive motion detection and denoising of video sequences. Journal of Electronic Imaging, vol. 15, no. 2.

    18. Koza J.R. (1992) Genetic Programming. MIT, Cambridge, MA.

    19. Buscicchio C.A., Górecki P., and Caponetti L. (2006) Speech emotion recognition using spiking neural networks. In: Esposito F., Ras Z.W., Malerba D., and Semeraro G. (eds.), Foundations of Intelligent Systems, Lecture Notes in Computer Science, vol. 4203, Springer, Berlin Heidelberg New York, pp. 38–46.

    20. Ford R.M. (2005) Fuzzy logic methods for video shot boundary detection and classification. In: Tan Y.-P., Yap K.H., and Wang L. (eds.) Intelligent Multimedia Processing with Soft Computing, Studies in Fuzziness and Soft Computing, vol. 168, Springer, Berlin Heidelberg New York, pp. 151–169.

    21. Bäck T. (1996) Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, New York.

    22. Ming Li and Tong Wang (2005) An approach to image retrieval based on concept lattices and rough set theory. Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies, 5–8 Dec., pp. 845–849.


    23. Kulkarni S. (2004) Neural-fuzzy approach for content-based retrieval of digital video. Canadian Conference on Electrical and Computer Engineering, vol. 4, 2–5 May, pp. 2235–2238.

    24. Hui Fang, Jianmin Jiang, and Yue Feng (2006) A fuzzy logic approach for detection of video shot boundaries. Pattern Recognition, vol. 39, no. 11, pp. 2092–2100.

    25. Selouani S.-A. and O'Shaughnessy D. (2003) On the use of evolutionary algorithms to improve the robustness of continuous speech recognition systems in adverse conditions. EURASIP Journal on Applied Signal Processing, vol. 8, pp. 814–823.

    26. Lo C.-C. and Wang S.-J. (2001) Video segmentation using a histogram-based fuzzy C-means clustering algorithm. The 10th IEEE International Conference on Fuzzy Systems, vol. 2, 2–5 Dec., pp. 920–923.

    27. Cao X. and Suganthan P.N. (2002) Neural network based temporal video segmentation. International Journal of Neural Systems, vol. 12, no. 3–4, pp. 263–629.

    28. Chang C.-H., Ye Z., and Zhang M. (2005) Fuzzy-ART based adaptive digital watermarking scheme. IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 65–81.

    29. Diego Sal Diaz and Manuel Grana Romay (2005) Introducing a watermarking with a multi-objective genetic algorithm. Proceedings of the 2005 conference on Genetic and evolutionary computation, Washington DC, USA, pp. 2219–2220.

    30. Lou D.-C. and Yin T.-L. (2001) Digital watermarking using fuzzy clustering technique. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (Japan), vol. E84-A, no. 8, pp. 2052–2060.

    31. Maher El-arbi, Ben Amar, and C. Nicolas, H. (2006) Video watermarking based on neural networks. IEEE International Conference on Multimedia and Expo, Toronto, Canada, pp. 1577–1580.

    32. Der-Chyuan Lou, Jieh-Ming Shieh, and Hao-Kuan Tso (2005) Copyright protection scheme based on chaos and secret sharing techniques. Optical Engineering, vol. 44, no. 11, pp. 117004–117010.

    33. Wei Lu, Hongtao Lu, and FuLai Chung (2005) Subsampling-based robust watermarking using neural network detector. Advances in Neural Networks, ISNN 2005, Lecture Notes in Computer Science, vol. 3497, pp. 801–806.

    34. Cheng-Ri Piao, Sehyeong Cho, and Seung-Soo Han (2006) Color image watermarking algorithm using BPN neural networks. Neural Information Processing, Lecture Notes in Computer Science, vol. 4234, pp. 234–242.

    35. Zheng Liu, Xue Li, and Dong Z. (2004) Multimedia authentication with sensor-based watermarking. Proc. of the international workshop on Multimedia and security, Magdeburg, Germany, pp. 155–159.

    36. Hung-Hsu Tsai, Ji-Shiung Cheng, and Pao-Ta Yu (2003) Audio watermarking based on HAS and neural networks in DCT domain. EURASIP Journal on Applied Signal Processing, vol. 2003, no. 3, pp. 252–263.

    37. Cao L., Wang X., Wang Z., and Bai S. (2005) Neural network based audio watermarking algorithm. In: ICMIT 2005: Information Systems and Signal Processing, Wei Y., Chong K.T., Takahashi T. (eds.), Proceedings of the SPIE, vol. 6041, pp. 175–179.

    38. Alessandro Bugatti, Alessandra Flammini, and Pierangelo Migliorati (2002) Audio classification in speech and music: A comparison between a statistical and a neural approach. EURASIP Journal on Applied Signal Processing, vol. 2002, no. 4, pp. 372–378.

    39. Lim Ee Hui, Seng K.P., and Tse K.M. (2004) RBF Neural network mouth tracking for audiovisual speech recognition system. IEEE Region 10 Conference TENCON2004, 21–24 Nov., pp. 84–87.

    40. Jian Zhou, Guoyin Wang, Yong Yang, and Peijun Chen (2006) Speech emotion recognition based on rough set and SVM. 5th IEEE International Conference on Cognitive Informatics ICCI 2006, 17–19 July, vol. 1, pp. 53–61.

    41. Faraj M.-I. and Bigun J. (2007) Audiovisual person authentication using lip-motion from orientation maps. Pattern Recognition Letters, vol. 28, no. 11, pp. 1368–1382.

    42. Halavati R., Shouraki S.B., Eshraghi M., Alemzadeh M., and Ziaie P. (2004) A novel fuzzy approach to speech recognition. Hybrid Intelligent Systems. Fourth International Conference on Hybrid Intelligent Systems, 5–8 Dec., pp. 340–345.

    43. Eugene I. Bovbel and Dzmitry V. Tsishkou (2000) Belarussian speech recognition using genetic algorithms. Third International Workshop on Text, Speech and Dialogue, Brno, Czech Republic, pp. 185–204.

    44. Fellenz W.A., Taylor J.G., Cowie R., Douglas-Cowie E., Piat F., Kollias S., Orovas C., and Apolloni B. (2000) On emotion recognition of faces and of speech using neural networks, fuzzy logic and the ASSESS system. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 2, IJCNN 2000, pp. 93–98.

    45. Laaksonen J., Koskela M., and Oja E. (2002) PicSOM-Self-organizing image retrieval with MPEG-7 content descriptions. IEEE Transactions on Neural Networks, Special Issue on Intelligent Multimedia Processing, vol. 13, no. 4, pp. 841–853.

    46. Mats S., Jorma L., Matti P., and Timo H. (2006) Retrieval of multimedia objects by combining semantic information from visual and textual descriptors. Proceedings of 16th International Conference on Artificial Neural Networks (ICANN 2006), pp. 75–83, Athens, Greece, September 2006.

    47. Kostek B. and Andrzej C. (2001) Employing fuzzy logic and noisy speech for automatic fitting of hearing aid. 142 Meeting of the Acoustical Society of America, No. 5, vol. 110, pp. 2680, Fort Lauderdale, USA.

    48. Liu J., Wang Z., and Xiao X. (2007) A hybrid SVM/DDBHMM decision fusion modeling for robust continuous digital speech recognition. Pattern Recognition Letters, vol. 28, no. 8, pp. 912–920.

    49. Ing-Jr Ding (2007) Incremental MLLR speaker adaptation by fuzzy logic control. Pattern Recognition, vol. 40, no. 11, pp. 3110–3119.

    50. Andrzej C. (2003) Automatic identification of sound source position employing neural networks and rough sets. Pattern Recognition Letters, vol. 24, pp. 921–933.

    51. Andrzej C., Kostek B., and Henryk S. (2002) Diagnostic system for speech articulation and speech understanding. 144th Meeting of the Acoustical Society of America (First Pan-American/Iberian Meeting on Acoustics), Journal of the Acoustical Society of America, vol. 112, no. 5, Cancun, Mexico.

    52. Andrzej C., Andrzej K., and Kostek B. (2003) Intelligent processing of stuttered speech. Journal of Intelligent Information Systems, vol. 21, no. 2, pp. 143–171.


    53. Pawel Zwan, Piotr Szczuko, Bozena Kostek, and Andrzej Czyzewski (2007) Automatic singing voice recognition employing neural networks and rough sets. RSEISP 2007, pp. 793–802.

    54. Andrzej C. and Marek S. (2002) Pitch estimation enhancement employing neural network-based music prediction. Proc. IASTED Intern. Conference, Artificial Intelligence and Soft Computing, pp. 413–418, Banff, Canada.

    55. Hendessi F., Ghayoori A., and Gulliver T.A. (2005) A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM. ACM Transactions on Asian Language Information Processing (TALIP), vol. 4, no. 1, pp. 38–52.

    56. Orhan Karaali, Gerald Corrigan, and Ira Gerson (1996) Speech synthesis with neural networks. World Congress on Neural Networks, San Diego, Sept. 1996, pp. 45–50.

    57. Corrigan G., Massey N., and Schnurr O. (2000) Transition-based speech synthesis using neural networks. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 945–948.

    58. Shan Meng and Youwei Zhang (2003) A method of visual speech feature area localization. Proceedings of the International Conference on Neural Networks and Signal Processing, 2003, vol. 2, 14–17 Dec., pp. 1173–1176.

    59. Sun-Yuan Kung and Jenq-Neng Hwang (1998) Neural networks for intelligent multimedia processing. Proceedings of the IEEE Workshop on Neural Networks, vol. 86, no. 6, pp. 1244–1272.

    60. Frankel J., Richmond K., King S., and Taylor P. (2000) An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces. Proc. ICSLP, 2000.

    61. Schuller B., Reiter S., and Rigoll G. (2006) Evolutionary feature generation in speech emotion. IEEE International Conference on Recognition Multimedia, pp. 5–8.

    62. Lewis T.W. and Powers D.M.W., Audiovisual speech recognition using red exclusion and neural networks. Proceedings of the twenty-fifth Australasian conference on Computer science, vol. 4, Melbourne, Victoria, Australia, pp. 149–156.

    63. Nakamura S. (2002) Statistical multimodal integration for audiovisual speech processing. IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 854–866.

    64. Guido R.C., Pereira J.C., and Slaets J.F.W. (2007) Advances on pattern recognition for speech and audio processing. Pattern Recognition Letters, vol. 28, no. 11, pp. 1283–1284.

    65. Vahideh Sadat Sadeghi and Khashayar Yaghmaie (2006) Vowel recognition using neural networks. International Journal of Computer Science and Network Security (IJCSNS), vol. 6, no. 12, pp. 154–158.

    66. Hartigan J.A. and Wong M.A. (1979) Algorithm AS136: A K-means clustering algorithm. Applied Statistics, vol. 28, pp. 100–108.

    67. Henry C. and Peters J.F. (2007) Image pattern recognition using approximation spaces and near sets. In: Proceedings of Eleventh International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), Joint Rough Set Symposium (JRS 2007), Lecture Notes in Artificial Intelligence, vol. 4482, pp. 475–482.

    68. Kerre E. and Nachtegael M. (2000) Fuzzy techniques in image processing: Techniques and applications. Studies in Fuzziness and Soft Computing, vol. 52, Physica, Heidelberg.


    69. Lingras P. and West C. (2004) Interval set clustering of web users with rough K-means. Journal of Intelligent Information Systems, vol. 23, no. 1, pp. 5–16.

    70. Lingras P. (2007) Applications of rough set based K-means, Kohonen, GA Clustering. Transactions on Rough Sets, VII, pp. 120–139.

    71. Mitra Sushmita (2004) An evolutionary rough partitive clustering. Pattern Recognition Letters, vol. 25, pp. 1439–1449.

    72. Ng H.P., Ong S.H., Foong K.W.C., Goh P.S., and Nowinski W.L. (2006) Medical image segmentation using K-means clustering and improved watershed algorithm. IEEE Southwest Symposium on Image Analysis and Interpretation, pp. 61–65.

    73. Nachtegael M., Van-Der-Weken M., Van-De-Ville D., Kerre D., Philips W., and Lemahieu I. (2001) An overview of classical and fuzzy-classical filters for noise reduction. 10th International IEEE Conference on Fuzzy Systems FUZZ-IEEE 2001, Melbourne, Australia, pp. 3–6.

    74. Ning S., Ziarko W., Hamilton J., and Cercone N. (1995) Using rough sets as tools for knowledge discovery. In: Fayyad U.M. and Uthurusamy R. (eds.), First International Conference on Knowledge Discovery and Data Mining KDD95, Montreal, Canada, AAAI, pp. 263–268.

    75. Pawlak Z. (1991) Rough sets: Theoretical aspects of reasoning about data. Kluwer, Dordrecht.

    76. Pawlak Z., Grzymala-Busse J., Slowinski R., and Ziarko W. (1995) Rough sets. Communications of the ACM, vol. 38, no. 11, pp. 88–95.

    77. Polkowski L. (2003) Rough Sets: Mathematical Foundations. Physica, Heidelberg.

    78. Peters J.F. (2007) Near sets: Special theory about nearness of objects. Fundamenta Informaticae, vol. 75, no. 1–4, pp. 407–433.

    79. Peters J.F. (2007) Near sets. General theory about nearness of objects. Applied Mathematical Sciences, vol. 1, no. 53, pp. 2609–2629.

    80. Peters J.F., Skowron A., and Stepaniuk J. (2007) Nearness of objects: Extension of approximation space model. Fundamenta Informaticae, vol. 79, pp. 1–16.

    81. Peters J.F. (2007) Near sets. Toward approximation space-based object recognition. In: Yao Y., Lingras P., Wu W.-Z., Szczuka M., Cercone N., Slezak D. (eds.), Proc. of the Second Int. Conf. on Rough Sets and Knowledge Technology (RSKT07), Joint Rough Set Symposium (JRS07), Lecture Notes in Artificial Intelligence, vol. 4481, Springer, Berlin Heidelberg New York, pp. 22–33.

    82. Peters J.F. and Ramanna S. (2007) Feature selection: Near set approach. In: Ras Z.W., Tsumoto S., and Zighed D.A. (eds.) 3rd Int. Workshop on Mining Complex Data (MCD07), ECML/PKDD-2007, Lecture Notes in Artificial Intelligence, Springer, Berlin Heidelberg New York, in press.

    83. Peters J.F., Skowron A., and Stepaniuk J. (2006) Nearness in approximation spaces. In: Lindemann G., Schlingloff H. et al. (eds.), Proc. Concurrency, Specification & Programming (CS&P2006), Informatik-Berichte Nr. 206, Humboldt-Universität zu Berlin, pp. 434–445.

    84. Orlowska E. (1982) Semantics of vague concepts. Applications of rough sets. Institute for Computer Science, Polish Academy of Sciences, Report 469, 1982. See also Orlowska E., Semantics of vague concepts. In: Dorn G. and Weingartner P. (eds.), Foundations of Logic and Linguistics. Problems and Solutions, Plenum, London, 1985, pp. 465–482.


    85. Orlowska E. (1990) Verisimilitude based on concept analysis. Studia Logica, vol. 49, no. 3, pp. 307–320.

    86. Pawlak Z. (1981) Classification of objects by means of attributes. Institute for Computer Science, Polish Academy of Sciences, Report 429, 1981.

    87. Pawlak Z. (1982) Rough sets. International Journal of Computing and Information Sciences, vol. 11, pp. 341–356.

    88. Pawlak Z. and Skowron A. (2007) Rudiments of rough sets. Information Sciences, vol. 177, pp. 3–27.

    89. Peters J.F. (2008) Classification of perceptual objects by means of features. International Journal of Information Technology and Intelligent Computing, vol. 3, no. 2, pp. 1–35.

    90. Lockery D. and Peters J.F. (2007) Robotic target tracking with approximation space-based feedback during reinforcement learning. In: Proceedings of Eleventh International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), Joint Rough Set Symposium (JRS 2007), Lecture Notes in Artificial Intelligence, vol. 4482, pp. 483–490.

    91. Peters J.F., Borkowski M., Henry C., and Lockery D. (2006) Monocular vision system that learns with approximation spaces. In: Ella A., Lingras P., Slezak D., and Suraj Z. (eds.), Rough Set Computing: Toward Perception Based Computing, Idea Group Publishing, Hershey, PA, pp. 1–22.

    92. Peters J.F., Borkowski M., Henry C., Lockery D., Gunderson D., and Ramanna S. (2006) Line-crawling bots that inspect electric power transmission line equipment. Proc. 3rd Int. Conf. on Autonomous Robots and Agents 2006 (ICARA 2006), Palmerston North, NZ, 2006, pp. 39–44.

    93. Peters J.F. (2008) Approximation and perception in ethology-based reinforcement learning. In: Pedrycz W., Skowron A., and Kreinovich V. (eds.), Handbook on Granular Computing, Wiley, New York, Ch. 30, pp. 1–41.

    94. Peters J.F. and Borkowski M. (2004) K-means indiscernibility relation over pixels. Proc. 4th Int. Conf. on Rough Sets and Current Trends in Computing (RSCTC 2004), Uppsala, Sweden, 1–5 June, pp. 580–585.

    95. Peters J.F. and Pedrycz W. (2007) Computational intelligence. In: EEE Encyclopedia. Wiley, New York, in press.

    96. Peters J.F., Liting H., and Ramanna S. (2001) Rough neural computing in signal analysis. Computational Intelligence, vol. 17, no. 3, pp. 493–513.

    97. Peters J.F., Skowron A., Suraj Z., Rzasa W., Borkowski M. (2002) Clustering: A rough set approach to constructing information granules. Soft Computing and Distributed Processing. Proceedings of 6th International Conference, SCDP 2002, pp. 57–61.

    98. Petrosino A. and Salvi G. (2006) Rough fuzzy set based scale space transforms and their use in image analysis. International Journal of Approximate Reasoning, vol. 41, no. 2, pp. 212–228.

    99. Shankar B.U. (2007) Novel classification and segmentation techniques with application to remotely sensed images. Transactions on Rough Sets, vol. VII, LNCS 4400, pp. 295–380.

    100. Otto C.W. (2007) Motivating rehabilitation exercise using instrumented objects to play video games via a configurable universal translation peripheral, M.Sc. Thesis, Supervisors: Peters J.F. and Szturm T., Department of Electrical and Computer Engineering, University of Manitoba, 2007.


    101. Szturm T., Peters J.F., Otto C., Kapadia N., and Desai A. (2008) Task-specific rehabilitation of finger-hand function using interactive computer gaming, Archives for Physical Medicine and Rehabilitation, submitted.

    102. Sandeep Chandana and Rene V. Mayorga (2006) RANFIS: Rough adaptive neuro-fuzzy inference system. International Journal of Computational Intelligence, vol. 3, no. 4, pp. 289–295.

    103. Swagatam Das, Ajith Abraham, and Subir Kumar Sarkar (2006) A hybrid rough set-particle swarm algorithm for image pixel classification. Proceedings of the Sixth International Conference on Hybrid Intelligent Systems, 13–15 Dec., pp. 26–32.

    104. Bezdek J.C., Ehrlich R., and Full W. (1984) FCM: The fuzzy C-means clustering algorithm. Computers and Geosciences, vol. 10, pp. 191–203.

    105. Cetin O., Kantor A., King S., Bartels C., Magimai-Doss M., Frankel J., and Livescu K. (2007) An articulatory feature-based tandem approach and factored observation modeling. IEEE International Conference on Acoustics, Speech and Signal, ICASSP2007, Honolulu, HI, vol. 4, pp. IV-645–IV-648.

    106. Raducanu B., Grana M., and Sussner P. (2001) Morphological neural networks for vision based self-localization. IEEE International Conference on Robotics and Automation, ICRA2001, vol. 2, pp. 2059–2064.

    107. Ahmed M.N., Yamany S.M., Nevin M., and Farag A.A. (2003) A modified fuzzy C-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging, vol. 21, no. 3, pp. 193–199.

    108. Yan M.X.H. and Karp J.S. (1994) Segmentation of 3D brain MR using an adaptive K-means clustering algorithm. IEEE Conference on Nuclear Science Symposium and Medical Imaging, vol. 4, pp. 1529–1533.

    109. Voges K.E., Pope N.K.L.I., and Brown M.R. (2002) Cluster analysis of marketing data: A comparison of K-means, rough set, and rough genetic approaches. In: Abbas H.A., Sarker R.A., and Newton C.S. (eds.), Heuristics and Optimization for Knowledge Discovery, Idea Group Publishing, pp. 208–216.

    110. Chen C.W., Luo J.B., and Parker K.J. (1998) Image segmentation via adaptive K-mean clustering and knowledge-based morphological operations with biomedical applications. IEEE Transactions on Image Processing, vol. 7, no. 12, pp. 1673–1683.

    111. Davis K.J. and Najarian K. (2001) Maximizing strength of digital watermarks using neural networks. International Joint Conference on Neural Networks, IJCNN 2001, vol. 4, pp. 2893–2898.

    112. Sankar K. Pal (2001) Fuzzy image processing and recognition: Uncertainties handling and applications. International Journal of Image and Graphics, vol. 1, no. 2, pp. 169–195.

    113. Yixin Chen and James Z. Wang (2002) A region-based fuzzy feature matching approach to content-based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1252–1267.

    114. Yu Wang, Mingyue Ding, Chengping Zhou, and Ying Hu (2006) Interactive relevance feedback mechanism for image retrieval using rough set. Knowledge-Based Systems, vol. 19, no. 8, pp. 696–703.

    115. Zadeh L.A. (1965) Fuzzy sets. Information and Control, vol. 8, pp. 338–353.

    116. Zbigniew W. (1987) Rough approximation of shapes in pattern recognition. Computer Vision, Graphics, and Image Processing, vol. 40, no. 2, pp. 228–249.

  • Computational Intelligence in Multimedia Networking and Communications: Trends and Future Directions

    Parthasarathy Guturu

    Electrical Engineering Department, University of North Texas, Denton, TX 76207-7102, USA
    guturu@unt.edu

    This paper presents a review of the current literature on computational intelligence based approaches to various problems in multimedia networking and communications, such as call admission control, management of resources and traffic, routing, multicasting, media composition, encoding, media streaming and synchronization, and on-demand servers and services. Challenges to be addressed and future directions of research are also presented.

    1 Introduction

    We currently live in an age of information revolution. With high-impact applications launched every day in various fields such as e-commerce, entertainment, education, medicine, defense, and homeland security, there has been an explosive growth in the demand for exchange of various forms of information (text, graphics, audio, video, etc.), collectively termed multimedia. The colossal amounts of multimedia data that need to be transmitted over the Internet, in turn, necessitate smart multimedia communication methods with capabilities to manage resources effectively, reason under uncertainty, and handle imprecise or incomplete information. To this end, many multimedia researchers in recent times have developed computational intelligence (CI) based methods for various aspects of multimedia communications. The objective of this book chapter is to present to the multimedia research community the state of the art in these CI applications to multimedia communications and networking, and to motivate research in new trend-setting directions. Hence, we review in the following sections some representative CI methods for quality of service (QoS) provisioning by call/connection admission control, adaptive allocation of resources, and traffic management. Some important contributions to multicast routing, multimedia composition, streaming and media synchronization, and multimedia services/servers are also surveyed. Most of the methods available in the current literature are either fuzzy or neural network based, though some papers adopt a hybrid approach using neuro-fuzzy controllers. A few papers present genetic/evolutionary methods for problems in multimedia communications. From these applications, it appears that the various computational intelligence frameworks are not competitive, but rather complementary. For the sake of completeness, we present a brief review of the computational intelligence paradigm in the following subsection.

    P. Guturu: Computational Intelligence in Multimedia Networking and Communications: Trends and Future Directions, Studies in Computational Intelligence (SCI) 96, 51–76 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    1.1 Computational Intelligence Paradigm

    According to Wikipedia, the free online encyclopedia, computational intelligence (CI) is a branch of artificial intelligence (AI) that combines elements of learning, adaptation, evolution and fuzzy logic (as well as rough sets) to create programs equipped with intelligence to solve problems effectively. It uses meta-heuristic algorithms and strategies such as statistical learning machines, fuzzy systems, neural networks, evolutionary computation, swarm intelligence, artificial immune systems, etc. In contrast, traditional AI (or GOFAI, i.e., good old-fashioned artificial intelligence, the term coined by John Haugeland, professor of philosophy at the University of Chicago) relies on symbolic approaches. In this subsection, we present an overview of only those CI techniques that have been used in the multimedia communication and network research documents cited in the present survey.

    Neural Networks

    An artificial neural network (ANN), or simply neural network (NN), is an interconnected set of simple nonlinear processing elements called neurons because their role is similar to that of neurons in a biological system. The neurons in an ANN take inputs from either the external environment or other neurons in the system. The neuronal outputs may similarly be transmitted to either other neurons (through interconnection weights) or the external environment. The neurons that take inputs from and send outputs to other neurons exclusively are called hidden neurons. These hidden neurons have been found to be pivotal to the learning of complex input–output mappings. The methods for adapting inter-neuron weights, based on the observed outputs, to obtain desired outputs are called NN training or learning methods. The NN interconnection patterns are called topologies. The most popular NN topology and the associated learning algorithm are the feed-forward neural network (FFNN) and the back-propagation learning (BPL) algorithm, respectively. The FFNN is also known as the multi-layer perceptron (MLP). In an FFNN, neurons are arranged into multiple layers consisting of an input layer, an output layer, and one or more hidden layers, with unidirectional inter-layer neuronal connections (weights) from the input through to the output layer, as shown in Fig. 1. Determination of the inter-layer connection topologies, the number of hidden layers, and the number of neurons in each of them, based on the problem being solved, are open research issues.

    Fig. 1. A typical four-layer feed-forward neural network (diagram not reproduced; the layers between input and output are labeled "Hidden Layers")

    Still, simple three-layer FFNNs with total inter-connectivity between neurons in consecutive layers, as shown in the figure, have been successfully applied to multimedia and other applications where system adaptability and the capability to learn complex functional dependencies of outputs on inputs are of paramount importance. The standard BPL algorithm used for training the FFNN interconnection weights is a supervised learning algorithm (i.e., one with a training set of input–output pairs). In this algorithm, the errors are first computed at the output layer as the differences between the desired and observed outputs for training sample inputs, and then the inter-neuronal connection weights from the neurons in the layer preceding the output layer to those in the output layer are updated (using mathematical formulae) to produce the desired outputs. The errors in the outputs of the previous-stage neurons are also similarly computed, and the process of computing the weights and the neuron outputs is repeated for the different layers in the FFNN, proceeding in the backward direction until the input layer is reached. A detailed discussion of FFNNs and the BPL may be found in [1].
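    The backward pass described above can be illustrated with a minimal sketch. It assumes sigmoid neurons, a squared-error measure, batch weight updates, and the XOR mapping as a stand-in training set; none of these choices, nor the function names, come from the chapter or from [1].

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in, n_hidden, seed=0):
    """Random initial weights; the last weight of each neuron is its bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(x, w_hidden, w_out):
    """Forward pass through one hidden layer and a single output neuron."""
    xb = list(x) + [1.0]                       # inputs plus constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]                        # hidden outputs plus bias input
    out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return xb, hb, out


def mean_squared_error(data, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[2]) ** 2 for x, t in data) / len(data)


def train(data, w_hidden, w_out, epochs=2000, lr=0.5):
    """Batch back-propagation; returns updated copies of the weights."""
    w_hidden = [row[:] for row in w_hidden]
    w_out = w_out[:]
    for _ in range(epochs):
        grad_hidden = [[0.0] * len(row) for row in w_hidden]
        grad_out = [0.0] * len(w_out)
        for x, target in data:
            xb, hb, out = forward(x, w_hidden, w_out)
            # Output-layer error: desired minus observed, scaled by the sigmoid slope.
            delta_out = (target - out) * out * (1.0 - out)
            for j in range(len(w_out)):
                grad_out[j] += delta_out * hb[j]
            # Propagate the error backward to each hidden neuron.
            for j in range(len(w_hidden)):
                delta_h = delta_out * w_out[j] * hb[j] * (1.0 - hb[j])
                for i in range(len(xb)):
                    grad_hidden[j][i] += delta_h * xb[i]
        for j in range(len(w_out)):
            w_out[j] += lr * grad_out[j]
        for j in range(len(w_hidden)):
            for i in range(len(w_hidden[j])):
                w_hidden[j][i] += lr * grad_hidden[j][i]
    return w_hidden, w_out


# Hypothetical training set: the XOR input-output pairs.
XOR = [([0, 0], 0.0), ([0, 1], 1.0), ([1, 0], 1.0), ([1, 1], 0.0)]
w_h0, w_o0 = init_weights(n_in=2, n_hidden=3, seed=0)
w_h1, w_o1 = train(XOR, w_h0, w_o0)
initial_error = mean_squared_error(XOR, w_h0, w_o0)
final_error = mean_squared_error(XOR, w_h1, w_o1)
```

    Comparing `initial_error` with `final_error` shows how far the weight updates have reduced the output error on the training pairs; how close the network gets to the exact mapping depends on the seed, the learning rate, and the number of epochs.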

    A recurrent neural network is a generalized neural network in which bidirectional asymmetric interneuronal connections are possible; it does not need to have a layered organization of neurons. A recurrent NN training algorithm, which is similar to the BPL (because of almost the same mathematical formulae for updating interneuronal weights) and hence known as the recurrent back-propagation (RBP) algorithm, was proposed independently by Almeida [2] and Pineda [3]. A special form of recurrent NN is the Hopfield neural net (HNN) [4], which uses binary threshold gates as processing elements (neurons), a totally connected network topology, and symmetric interneuronal connection weights. An HNN may be configured to find the local optima (minima) of criterion functions in some problems if those functions can be cast in the form of the following energy function, related to the Ising model [5] in physics:


    $$E = -\frac{1}{2} \sum_{i} \sum_{j \neq i} w_{ij}\, s_i\, s_j + \sum_{i} \theta_i\, s_i \qquad (1)$$
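    As an illustration of how such a network settles into a local minimum, the sketch below assumes the standard Hopfield energy, E = -1/2 Σ_{i≠j} w_ij s_i s_j + Σ_i θ_i s_i, with states s_i in {-1, +1}, symmetric weights w_ij, and thresholds θ_i. The symbol names, the weight matrix, and the thresholds are assumptions chosen for the example, not values from the chapter.

```python
def energy(w, theta, s):
    """E = -1/2 * sum_{i != j} w[i][j]*s[i]*s[j] + sum_i theta[i]*s[i]."""
    n = len(s)
    pair = sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n) if i != j)
    return -0.5 * pair + sum(theta[i] * s[i] for i in range(n))


def settle(w, theta, s, max_sweeps=100):
    """Asynchronous threshold updates until no neuron changes state."""
    s = list(s)
    energies = [energy(w, theta, s)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            # Net input to neuron i from all other neurons, minus its threshold.
            field = sum(w[i][j] * s[j] for j in range(len(s)) if j != i) - theta[i]
            new_state = 1 if field >= 0 else -1
            if new_state != s[i]:
                s[i] = new_state
                changed = True
                energies.append(energy(w, theta, s))
        if not changed:
            break
    return s, energies


# Hypothetical symmetric weights (zero diagonal) and thresholds.
W = [[0.0, 1.0, -2.0, 0.5],
     [1.0, 0.0, 0.5, -1.0],
     [-2.0, 0.5, 0.0, 1.0],
     [0.5, -1.0, 1.0, 0.0]]
THETA = [0.0, 0.5, 0.0, -0.5]
final_state, energy_trace = settle(W, THETA, [1, 1, 1, 1])
```

    With symmetric weights, each state change under this update rule cannot raise the energy, so `energy_trace` is non-increasing and the loop stops at a stable state.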



    Reinforcement Learning

    Mathematically, the reinforcement learning (RL) [8] system model is a triplet M_RL = {S, A, R}, where S is the set of states of a problem environment, A is the set of actions that can be taken by an agent seeking to solve the problem, and R is the set of scalar rewards associated with an action and the current state of the system. In the RL paradigm, an agent perceives, at each time instant t, the current state s_t (∈ S) of the environment and the set of actions A(s_t) ⊆ A that can be taken based on that state, and chooses an action a ∈ A(s_t). The chosen action results in a reward r ∈ R, and drives the environment into a new state s_{t+1}. An RL formulation seeks to determine the optimal policy (or the series of actions the agent needs to take) to maximize the total reward. In this formulation, there is no concept of supervised learning of optimal system parameters by means of corrective actions based on the errors in the observed system outputs for a given training set of input–output pairs. Instead, the choice of actions is aided by the finite-state Markov decision process (MDP) [9] model of the environment. Even though it is not necessary to use ANNs to implement an RL formulation, they are usually included as a part of the solution.
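    The state, action, and reward loop can be made concrete with a small tabular sketch. It uses Q-learning, one common RL algorithm chosen here for illustration, on a hypothetical four-state chain; the environment, the discount factor, and the learning rate are assumptions rather than anything prescribed in [8] or [9].

```python
import random

N_STATES = 4                      # states 0..3; state 3 is the terminal goal
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = min(state + 1, N_STATES - 1) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)          # purely exploratory behavior
            nxt, reward = step(state, action)
            # Value of the best action in the next state (zero at the goal).
            best_next = 0.0 if nxt == N_STATES - 1 else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


def greedy_policy(q):
    """Action with the highest learned value in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


q_table = q_learning()
policy = greedy_policy(q_table)
```

    No desired outputs are supplied anywhere in the sketch; the policy in `policy` emerges only from the rewards observed while exploring.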

    Fuzzy Logic Based Intelligent Control

    In any control system, the actions to control some aspect of system performance (e.g., congestion control, maximum end-to-end delay in the network) are based on the system inputs (e.g., message packet loss, link delay). However, for robust control, the system needs to be capable of managing the uncertainties in the system environment and imprecisions in the system input measurements. A mathematical formalism useful for the design of such robust systems was pioneered by Zadeh [10], and was first applied by Mamdani [11] to control system design. It is variously known as fuzzy set theory or fuzzy logic. The logic variables (e.g., congestion) in fuzzy set theory do not take crisp binary (false or true) values, but take continuous values (called membership values) in the range [0,1]. Another class of fuzzy logic variables that capture our vague, ambiguous, qualitative or subjective view of the world are termed linguistic variables. They may be loosely defined as variables that take graded membership or simply linguistic values (e.g., high, medium, low, rough, smooth, etc.). Modern fuzzy control systems use a set of fuzzy linguistic rules of the form given below to derive inferences about the output variables from the input variables and use the output estimates so obtained for control:

    If packet loss is low, and network delay is high, network congestion is medium.

    Since such rules are gathered from experts in the field, such control systems come under the class of Fuzzy Expert Systems. Each rule (proposition) in the rule base of the fuzzy control system is associated with a membership value (degree of truth) in the range [0,1]. The membership values of the various applicable propositions are aggregated by a properly designed inference engine in a fuzzy logic system, and the system outputs are estimated. Figure 2 depicts the block diagram of a typical fuzzy logic based system for obtaining the control parameters with the problem state vector as its input.

    Fig. 2. Block diagram of a fuzzy logic system

    The fuzzifier module in the system converts the crisp input values into linguistic values such as high and medium so that the inference engine can generate the fuzzy values for the output parameters using rules such as the one indicated above from the rule base. When more than one rule is applicable, the membership values of the different rules are aggregated to obtain a consistent estimate. The defuzzifier then converts the fuzzy values of the output variables into crisp values. In this article, a few neuro-fuzzy applications to multimedia communications are also presented. These methods typically use neural networks for learning both the rules and the membership functions associated with the rules.
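    The fuzzifier, rule base, inference engine, and defuzzifier stages can be sketched in a few lines. The sketch assumes inputs normalized to [0, 1], linear membership functions, min() as the fuzzy AND, and a weighted-average defuzzifier over fixed output centers. Only the rule "loss low and delay high gives medium congestion" is taken from the text; the other three rules and all numeric values are hypothetical.

```python
def low(x):
    """Membership of a normalized input (0..1) in the linguistic value 'low'."""
    return max(0.0, min(1.0, 1.0 - x))


def high(x):
    """Membership of a normalized input (0..1) in the linguistic value 'high'."""
    return max(0.0, min(1.0, x))


# Rule base: (membership for packet loss, membership for delay, output center).
# Only the second rule appears in the text; the others are illustrative.
RULES = [
    (low, low, 0.1),     # loss low  and delay low  -> congestion low
    (low, high, 0.5),    # loss low  and delay high -> congestion medium
    (high, low, 0.5),    # loss high and delay low  -> congestion medium
    (high, high, 0.9),   # loss high and delay high -> congestion high
]


def estimate_congestion(packet_loss, delay):
    """Fuzzify, fire every rule with min() as AND, defuzzify by weighted average."""
    strengths = [min(mu_loss(packet_loss), mu_delay(delay))
                 for mu_loss, mu_delay, _ in RULES]
    total = sum(strengths)
    if total == 0.0:
        return 0.0                      # no rule applies
    weighted = sum(s * center for s, (_, _, center) in zip(strengths, RULES))
    return weighted / total


calm = estimate_congestion(0.1, 0.1)
mixed = estimate_congestion(0.1, 0.9)
busy = estimate_congestion(0.9, 0.9)
```

    Because several rules fire at once with different strengths, the crisp output moves smoothly between the rule centers instead of jumping at a hard threshold.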

    Rough Sets

    Rough set theory (RST) is another approximate reasoning formalism developed for handling imprecision. Here, the values of a set of attributes are represented by two sets: one for the lower and the other for the upper approximation of the original crisp set. Even though the upper and lower sets in the original formalism of Pawlak [12] are crisp sets, they could as well be fuzzy sets. A rough set inference mechanism similar to fuzzy inferencing could be used for estimation of system parameters. RST uses an information system framework (ISF) I, which may be defined as a tuple (O, A), where O is a non-empty set of objects and A is a set of their attributes. In the ISF, an information table maps the value tuples (of the attributes) onto the objects. Two objects are defined to be discernible if they can be distinguished based on their value tuples. The complementary indiscernibility relation between objects induces an equivalence partition among the objects. If each class of the partition so obtained is a singleton, then every object of the system can be distinguished using the given set of attributes. Now, when we consider only a subset of the attribute set A, a target set T (⊆ O) of objects may not be expressible exactly because some objects in T could be indiscernible from objects outside T. Hence a rough set involving an upper approximation (the set of objects possibly in the target) and a lower approximation (the set of objects positively in the target) may be used to represent the target. From the rough sets so constructed, it is possible to obtain reduct subsets of attributes, that is, subsets of attributes, each of which induces the same equivalence partition on O as the original set A. The reduct subset is not unique, because different subsets of A could induce the same equivalence partition. The intersection of such reduct subsets gives the core (or indispensable) set of attributes of the information system I. Similarly, when the union of all reduct sets is removed from A, we get the set of superfluous attributes. Thus the rough set is a useful tool for capturing the knowledge represented in the information system with fewer attributes.
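    A small worked example may help fix the definitions of the partition and the two approximations. The information table, attribute names, and target set below are hypothetical; the computation follows the usual definitions, in which the lower approximation is the union of the equivalence classes contained in the target and the upper approximation is the union of the classes that intersect it.

```python
def equivalence_classes(table, attributes):
    """Group objects that share the same value tuple on the given attributes."""
    classes = {}
    for obj, values in table.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())


def approximations(table, attributes, target):
    """Return (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attributes):
        if cls <= target:          # class lies entirely inside the target
            lower |= cls
        if cls & target:           # class overlaps the target
            upper |= cls
    return lower, upper


# Hypothetical information table: four connections, two attributes each.
TABLE = {
    "c1": {"loss": "low", "delay": "high"},
    "c2": {"loss": "low", "delay": "high"},
    "c3": {"loss": "high", "delay": "high"},
    "c4": {"loss": "high", "delay": "low"},
}
TARGET = {"c1", "c3"}              # e.g. connections judged to be congested

lower_full, upper_full = approximations(TABLE, ["loss", "delay"], TARGET)
lower_delay, upper_delay = approximations(TABLE, ["delay"], TARGET)
```

    Here c1 and c2 are indiscernible, so even the full attribute set only places c3 positively in the target; dropping the loss attribute leaves no object positively in the target at all.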

    Dubois [13] extended the formalism of RST by introducing rough fuzzy sets and fuzzy rough sets. Among the applications of RST, the RST-based approaches proposed by Stefanowski [14] for induction of decision rules, and Ziarko's [15] rough set methodology for data mining, are worth mentioning.

    Evolutionary Computation

    Evolutionary computation (EC) is the generic name for a number of allied biology-inspired technologies such as evolutionary programming (EP) [16], genetic algorithms (GAs) [17], evolution strategies (ES) [18, 19], and genetic programming (GP) [20]. The goal of an EC algorithm is to find a quasi-optimal solution to a problem by mimicking genetic evolutionary processes. Particularly when the optimality measure on a large set of variables characterizing the problem solution turns out to be a non-convex multi-modal function, an exhaustive search for an optimal solution is ruled out because of exponential search complexity. In this situation, the EC approach is an effective strategy for intelligent exploration of the search space to find near-optimal solutions. In this approach, the search starts with an initial population of candidate solutions, each represented by a vector of randomly chosen values for the problem solution variables. Based on an analogy between the process of obtaining an optimal solution and genetic evolution, the solution vector may be considered the equivalent of a chromosome, with the individual components of the vector representing the genes. The optimality measures (or equivalently, the fitness functions) of the individual candidate solutions in any generation, including the initial one, may be computed using the functional form of the measure, and the solutions may be ordered based on their fitness values. The candidate solutions for the next generation (offspring) may then be obtained by applying a crossover operation to parent solutions chosen from the population of the current generation, usually by the so-called elitist strategy, that is, the strategy of choosing the participants of the crossover randomly from a select few members (those with the highest fitness values) of the population of the current generation. The traditional crossover is based on the concept of exchange of genetic material between parents (without many constraints), but one can also design a new crossover mechanism tailored to the optimization problem being dealt with. Crossover points are also chosen randomly. Another very important genetic operator used in evolutionary algorithms (EAs), next to the crossover discussed above, is mutation. It simulates genetic mutation by replacing the value of a randomly chosen component of the solution vector with a new value from the set of values admitted for the component. By producing successive generations of new populations through selective replacement of the members of an old generation by the fittest among the new members (offspring) produced with the help of the two operators discussed above, genetic evolution may be continued for a number of generations to obtain solutions closer and closer to optimality. Problem representation, design of crossover and mutation operators, strategies for replacement of the members of the old population with new members, and the optimal choice of EC parameters such as population size, number of generations for evolution, etc. are open research issues in this area.
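    The generational loop described above can be sketched as follows. The sketch assumes bit-string chromosomes, a toy fitness function that counts 1-genes, single-point crossover between parents drawn from an elite group, and replacement that keeps the fittest of the old population and the offspring. The parameter values are arbitrary, and the function names are hypothetical.

```python
import random


def fitness(chromosome):
    """Toy optimality measure (an assumption): the number of 1-genes."""
    return sum(chromosome)


def evolve(n_genes=20, pop_size=12, n_elite=4, generations=30, p_mut=0.2, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        elite = population[:n_elite]               # the fittest few may become parents
        offspring = []
        for _ in range(pop_size):
            mother, father = rng.sample(elite, 2)
            cut = rng.randrange(1, n_genes)        # random crossover point
            child = mother[:cut] + father[cut:]
            if rng.random() < p_mut:
                # Mutation: overwrite one random gene with an admissible value.
                child[rng.randrange(n_genes)] = rng.randint(0, 1)
            offspring.append(child)
        # Selective replacement: keep the fittest of the old members and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


best, history = evolve()
```

    Because survivors are drawn from the union of the old population and the offspring, the best fitness recorded in `history` never decreases from one generation to the next.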

    2 Call/Connection Admission Control

    Call admission control (CAC) is a mechanism to determine whether the resources requested by an incoming multimedia call can be reserved without adversely affecting the QoS requirements of the on-going calls. In ATM multimedia networks, this is tantamount to connection admission control (with the same abbreviation, CAC), which is a decision-making process to accept or reject a request for a new virtual path (VP) or virtual channel (VC) based on the anticipated traffic characteristics of the connection, the requested QoS, and the current state of the network. Traditional CAC schemes make use of various criteria for call/connection admission such as equivalent capacity or bandwidth requirements of various links, maximum allowable cell loss probability, network traffic load, etc. To address some deficiencies of these methods, such as failure to meet QoS requirements in heavy traffic conditions and the computational intensiveness of the call parameter estimation methods, a number of CI-based CAC methods have been proposed in the literature. In the sequel, we discuss a few representative methods.
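    For orientation, the simplest capacity-based admission test can be written as a one-line check. This is only a sketch of the traditional criterion, assuming that each call is summarized by a single equivalent bandwidth figure and that the link has a fixed capacity; the function name and the numbers are hypothetical.

```python
def admit_call(ongoing_bandwidths, requested_bandwidth, link_capacity):
    """Accept only if the new call fits without shrinking existing reservations."""
    return sum(ongoing_bandwidths) + requested_bandwidth <= link_capacity


# Hypothetical equivalent bandwidths of the on-going calls (Mbit/s).
ongoing = [2.0, 1.5, 4.0]
accepted = admit_call(ongoing, 2.0, link_capacity=10.0)   # 9.5 <= 10.0
rejected = admit_call(ongoing, 3.0, link_capacity=10.0)   # 10.5 > 10.0
```

    The CI-based methods surveyed in this section replace this fixed arithmetic with learned or fuzzy estimates of the required bandwidth and of the resulting QoS.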

    One of the earliest CI-based approaches to multimedia CAC in ATM networks is due to Hiramatsu [21]. In this approach, he uses a three-layer feed-forward neural network (FFNN) with the standard back-propagation learning algorithm to obtain predicted service quality parameters and call acceptance/rejection decision values (such as predicted values for call arrival rate and cell loss rate, and the call rejection rate), with observed multiplexer status parameters such as cell arrival rate, cell loss rate (CLR), call generation rate, trunk utilization rate, and number of connected calls as the FFNN inputs. Simulation results indicate the adaptability of the proposed method in learning complex admission control decision policies. In [22], he addresses the problem of training neural networks with QoS parameters that range over many orders of magnitude (e.g., CLR ranging from 10^-12 to 1) using two methods: (i) training with a relative target, where the neural network is assumed to memorize the logarithm of the average of the K most recent monitored QoS values, and the new target is an updated average derived from a weighted summation of a new sample and the memorized QoS, and (ii) a virtual output buffer method, wherein a neural network is trained to accurately estimate the QoS for the actual buffer by incremental extrapolations using the data from smaller-capacity virtual buffers (a set of counters simulating an imaginary cell buffering process).

    Youssef, Habib, and Saadawi [23] propose a neurocomputing approach to CAC and bandwidth allocation in ATM networks. Their algorithm employs a hierarchical structure of a bank of small-sized parallel neural network (NN) units to efficiently calculate the bandwidth required to support multimedia traffic with multiple QoS requirements. Each NN unit is an FFNN trained, using the standard back-propagation algorithm, to learn the complex nonlinear function relating different traffic patterns and QoS with the corresponding received capacity. The NN controller also calculates the gain obtained from multiplexing multiple streams of traffic supported on separate virtual paths (i.e., class multiplexing) so as to enhance the statistical multiplexing gain. The authors use simulation results to show that their NN approach is more accurate in bandwidth calculations, and consequently in CAC decision-making, than two conventional approaches that use the stationary state approximation of the equivalent capacity method [24] and the class related rule [25], respectively.

    References [26] and [27] independently propose fuzzy approaches for estimation of the CLR, which is an important CAC parameter. In [26], Uehara and Hirota estimate the possibility distribution of the CLR as a function of the number of calls in different transmission rate classes. Starting with an initial fuzzy rule base for CLR estimation, successive generations of fuzzy inference rules are generated by incremental updates based on the CLR data observed from time to time. A back-propagation algorithm (unrelated to the neural network algorithm with the same name) is used for tuning the rule base with new data. Using fuzzy α-cut theory [28], self-compensation of CLR estimation errors is achieved, and then, by applying the latest rule base, an upper bound on the CLR estimate is obtained and used for CAC decision making. Bensaou et al. [27] propose a robust fuzzy-based algorithm to predict the CLR in ATM multiplexers, and use the CLR estimate so obtained for call admission control. Unlike many traditional approaches, their method does not presume any input traffic model or parameters, but employs the knowledge of a set of CLR values for small values of an independent variable (e.g., multiplexer buffer size or service capacity) of the CLR function, in conjunction with the knowledge of the asymptotic behavior of the function for larger values of the variable.

    In [29], Ren and Ramamurthy propose a dynamic CAC (for ATM multimedia networks) that employs fuzzy logic to combine a CAC based on the UPC (usage parameter control) model with one based on measured online traffic statistics for determining the dynamic equivalent bandwidth used in CAC decision making. Simulation results indicate that substantially improved system utilization can be achieved with the dynamic CAC compared to a model-based or a measurement-based CAC.

    Liang, Karnik, and Mendel [30] propose an interesting connection admission control algorithm for ATM networks that uses a type-2 fuzzy logic rule base incorporating the knowledge obtained from 30 network experts. In type-2 fuzzy logic, the membership value of each element in the fuzzy set is itself fuzzy. The type-2 fuzzy logic used in their system, in contrast to type-1 fuzzy logic, provides soft decision boundaries, and thereby permits a trade-off between CLR and bandwidth utilization.

    Cheng, Chang, and their coworkers were among the earliest to adopt a hybrid CI approach to call admission control. An IEEE journal article [31] and a US patent document [32] together present the details of their neural fuzzy CAC (NFCAC). The NFCAC takes in three linguistic inputs (available capacity, congestion indicator, and cell loss ratio) and outputs a decision signal to accept or reject a new call request. The fuzzy estimates of the available capacity and the congestion indicator are, in turn, produced by a fuzzy bandwidth estimator and a fuzzy congestion controller proposed in their earlier work [33]. The NFCAC is a five-layered feed-forward neural network with a two-phase hybrid learning algorithm. Construction of fuzzy rules and location of membership functions are done by a self-organized learning scheme in phase I, whereas optimal adjustment of membership functions for desired outputs is done by a supervised learning scheme in phase II. The authors show by means of simulation results that their NFCAC, despite the simplicity of its design, can satisfy the QoS requirements and still achieve higher system utilization and learning speed compared to a traditional effective-bandwidth-based CAC [34], and the fuzzy-logic-based [33] and neural-net-based [35] CACs proposed by them earlier.

    In [36], Chatovich, Oktug, and Dundar propose a hierarchical neural-fuzzy connection admission controller for ATM multimedia networks. This CAC is based on Berenji and Khedkar's GARIC (Generalized Approximate Reasoning-based Intelligent Controller) architecture [37], which includes two cooperating neural networks: one called the ASN (Action Selection Network), for implementing fuzzy inference rules initially acquired from an expert, and the other called the AEN (Action Evaluation Network), for performance evaluation and fine tuning of the former by the reinforcement learning approach. The ASN is organized as a hierarchical structure that combines three sub-controllers, one for each of the three system variables (CLR, queue size, and link utilization), and comes up with the final decision by weighted aggregation of the decisions of the three sub-controllers.

    In [38], Shen et al. address the problem of bursty wireless multimedia traffic with unpredictable statistical fluctuations in wide-band CDMA (Code Division Multiple Access) cellular systems, and propose an intelligent CAC (ICAC) that makes call admission decisions based on QoS parameters such as handoff call drop probability, outage probabilities of various service types, existing-call interference estimates, the link gain, and the estimate of the equivalent interference of the call request. Estimation of the existing-call interference in ICAC is done by a pipeline recurrent neural net (PRNN), which predicts the mean value of the system interference for the next period as a function of p measured interference powers and q previously predicted interference powers, where p and q are the fuzzy estimator subsystem parameters. For equivalent interference estimation, ICAC uses a fuzzy estimator that takes as input four parameters of the new call: peak and mean traffic rates, peak traffic duration, and the outage probability requirement. The fuzzy call admission processor of ICAC uses the two interference estimates provided by the fuzzy estimator and the PRNN, in conjunction with other QoS information, to make a four-fold decision: {Strong Accept, Weak Accept, Weak Reject, Strong Reject}. Simulation results comparing ICAC with two traditional CAC methods, namely PSIR-CAC (predicted signal-to-interference ratio CAC) and MCAC (Multimedia CAC), indicate that ICAC achieves higher system capacity than PSIR-CAC and MCAC by more than 10% in traffic ranges where QoS requirements are guaranteed. The ICAC has been found to fulfill the multiple QoS requirements under all traffic load conditions, whereas the conventional CAC schemes fail under heavy traffic load conditions.

    Ahn and Ramakrishna [39] propose an interesting Hopfield neural network (HNN) based CAC algorithm for QoS provisioning in wireless multimedia networks. The QoS provisioning problem is formulated as a multi-objective optimization problem that seeks to maximize the twin objectives of resource utilization and fair distribution of resources (among different connections) subject to the constraint that the total allocated bandwidth cannot exceed the available capacity. The authors show that the overall objective function can be cast into the form of the HNN energy function given in Eq. (1), so that an HNN with n × m neurons (for an n-connection, m-QoS-level problem) can be set up to minimize the energy function and produce stable and feasible QoS vector values.

    In [40], Sinouci, Beylot and Pujolle formulate call admission control as a semi-Markov decision problem (SMDP), and develop a reinforcement learning (neuro-dynamic programming) based algorithm for construction of a dynamic call admission policy. The algorithm is implemented using both table lookup and feed-forward neural network approaches for determination of the Q-values (values of state-action tuples) of their system based on the number of current calls in two traffic classes and the characteristics of the new call (e.g., handoff or new, class 1 or class 2 type). The call admission decision (accept or reject) is made using the action value obtained using this approach, and the system is trained using the reward associated with the success of accepted calls. Their neural network based CAC is naturally more memory efficient than the table lookup implementation. The proposed method yields an optimal solution at much higher speed compared to traditional approaches, which are also difficult to optimize.

    For the reverse link transmission in wideband code division multiple access (WCDMA) cellular systems, Ye, Shen, and Mark propose a CAC scheme using fuzzy logic [41]. In their scheme, a fuzzy call admission processor uses the estimates of the effective bandwidth and network resources along with the QoS parameters as inputs to output a call acceptance or rejection decision. The effective bandwidth, in turn, is estimated by a fuzzy estimator using call request parameters and pilot signal strength information as inputs. Pilot signal strength is also used by a fuzzy estimator to produce a mobility estimate, which is used, in conjunction with the effective bandwidth and the bit energy to noise-plus-interference density ratio of the traffic class under consideration, by a fuzzy network resource estimator to produce the network resource estimate required by the fuzzy call admission processor. The authors provide simulation results to compare their approach with two previously proposed traditional CAC schemes, received power-based call admission control (RPCAC) [42] and non-predictive call admission control (NPCAC) [43], and demonstrate its effectiveness in terms of new and handoff call blocking probabilities, outage probability, and resource utilization.

    3 Adaptive Allocation and Management of Resources

    Allocation of resources is intimately related to call admission control and QoS management. Hence, in the case of multimedia applications requiring high throughput, it turns out to be a problem of paramount significance that needs to be handled intelligently. Considering the need for continual revision of bandwidth allocations to different calls in high traffic situations, Sherif et al. [44] propose a genetic algorithmic approach to adaptive allocation of resources and call admission control in wireless ATM networks. In their scheme, the QoS requirement for each of the video, audio, and data sub-streams of a multimedia call can be specified from the 4-tuple {High, Medium, Low, Stream Dropped}, giving a total of 64 (4^3) possible Q-values for the call as a whole. Assuming that the maximum-to-minimum Q-value range for each call is available from the user data profile, they formulate the problem of adaptive allocation (in contrast to the traditional static allocation) of bandwidth for existing calls as an optimization problem to minimize the spare capacity (after call allocation) in the cell without either overshooting the cell capacity or going below the minimum Q-value of any call in the cell. This, being a non-convex optimization problem, has been solved using the genetic approach. Simulation results indicate the adaptability of the algorithm to high traffic situations, graceful degradation of individual user QoS levels with load, and effective and fair utilization of available bandwidth with an increased number of admitted calls.
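    The count of 64 Q-values and the spare-capacity objective can be checked with a short sketch. It enumerates one QoS level per sub-stream and, for a single call, picks the profile that leaves the least spare capacity by brute force. The brute-force search stands in for the genetic algorithm of [44] purely for illustration, and the bandwidth table is hypothetical.

```python
from itertools import product

LEVELS = ("High", "Medium", "Low", "Stream Dropped")
SUBSTREAMS = ("video", "audio", "data")

# Every combination of one level per sub-stream: 4 ** 3 = 64 profiles.
profiles = list(product(LEVELS, repeat=len(SUBSTREAMS)))

# Hypothetical bandwidth demand (arbitrary units) per sub-stream and level.
BANDWIDTH = {
    "video": {"High": 6, "Medium": 4, "Low": 2, "Stream Dropped": 0},
    "audio": {"High": 3, "Medium": 2, "Low": 1, "Stream Dropped": 0},
    "data": {"High": 4, "Medium": 2, "Low": 1, "Stream Dropped": 0},
}


def demand(profile):
    """Total bandwidth needed by a call served at the given profile."""
    return sum(BANDWIDTH[s][level] for s, level in zip(SUBSTREAMS, profile))


def best_profile(capacity):
    """Feasible profile that leaves the least spare capacity (brute force)."""
    feasible = [p for p in profiles if demand(p) <= capacity]
    return max(feasible, key=demand)


tight = best_profile(9)
roomy = best_profile(13)
```

    With many calls sharing a cell, the number of joint profiles grows as 64 to the power of the number of calls, which is why an exhaustive search of this kind gives way to a heuristic such as the genetic approach.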

    Yuang and Tien [45] propose an intelligent multiple access control system (IMACS) with a facility for dynamic bandwidth allocation for wireless ATM multimedia networks. The IMACS consists of a multiple access controller (MACER), a traffic estimator and predictor (TEP), and an intelligent bandwidth allocator (IBA). MACER employs a hybrid-mode TDMA (time division multiple access) scheme with advantageous features of CDMA (code division multiple access) and contention access based on a novel dynamic-tree-splitting collision resolution algorithm parameterized by an optimal splitting depth (SD). TEP periodically estimates the key Hurst parameter of available bit rate (ABR) self-similar traffic by wavelet analysis, and then predicts the mean and variance of subsequent frames using a six-layer neural fuzzy on-line traffic predictor (NFTP). Based on these predicted values, IBA performs efficient bandwidth allocation by determining the optimal SD, achieving satisfactory SCR (Signaling ContRol) blocking probability and ABR throughput requirements, while retaining maximal aggregate throughput. The NFTP algorithm achieves speed by learning in parallel the structure of the fuzzy if-then rules as well as the parameters for tuning the coefficients of the rules to input traffic dynamics.

    For hierarchical cellular systems supporting multimedia services, Lo, Chang, and Shung [46] propose a neuro-fuzzy radio resources manager, which essentially contains a neural fuzzy channel allocation processor (NFCAP). The two-layer architecture of the NFCAP includes a fuzzy cell selector (FCS) in the first layer and a neural fuzzy call-admission and rate controller (NFCRC) in the second layer. Using the user mobility, resource availabilities in both micro and macro cells, and handoff failure probabilities as input linguistic variables, the FCS comes up with a cell selection decision. The NFCRC then comes up with CAC and rate control decisions using the handoff probability and the resource availability of the selected cell as input variables. The authors establish through simulations that their method enhances system utilization by 31.1% with a marginal 2% increase in handoff rate compared to the overflow channel allocation scheme [47]. Compared to the combined channel allocation [48] and fuzzy channel allocation control [49] schemes proposed by them earlier, the NFCAP is shown to provide 6.3 and 1.4% better system utilization and still achieve handoff rate reduction by 14.9 and 6.8%, respectively.

    For third generation wireless multimedia networks demanding high throughput and QoS guarantees, Moustafa, Habib, and Naghshineh [50] propose an evolutionary computational model based wireless radio resource manager (RRM) that controls both the transmission power and the bit rate of the mobile devices cooperatively. Adaptive control of these parameters is achieved by the RRM on a continual basis by solving an optimization problem seeking to maximize a multi-modal objective function expressed as the sum of the total number of users satisfying minimum signal quality requirements (assessed from their bit error rates), and the total reward for better utilization of resources, considering the relative reward values, for the corresponding users, for bandwidth utilization beyond their guaranteed minimum levels and frugal use of power. Experimental results indicate that their algorithm helps to reduce the infrastructure costs by requiring fewer base stations because of 70% more coverage on the average by each base station implementing the algorithm. Other benefits include a significant decrease (40%) in call blocking rate, longer battery life of the mobile unit because of frugal use of power, and minimal interference among the users.


    Motivated by the need for addressing the scarcity and large fluctuations in the availability of link bandwidth by the development of adaptive multimedia services, Fei, Wong, and Leung [51] propose a reinforcement learning approach for QoS provisioning by dynamic adjustment of the bandwidth allocations for individual ongoing calls. In this paper, the CAC and the dynamic bandwidth allocation problems are formulated as Markov decision processes (MDP) and solved using a form of real-time reinforcement learning scheme called Q-learning. In their formulation, whenever an event such as new or handoff call arrival, call termination, or call handoff to a neighboring cell occurs in a cell, an optimal policy, or equivalently an appropriate set of actions (e.g., acceptance/rejection of a new/handoff call, bandwidth upgrading/downgrading of an ongoing call), is determined by maximization of the expected reward function subject to two QoS constraints: handoff dropping probability and average allocated bandwidth for a cell. The Q-learning approach permits efficient handling of the large state space (i.e., configuration of on-going calls of different types at a point in time) of the MDP problem without any prior knowledge of state transition probabilities. Simulation results indicate that this algorithm outperforms some traditional approaches [52,53] in bandwidth utilization and call drop reduction.
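
    To make the model-free character of Q-learning concrete, the following minimal sketch shows the standard tabular update rule applied to an admission decision. It is only an illustration under simplifying assumptions: the state is a hypothetical integer (e.g., the number of ongoing calls), the action set is reduced to accept/reject, and the two QoS constraints of [51] are not modeled.

```python
from collections import defaultdict

ACTIONS = ("accept", "reject")  # hypothetical, reduced action set


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    No state transition probabilities are needed, only observed samples.
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state):
    """Action with the highest learned value in the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


# Usage: the table starts at zero and is refined from observed events.
q_table = defaultdict(float)
q_update(q_table, state=0, action="accept", reward=1.0, next_state=1)
```

    Because only the visited state-action pairs are stored, the table grows with experience rather than with the full size of the state space.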

    4 Multimedia Traffic Management and Congestion Control

    Effective management of multimedia traffic is essential for guaranteed QoS. The traffic management entails a number of operations: (i) call admission control (CAC), (ii) traffic policing, (iii) traffic characterization and prediction, (iv) rate/flow control, (v) routing and link capacity assignment, and (vi) congestion control. Since CAC is a topic well addressed in the literature on CI methods for multimedia communications, we devoted a separate section in the beginning for this topic. For similar reasons, we will be dealing with the traffic routing problem separately, with particular emphasis on multicast routing, in the following section. The remaining topics related to traffic management will be considered in this section. For a more comprehensive review of ATM traffic management, one may refer to the survey papers of Douligeris and Develekos [54], and Sekercioglu, Pitsillides, and Vasilakos [55].

    Call admission and resource allocation must necessarily be followed by policing of the multimedia network usage, and enforcement of proper usage so as to avert congestion and network delays. This process is also called usage parameter control (UPC) because it involves continuous monitoring of the sources for operation within the limitations of their respective CAC parameters negotiated during call setup phase. Next, traffic characterization and prediction is necessary for both CAC and flow control functions. Proper routing of traffic is essential for link capacity management, and hence congestion control. Finally, congestion control may also be done by rate control using a feedback mechanism. Thus all the traffic management functions are closely inter-linked.

    As in the case of CAC, Hiramatsu [56] is one of the earliest researchers to employ neural networks for integrated ATM traffic control as well. He proposes three levels of NNs: (i) cell transfer level NNs for call transmission pattern recognition, service class suggestion, and window size control, (ii) call level NNs for admission control of bursty calls and multi-level quality control, and (iii) network level NNs for optimal routing, link capacity assignment, and congestion control. For system efficiency, he proposes a two-phase training for the distributed system of NNs where separate training of individual NNs in the first phase is followed by the training of the whole system in the second phase. Addressing the link capacity assignment problem, he first estimates the CLR values for various links using a bank of three-layer FFNNs with call generation rates and logical link capacities for the corresponding links as FFNN inputs. A neural network in the next higher level of hierarchy uses these estimated CLR values along with logical and physical link capacities as inputs, and performs multi-objective optimization by seeking to minimize maximum CLR value and maximize link utilization. The BP algorithm is used to train the neural network for estimation of CLR. For objective function optimization by the higher level network, Hiramatsu uses Matyas' random optimization method [57].

    In [58], Tarraf, Habib, and Saadawi show how a comprehensive NN solution to the traffic management problem in ATM multimedia networks can be worked out by integrating some of the earlier proposed NN-based methods for different aspects of traffic management. They divide the traffic management functional module into three submodules that operate at cell, call, and network levels. The states of the buffers that maintain various types of multimedia information, together with the output of the bandwidth assignment module at the UNIs (user-network interfaces), i.e., the access nodes to the network, provide the information required for processing at the three control modules. The call level control function is implemented as a two-level hierarchy of feed-forward networks where two first level neural networks separately compute the service quality parameters such as call arrival rate, CLR, and call rejection rate from the declared call parameters and history of the past observed status, and the statistical parameters of the aggregate link traffic from traffic measurements. The NN CAC at the second level aggregates the information from the first level neural networks and comes up with a decision to accept or reject a call. The authors propose that the traffic characterization and prediction at the cell level may be done using any of the earlier NN-based methods (e.g., [59, 60]) using the states of the UNI buffers as the NN inputs. The predicted traffic outputs from the NN are then used by another NN for policing as proposed in [61]. Finally, for the network level traffic control, the authors suggest implementing either of the two earlier proposed neural network congestion control mechanisms [62,63].


    Pitsillides, Sekercioglu, and Ramamurthy [64] use peak, minimum, and initial cell rates obtained by monitoring ABR queue lengths, additive increase rate, and control interval as inputs to their fuzzy congestion control system for estimation of flow rates, which are provided as feedback to the traffic sources. Results of a simulation experiment comparing their method with the EPRCA (Enhanced Proportional Rate Control Algorithm) [65] indicate that their algorithm fares better with respect to end-to-end delay, speed of transient response, and network utilization.

    Lin et al. [66] propose a genetic algorithm based neural-fuzzy inference network for extraction of the features characterizing traffic at each node of a binary decision tree that is used in mixed scheduling of ATM traffic, a hybrid of rate monotonic and deadline driven approaches. The authors use simulations to show the effectiveness of the proposed GA based neural fuzzy network in learning optimal solutions compared to similar networks trained with the BP algorithm.

    In [67], Chen et al. present an approach to traffic control that uses a fuzzy ARMAX (autoregressive moving-average model with auxiliary input) process for an effective modeling of nonlinear time-varying and time-delayed properties of multimedia traffic. In this model, traffic from controllable sources such as ABR traffic represents the fuzzy ARMA component, and the uncontrollable traffic, such as CBR (Constant Bit Rate) and VBR (Variable Bit Rate) traffic, is considered as external disturbance. Simulation results indicate that their method for fuzzy adaptive control of traffic flow using traffic prediction based on this model is superior to other competitive approaches with respect to cell loss rate and network utilization.

    5 Routing and Multicast Routing

    As indicated in the previous section, routing is a traffic management issue with impact on QoS at the network level. Multicast routing is a special case of routing of multimedia streams from a source to a number of destinations; it is pivotal to applications such as video conferencing, tele-classrooms, and video on demand. Needless to say, effective multicasting methods are also essential for QoS control.

    Park and Lee [68] employ feedback neural networks for multimedia traffic routing. They solve the problem of maximizing the connected cross-points in a crossbar switch for a given traffic matrix, subject to the constraint that only one cross-point is permitted to be connected in each row or column of the switch, by casting the criterion function as an energy function that can be minimized by a Hopfield neural network.
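
    The idea of casting such a constrained assignment as an energy function can be sketched as follows. The penalty form and the coefficient below are a common textbook-style choice and an assumption on our part, not the function used in [68]; the sketch only evaluates the energy of a given cross-point configuration and does not simulate the network dynamics.

```python
def switch_energy(v, traffic, penalty=2.0):
    """Energy of a crossbar configuration v (n x n matrix of 0/1 entries).

    Assumed form: a penalty for every pair of connected cross-points that
    share a row or a column, minus the amount of requested traffic served.
    Lower energy means a better configuration.
    """
    n = len(v)
    row_conflicts = sum(
        v[i][j] * v[i][k]
        for i in range(n) for j in range(n) for k in range(n) if j != k
    )
    col_conflicts = sum(
        v[i][j] * v[k][j]
        for j in range(n) for i in range(n) for k in range(n) if i != k
    )
    served = sum(traffic[i][j] * v[i][j] for i in range(n) for j in range(n))
    return 0.5 * penalty * (row_conflicts + col_conflicts) - served


# Hypothetical 2 x 2 traffic matrix in which every input requests every output.
traffic = [[1, 1], [1, 1]]
valid = [[1, 0], [0, 1]]     # one cross-point per row and per column
invalid = [[1, 1], [0, 1]]   # two cross-points in row 0 and in column 1
```

    With these numbers the valid configuration has a lower energy than the invalid one, so a network that descends the energy landscape is steered toward configurations satisfying the row and column constraint.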

    Zhang and Leung [69] propose a novel genetic algorithm (GA) called orthogonal GA for multimedia multicast routing under delay constraints. In their GA formulation, a chromosome is a multicast tree represented as a binary string of size equal to the cardinality of the network link set. A value of 1 (or zero) in the string indicates the presence (or absence) of the corresponding network link in the multicast tree. As a measure of quality of the multicast tree, the authors propose a fitness vector with two components: (i) cumulative path delays in excess of a configured threshold, and (ii) overall cost of the multicast tree. By lexicographic ordering of the vectors based on their component values, the multicast trees can be arranged in descending order of merit. An important aspect of the GA is an orthogonal crossover and mutation operation to generate j offspring from n parents (the so-called n-j crossover and mutation algorithm). Since the offspring so generated may not necessarily be multicast trees (with connections from the source to destination nodes), the authors propose a check and repair operation also. Simulation results indicate that their orthogonal GA can find near optimal solutions for practical problem sizes.
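
    The binary encoding and the lexicographically ordered fitness vector can be sketched as below. The five-link network, its costs and delays, and the delay bound are hypothetical, and the orthogonal crossover and the check-and-repair step of [69] are not reproduced; the sketch only shows how a bit string is scored and ranked.

```python
import heapq
import math

# Hypothetical undirected links: (node_u, node_v, cost, delay).
LINKS = [
    ("s", "a", 2, 3),
    ("s", "b", 1, 6),
    ("a", "d1", 2, 2),
    ("b", "d1", 1, 1),
    ("a", "d2", 3, 4),
]


def delays_from(source, chromosome):
    """Smallest delay from source to every node over the selected links."""
    adj = {}
    for bit, (u, v, _cost, delay) in zip(chromosome, LINKS):
        if bit:
            adj.setdefault(u, []).append((v, delay))
            adj.setdefault(v, []).append((u, delay))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def fitness(chromosome, source, dests, delay_bound):
    """(excess delay, tree cost); an unreachable destination gives inf."""
    dist = delays_from(source, chromosome)
    excess = sum(max(0, dist.get(t, math.inf) - delay_bound) for t in dests)
    cost = sum(c for bit, (_u, _v, c, _d) in zip(chromosome, LINKS) if bit)
    return (excess, cost)


def rank(population, source, dests, delay_bound):
    """Best first: Python compares tuples lexicographically."""
    return sorted(population, key=lambda c: fitness(c, source, dests, delay_bound))
```

    Since tuples compare component by component, any chromosome that meets the delay bound ranks ahead of every chromosome that violates it, and cost only breaks ties among equally delay-compliant candidates.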

    In [70], Mao et al. present a genetic algorithmic approach to multi-path routing of multi-description (actually double description) video in wireless ad hoc networks. In the multi-description multimedia encoding scheme that is gaining popularity of late, multiple streams corresponding to multiple equivalent descriptions of multimedia content generated from a source are transmitted to a destination, which can use any subset of the source streams received to construct the original multimedia content with a quality commensurate with the cardinality of the subset used. The authors show that the multi-path routing problem is a cross-layer optimization problem where the average video distortion, i.e., an application layer performance metric, may be expressed as a function of the network layer performance metrics such as bandwidth, loss, and path correlation. Their final formulation turns out to be a problem of constrained minimization of average distortion of the received video expressed as a function of individual and joint probabilities for receiving the multiple descriptions, and the computable distortions for media reconstruction using the received streams. The constraints for their optimization are loop-free paths and stable links. Due to the exponentially complex nature of the problem, the authors resort to a genetic approach for the solution, considering each candidate path constructed by random traversal from the source to destination as a chromosome and nodes (designated by their numbers and positioned in the same order as on the path) as genes. For the double description video problem, they use chromosome pairs as candidate solutions because two paths are required for transmitting the two descriptions. For crossover, two such path pairs are considered, one string from each pair is randomly chosen, and crossover is performed using the first common node in the chosen strings as the crossover point. Mutation on a chromosome (path) pair is similarly done by choosing one of the strings and a node in the string randomly, and reconstructing the partial path from that node to the destination node by using any constructive approach without repeating any nodes in the partial path from the start node up to (but not including) the chosen node. The authors provide simulation results to demonstrate superior performance of their approach (in terms of the average Peak Signal to Noise Ratio of the reconstructed video) over several other approaches including the 2-Shortest path [71] and Disjoint Path-set Selection Protocol (DPSP) [72] algorithms.
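
    The crossover on path chromosomes described above can be sketched as follows. Node names are hypothetical, and two details are assumptions made for illustration: the common node is searched among intermediate nodes only, and the parents are kept whenever a child would revisit a node (one simple way to honor the loop-free constraint).

```python
import random


def first_common_node(p, q):
    """First intermediate node of path p that also lies on path q, or None."""
    inner_q = set(q[1:-1])
    for node in p[1:-1]:
        if node in inner_q:
            return node
    return None


def is_loop_free(path):
    """A path is loop-free if no node appears twice."""
    return len(set(path)) == len(path)


def crossover_paths(p, q):
    """Swap the tails of p and q at their first common node."""
    node = first_common_node(p, q)
    if node is None:
        return p, q
    i, j = p.index(node), q.index(node)
    child_p, child_q = p[:i] + q[j:], q[:j] + p[i:]
    if is_loop_free(child_p) and is_loop_free(child_q):
        return child_p, child_q
    return p, q  # assumed fallback: keep the parents


def crossover_pairs(pair_a, pair_b, rng):
    """Pick one path from each pair at random and cross the chosen paths."""
    ia, ib = rng.randrange(2), rng.randrange(2)
    child_a, child_b = crossover_paths(pair_a[ia], pair_b[ib])
    new_a, new_b = list(pair_a), list(pair_b)
    new_a[ia], new_b[ib] = child_a, child_b
    return tuple(new_a), tuple(new_b)
```

    Because both parents start at the source and end at the destination, swapping tails at a shared node always yields complete source-to-destination paths.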

    In [73], Wang and Pavlou formulate group multicast content delivery in multimedia networks as an integer programming problem to compute a set of bandwidth constrained Steiner trees [74] with minimum overall bandwidth consumption. The authors propose to represent the set of explicit Steiner trees with shortest path trees (from the root node of the group to any receiver) through intelligent configuration of a unified set of weights of the network links. Accordingly, in their genetic algorithmic (GA) formulation, each chromosome is represented by a link weight vector with the size equal to the number of network links. Fitness of a chromosome is computed as a value inversely proportional to the sum of the overall network load and excessive bandwidth allocated to overloaded links. A fixed population size of 100 is chosen, and GA evolution starts with an initial population of random vectors of link weights in the range 1 to 64. In the crossover operation, the offspring are generated by taking a chromosome from both the top and the bottom fifty (sorted with respect to the fitness values) and then choosing individual genes for the offspring from either parent with a predefined crossover probability. To escape from local minima, two types of mutation (changing the weight of a link to a random value) are used: (i) global mutation of every link with a low mutation probability, and (ii) mutation of congested links. Results of evaluation indicate that the proposed GA approach provides higher service availability with significantly less bandwidth compared to the conventional IP (Internet Protocol) approaches.
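
    The link-weight encoding and the two genetic operators lend themselves to a short sketch. The function names, the exact fitness expression (including the added 1 that avoids division by zero), and the way congested links are supplied are assumptions; computing the network load from shortest path trees, as done in [73], is outside the scope of this illustration.

```python
import random

MIN_W, MAX_W = 1, 64  # link weight range stated in the text


def random_chromosome(num_links, rng):
    """One link weight per network link, drawn uniformly from the range."""
    return [rng.randint(MIN_W, MAX_W) for _ in range(num_links)]


def weight_fitness(total_load, excess_bandwidth):
    """Assumed form: inverse of overall load plus overload on congested links."""
    return 1.0 / (1.0 + total_load + excess_bandwidth)


def crossover(top_parent, bottom_parent, p_top, rng):
    """Gene-wise choice: take the top parent's gene with probability p_top."""
    return [
        t if rng.random() < p_top else b
        for t, b in zip(top_parent, bottom_parent)
    ]


def mutate(chromosome, congested, p_global, rng):
    """Re-draw the weight of every congested link, and of any other link
    with the (low) global mutation probability p_global."""
    child = list(chromosome)
    for i in range(len(child)):
        if i in congested or rng.random() < p_global:
            child[i] = rng.randint(MIN_W, MAX_W)
    return child
```

    Pairing a parent from the better half with one from the worse half keeps some diversity in the population, while the targeted mutation concentrates the random changes on links known to be overloaded.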

    Neural network solutions to the allied problem of obtaining Steiner (multicast) trees from the network graph may be found in the literature. Gelenbe, Ghanwani, and Srinivasan [75] demonstrate the use of random neural networks for substantial improvement in the quality of the Steiner trees that may be obtained by using the best available heuristics such as the minimum spanning tree and the average distance heuristics. In [76], Xia, Li, and Yen propose a modified version of SOFM (self-organizing feature map) [7] for the construction of balanced multicast trees.

    6 Multimedia Composition, Encoding, Streaming, and Synchronization

    Until now, our focus has been on control issues related to multimedia networking. In this section, we survey the sparse literature on the CI methods that address pure communication issues related to multimedia data, such as media streaming, synchronization, encoding, etc. In the following section, we discuss a few CI-based multimedia services.

    A cost effective (network bandwidth efficient) solution to multimedia content delivery in media browsing applications is low resolution content delivery during navigation. The idea here is to permit the users to easily and quickly preview media sequences at various resolutions and zoom in on the segments of their interest. Doulamis and Doulamis [77] propose an optimal content-based media decomposition (composition) scheme for such an interactive navigation. Though their proposal is for video navigation, the scheme can be extended to the generic multimedia case. In their scheme, a set of representative shots is initially extracted from the sequence to form a coarse description of the visual content. The remaining shots are classified into one or the other representative shot class types. The content of each shot is similarly decomposed into representative frames (frame classes) characterized by global descriptors such as color, texture, and motion parameters, and object (region) descriptors such as size and location. The other objects in the shot are classified into one or the other of the frame classes. The video decomposition problem is then posed as a problem of optimally selecting representative shots (frames) so as to minimize the aggregate correlation measure among the shots (frames) of the same class. In view of the exponential complexity of the search for the optimal solution, the authors use a genetic search using a binary string representation for denoting the presence or absence of the shots (frames) in the sequence (shot) in the representative classes. The scheme is shown to offer a significant reduction (85 to 1) in the transmitted information compared to the traditional sequential video scanning.

    In [78], Su and Wah propose an NN-based approach to compensation of compression losses in multi-description video streaming. To facilitate real-time playback, a three-layer FFNN in their system is trained in advance by the BPL algorithm using pixels from deinterleaved and decompressed frames as FFNN inputs, and those taken from the original frames (before compression) as desired outputs.

    Bojkovic and Milovanovic [79] propose a motion-compensated discrete cosine transform (MC-DCT) based multimedia coding scheme that optimally allocates more bits to regions of interest (ROI) compared to non-ROI image areas. Identification of ROIs is done by a two-layer neural network with the FFNN in the first layer for generation of the segmentation mask using the features extracted from each image block, and the FFNN in the second layer for improving the obtained segmentation by exploiting object continuity in the segmentation mask provided by the first network and additional features. The authors indicate that their approach achieves better visual quality of images along with signal-to-noise ratio improvements compared to the standard MPEG (Moving Picture Experts Group) MC-DCT encoders.

    Automatic quantitative assessment of the quality of video streams over packet networks is an important problem in multimedia engineering. Packet video quality assessment is a difficult problem that depends on a number of parameters such as the source bit rate, the encoded frame type, the frame rate at the source, the packet loss rate in the network, etc. A method for such an assessment, however, facilitates development of control mechanisms to deliver the best possible video quality given the current network situation. In [80], Mohamed and Rubino propose the use of Gelenbe's random neural network (RNN) model [6]. Mohamed and Rubino show that the results obtained using RNNs are well correlated with the results of subjective analysis using our human perceptual system. Cramer, Gelenbe, and Gelenbe use an RNN-based scheme for video compression [81] and indicate that it is several times faster than H.261 and MPEG based compression schemes.

    Addressing the problem of integrating user preferences with network QoS parameters for streaming of the multimedia content, Ghinea and Magoulas [82] suggest the use of an adaptable protocol stack that can be dynamically configured with micro-protocols for various micro-operations involved in streaming such as sequence/flow control, acknowledgement, and error checking/correction coding. Then they formulate the protocol configuration as a multi-criteria decision making problem (to determine streaming priorities) that is solved by a fuzzy programming approach to resolve any inconsistencies between the user and the network considerations.

    Synchronized presentation of multimedia is an important aspect of nearly all multimedia applications. In [83], Zhou and Murata adopt a CI approach to the media synchronization problem, and propose a fuzzy timing Petri Net model (FTPNM) for handling uncertain or fuzzy temporal requirements, such as imprecisely known or unspecified durations. The model facilitates both intra-stream and inter-stream synchronization with guaranteed satisfaction of QoS requirements such as maximum tolerable jitters (temporal deviations) of individual streams, and maximum tolerable skew between media, by optimal placement of synchronization points using the possibility distributions of the QoS parameters obtained from the model.

    Considering the need for lightweight synchronization protocols that can readily adapt to the non-stationary workload of the browsing process and changing network configurations, Ali, Ghafoor, and Lee [84] propose a neuro-fuzzy framework for media synchronization on the web. In their scheme, each multimedia object (i.e., video, voice, etc.) is segmented into atomic units of synchronization (AUSs). With this, the media synchronization turns out to be a problem of appropriately scheduling the AUS despatches by web servers. The authors observe that this, in turn, is a multi-criteria scheduling problem with objectives to: (i) minimize tardy (deadline missing) AUSs, (ii) complete the transmission of bigger AUSs as close to their deadline as possible, and (iii) minimize dropping of AUSs in the event of severe resource constraints. This problem being NP-hard, the solution is approached through a neuro-fuzzy scheduler (NFS) that makes an intelligent compromise among the multiple objectives by properly combining a number of heuristic scheduling algorithms proposed by the authors. The NFS comes up with scheduling decisions taking the workload parameters and system state parameters as inputs. A two-phase learning scheme is used in the NFS, with self-organized learning in phase I to construct the presence of rules and locate initial membership functions for the rules, and supervised learning in phase II to optimally adjust the membership functions for the desired outputs. Simulation studies for a comparative assessment of the proposed method against several known heuristic methods and a branch and bound algorithm demonstrate superior adaptability of the method under varying workload conditions.

    One of the rare applications of rough set theory (RST) to multimedia is due to Jeon, Kim, and Jeong [85]. The authors propose a novel method (with attribute reduction by application of RST) for video deinterlacing. In their method, they estimate the missing pixels by employing, on a pixel-by-pixel basis, one of the following four earlier proposed deinterlacing methods: BOB [86], WEAVE [86], STELA [87], and FDOI [87]. Their deinterlacing approach uses four parameters: TD, SD, TMDW, and SMDW. The first two parameters refer to the temporal and spatial differences, respectively, between two pixels across the missing pixel, and the remaining two refer to temporal and spatial entropy parameters described in [87]. Using six video sequences as the training data, the authors first construct a fuzzy decision table that maps each set of the fuzzy values (small, medium, and high) of the attributes derived from the above parameters onto a decision on the choice of an algorithm from the four mentioned above. The RST is then used for finding the core attributes, and eliminating superfluous attributes by constructing the upper and lower approximation sets of the target algorithms for the subsets of the attributes. With experimentation on a different set of six standard video sequences, the authors establish the superior performance of their method over a number of methods presented in the literature.
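
    The lower and upper approximations that drive the attribute reduction can be computed with a few lines of code. The decision table below is a hypothetical toy example with only the TD and SD attributes and is not data from [85]; it merely shows how dropping an attribute can be tested by checking whether the approximations of a target set change.

```python
from collections import defaultdict


def approximations(table, attributes, target):
    """Rough set lower and upper approximations of a target set.

    table maps an object id to a dict of attribute values. Objects with
    identical values on the chosen attributes are indiscernible and form
    one equivalence class.
    """
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper


# Hypothetical fuzzy attribute values for four pixels.
TABLE = {
    1: {"TD": "small", "SD": "small"},
    2: {"TD": "small", "SD": "small"},
    3: {"TD": "high", "SD": "small"},
    4: {"TD": "high", "SD": "high"},
}
TARGET = {1, 3}  # hypothetical: pixels for which one method was chosen
```

    In this toy table, removing either attribute shrinks the lower approximation of the target set, so neither attribute would be considered superfluous.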

    7 Multimedia Services and Servers

    Prediction of user mobility profile in wireless multimedia communication systems is an essential support service for effective resources utilization under QoS constraints. Shen, Mark, and Ye [88] propose an adaptive fuzzy inference approach to user mobility prediction in a wireless network operating in Frequency Division Duplex (FDD) mode with DS/CDMA (Direct Spread-spectrum CDMA) protocol. The essential components of their system are a fuzzy estimator (FE) and a recursive least squares (RLS) predictor. The FE takes in as input the strength of the pilot signal from the mobile user, and predicts the probability of the user being in a particular cell using a rule base that takes into account imprecision in measurements, and shadow effects. The RLS predictor then improves upon the estimate obtained from FE using the previous few values of the mobile position.

    Addressing the problem of placing multiple copies of videos on different servers in a multi-server video on demand system, Tang et al. [89] propose a hybrid genetic algorithm to determine the optimal video placement that minimizes batching interval and server capacity usage while satisfying pre-specified requirements on blocking probability and individual server capacities. A chromosome in their GA formulation is an integer string with length equal to the number of videos where each integer represents the number of copies of a particular video. They use a fitness vector with two components: (i) the blocking probability (computed as $\sum_j q_j B_{q_j}$, where $q_j$ is the portion of the effective traffic allocated to server $j$, and $B_{q_j}$ is the blocking probability for server $j$ that is computable using the Erlang B formula [90] given the number of multicast streams $j$ is handling), and (ii) the total capacity usage. For ranking the chromosomes in a population, a multi-objective Pareto ranking scheme [91] is used. Offspring in the GA are generated by multi-point crossovers and mutation. The exact size of the population used in their experimentation is not explicitly stated in the paper. The experimental results indicate that the proposed algorithm converges to the best value on blocking probability at around 1000 generations, and the best value on server capacity usage in less than 4000 generations.
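
    The first fitness component is a traffic-weighted sum of per-server Erlang B blocking probabilities, which can be sketched as follows. The standard recursion for the Erlang B formula is used; how the offered load of each server is derived from its traffic share is left open here, so the loads are passed in as separate, hypothetical inputs.

```python
def erlang_b(offered_load, servers):
    """Erlang B blocking probability via the standard recursion
    B(A, 0) = 1,  B(A, m) = A * B(A, m-1) / (m + A * B(A, m-1)).
    offered_load is in Erlangs; servers is the number of streams."""
    b = 1.0
    for m in range(1, servers + 1):
        b = offered_load * b / (m + offered_load * b)
    return b


def overall_blocking(shares, loads, streams):
    """Traffic-weighted blocking: the sum over servers of the traffic
    share of a server times the blocking probability of that server."""
    return sum(q * erlang_b(a, n) for q, a, n in zip(shares, loads, streams))
```

    The recursion avoids the large factorials of the closed-form expression and remains numerically stable for large numbers of streams.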

    In [92], Ali, Lee, and Ghafoor propose a design of a multimedia web server using a neuro-fuzzy framework. The crux of their design is a neuro-fuzzy scheduler (NFS) for synchronized delivery of multimedia documents. An overview of this scheduler has already been presented in the previous section, in the context of the media synchronization problem, using a more comprehensive journal publication by the same authors [84].

    8 Challenges and Future Directions

    Even though many CI-based approaches are being proposed for various applications in multimedia networking and communications, their impact is mostly confined to academic circles. These methods are yet to find wide acceptance in industrial circles (possibly except in Japan), and get incorporated in many industrial products. This trend is also evident from the very small number of industrial patents in this direction. Hence, the main challenge for CI researchers is to provide the industry leaders with a convincing demonstration of the superiority of their approaches over the traditional methods. Another challenge is to develop methods compatible with existing standards, and new standards that facilitate CI-based implementations. Furthermore, since success of the fuzzy methods depends upon the compilation of a good knowledge base, gathering of rules of inference from experts remains a challenge for fuzzy systems designers. Similarly, development of new types of neural networks, their training algorithms, novel GAs preferably with parameters (e.g., population size, crossover and mutation rates) self-configurable by means of problem heuristics, and hybrid CI methods immensely suitable for the application problem at hand is always a challenge for the CI researchers.

    Most of the current literature on CI based methods for multimedia communications addresses the ATM network issues. A few papers deal with wireless multimedia. With the current trend towards IP based multimedia communications in both wired-line and 3GB (third generation and beyond) mobile wireless networks, there is a need to develop CI-based methods for IP-based network communications. Further, as is obvious from the much smaller coverage of multimedia communication aspects compared to that of network control aspects in the current article, pure communication issues in multimedia and mobile multimedia are not that well addressed by the CI methods in the current literature. The same applies to multimedia services and on-demand services. New on-demand services may be designed by employing either new or existing CI based methods. Hence, exploration of CI methods for new services and communication methods will be a fruitful direction of research in the future. Specific problems that have already been identified by the editors of this volume in this context are: (i) multimedia semantic characteristics in wireless, mobile, and ubiquitous environments, (ii) extraction and usage of semantic information in wireless and mobile environments, (iii) multimedia retrieval in wireless and mobile networks, (iv) P2P multimedia streaming in wireless and mobile networks, and (v) performance evaluation of mobile multimedia services. Finally, from the perspective of the specific CI approaches that need to be applied, explorations into possible applications of rough sets, and hybrids of neural, rough set, and fuzzy approaches to multimedia could lead to new and interesting avenues of research.


    1. Rumelhart D E, McClelland J L (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Massachusetts

    2. Almeida L B (1987) Proceedings of the IEEE First International Conference on Neural Networks 11:609-618

    3. Pineda F J (1987) Phys Rev Lett 19:2229-2232

    4. Hopfield J J (1982) Proceedings of the National Academy of Sciences of the USA 79(8):2554-2558

    5. Binder K (2001) Ising Model. SpringerLink Encyclopaedia of Mathematics, Springer

    6. Gelenbe E (1989) Neural Computing 1:502-511

    7. Kohonen T (1997) Self-organizing maps, 2nd Edition. Springer Verlag, Berlin Heidelberg New York

    8. Sutton R S, Barto A G (1998) Reinforcement Learning: An Introduction. MIT Press, Massachusetts

    9. Puterman M L (1994) Markov Decision Processes. Wiley, New York

    10. Zadeh L A (1965) Information and Control 8:338-353

    11. Mamdani E H (1974) Proceedings of the Institution of Electrical Engineers 121(12):1585-1588

    12. Pawlak Z (1991) Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht

    13. Dubois D (1990) International Journal of General Systems 17:191-209

    14. Stefanowski J (1998) On rough set based approaches to induction of decision rules. In: Polkowski L, Skowron A (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications: 500-529, Physica-Verlag, Heidelberg

    15. Ziarko W (1998) Rough sets as a methodology for data mining. In: Polkowski L, Skowron A (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications: 554-576, Physica-Verlag, Heidelberg

  • 74 P. Guturu

    16. Fogel L J, Owens A J, Walsh M J (1966) Articial Intelligence Through Simu-lated Evolution. Wiley, New York

    17. Holland J H (1975) Adaptation in natural and articial systems. University ofMichigan Press, Ann Arbor

    18. Rechenberg I (1973) Evolutionstrategie: Optimierung Technisher Systeme nachPrinzipien des Biologischen Evolution. Fromman-Hozlboog Verlag, Stuttgart

    19. Schwefel H -P (1981) Numerical Optimization of Computer Models. John Wileyand Sons, New-York

    20. Koza J R (1992) Genetic Programming: On the Programming of Computersby means of Natural Evolution. MIT Press, Massachusetts

    21. Hiramatsu A (1990) IEEE Transactions on Neural Networks 1(1):12213022. Hiramatsu A (1995) IEEE Communications Magazine 33(10):58, 636723. Youssef S A, Habib I W, Saadawi T N (1997) IEEE Journal on Selected Areas

    in Communication (Special Issue on Computational and Intelligent Communi-cation) 15(2):191199

    24. Guerin R, Ahmadi H, Naghshineh M (1991) IEEE Journal on Selected Areasin Communication 9(7):968981

    25. Vakil F (1993) Proceedings of the IEEE GLOBECOM Conference 1993 (1):406416

    26. Uehara K, Hirota K (1997) IEEE Journal on Selected Areas in Communication(Special Issue on Computational and Intelligent Communication) 15(2):179190

    27. Bensaou B, Lam S T C, Chu H -W, Tsang D H K (1997) IEEE/ACM Trans-actions on Networking 5(4):572584

    28. Klir G J, Yuan B (1995) Fuzzy Sets and Fuzzy Logic: Theory and Applications.Prentice-Hall, New York

    29. Ren Q, Ramamurthy G (2000) IEEE Journal on Selected Areas in Communi-cation 18(2):184196

    30. Liang Q, Karnik N N, Mendel J M (2000) IEEE Transactions on Systems, Man,And CyberneticsPart C: Applications And Reviews 30(3):329339

    31. Cheng R -G, Chang C -J, Lin L -F (1999) IEEE/ACM Transactions on Net-working 7 (1):111121

    32. Chang C -J, Cheng R -G, Lu K -R, Lee H -Y (2000) Neural Fuzzy ConnectionAdmission Controller and Method in a Node of an Asynchronous Mode Transfer(ATM) Network. US Patent# 6067287

    33. Cheng R -G, Chang C -J (1996) IEEE/ACM Transactions on Networking4(3):460469

    34. Kesidis G, Walrand J, Chang C -S (1993) IEEE/ACM Transactions on Net-working 1(4):424428

    35. ChengR-G,ChangC-J (1997)Proceedingsof IEECommunications144(2):939836. Chatovich A, Oktug S, and Dundar G (2001) Computer Communications

    24:1031104437. Berenji H R, Khedkar P (1992) IEEE Transactions on Neural Networks

    3(5):72474038. Shen S, Chung-Ju C, ChingYao H, Qi B (2004) IEEE Transactions on Wireless

    Communications 3(5):1810182139. Ahn C W, Ramakrishna R S (2004) IEEE Transactions on Vehicular Technology

    53 (1):10611740. Senouci S -M, Beylot A -L, Pujolle G (2004) International Journal on Network

    Management 14:89103

  • Computational Intelligence in Multimedia Networking and Communications 75

    41. Ye J, Shen X(S), Mark J W (2005) IEEE Transactions On Mobile Computing4(2):129141

    42. Huang C Y, Yates R D (1996) Proceedings IEEE Vehicular Technology Con-ference 96:16651669

    43. Sun S, Krzymien W A (1998) Proceedings of the IEEE Vehicular TechnologyConference 98:218223

    44. Sherif M R, Habib I W, Nagshineh M, Kermani P (2000) IEEE Journal onSelected Areas in Communications 18(2):268282

    45. Yuang M C, Tien P L (2000) IEEE Journal on Selected Areas in Communica-tions 18(9):16581669

    46. Lo K -R, Chang C -J, Shung C B (2003) IEEE Transactions On VehicularTechnology 52 (5):11961206

    47. Rappaport S S, Hu L R (1994) Proceedings of the IEEE 82(9):1383139748. Lo K -R, Chang C -J, Chang C, Shung C B (1998) Computer Communications

    21(13):1143115249. Lo K -R, Chang C -J, Chang C, Shung C B (2000) IEEE Transactions on

    Vehicular Technology 49(5):1588159850. Moustafa M, Habib I, Naghshineh M N (2004) IEEE Transactions on Wireless

    Communications 3(6):2385239551. Fei Y, Wong V W S, Leung V C M (2006) Mobile Networks and Applications

    11:10111052. Hong D, Rappaport S S (1986) IEEE Transactions on Vehicular Technology

    35(3):779253. Talukdar A K, Badrinath B R, Acharya A (1998) Proceedings ACM/IEEE

    MobiCom 98:16918054. Douligeris C, Develekos G (1997) IEEE Communications Magazine 35(5):

    15416255. Sekercioglu A, Pitsillides A, Vasilakos A (2001) Soft Computing Journal

    5(4):25726356. Hiramatsu A (1991) IEEE Journal on Selected Areas in Communications

    9(7):1131113857. Matyas J (1965) Automation and Remote Control 26:24625358. Tarraf A A, Habib I W, Saadawi T N (1995) IEEE Communications Magazine

    33 10):768259. Tarraf A A, Habib I W, Saadawi T N (1993) Proceedings of the IEEE GLOBE-

    COM 93(2):996100060. Neves J E, de Almeida L B, Leitao M J (1994) Proceedings of the IEEE ICC

    94(2):76977361. Tarraf A A, Habib I W, Saadawi T N (1994) IEEE Journal on Selected Areas

    in Communications 12(6):1088109662. Tarraf A A, Habib I W, Saadawi T N (1995) Proceedings of the IEEE ICC

    95(1):20621063. Liu Y, Douligeris C (1995) Proceedings of the IEEE GLOBECOM

    95(1):29129564. Pitsillides A, Sekercioglu Y A, Ramamurthy G (1997) IEEE Journal on Selected

    Areas in Communications 15(2):20922565. Roberts L (1994) Enhanced PRCA (proportional rate-control algorithm). Tech-

    nical Report AF-TM 94-0735R1

  • 76 P. Guturu

    66. Lin C -T, Chung I -F, Pu H -C, Lee T -H, Jyh-Yeong Chang J -Y (2003)IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics32(6):832845

    67. Chen B -S, Yang Y -S, Lee B -K, Member, Lee T -H (2003) IEEE Transactionson Fuzzy Systems 11(4):568581

    68. Park Y -K, Lee G (1995) IEEE Communications Magazine 33(10):687469. Zhang Q, Leung Y -W (1999) IEEE Transactions on Evolutionary Computation

    3(1):536270. Mao S, Hou Y T, Cheng X, Sherali H D, Midki S F, Zhang Y -Q (2006) IEEE

    Transactions On Multimedia 8(5):1063107471. Eppstein D (1999) SIAM Journal of Computing 28(2):65267372. Papadimitratos P, Haas Z, Sirer E (2002) Proceedings of ACM Mobihoc:11173. Wang N, Pavlou G (2007) IEEE Transactions on Multimedia 9 (3):61962874. Hwang F K, Richards D S, Winter P (1992) The Steiner Tree Problem. Elsevier,

    North-Holland.75. Gelenbe E, Ghanwani A, Srinivasan V (1997) IEEE Journal on Selected Areas

    in Communications 15(2):14715576. Xia Z, Li P, Yen I-L (2004) Proceedings of the 18th International Parallel and

    Distributed Processing Symposium:546377. Doulamis A D, Doulamis N D (2004) IEEE Transactions on Circuits and Sys-

    tems for Video Technology 14(6):75777578. Su X, Wah B W (2001) IEEE Transactions on Multimedia 3(1):12313179. Bojkovic Z, Milovanovic D (2004) Seventh Seminar on Neural Network Appli-

    cations in Electrical Engineering:677180. Mohamed S, Rubino G (2002) IEEE Transactions On Circuits And Systems

    For Video Technology 12(12):1071108381. Cramer C, Gelenbe E, Gelenbe P (1998) IEEE Potentials 17(1):293382. Ghinea G, Magoulas G D (2005) IEEE Transactions on Multimedia

    7(6):1047105383. Zhou Y, Murata T (1998) IEEE International Conference on Systems, Man,

    and Cybernetics 98(1):24424984. Ali Z, Ghafoor A, Lee C S G (2000) IEEE Journal on Selected Areas in Com-

    munications 18(2):16818385. Jeon G, Kim D, Jeong J (2006) IEEE Transactions on Consumer Electronics

    52 (4):1348135586. Haan G D, Bellers E B (1998) Proceedings of the IEEE 86(9):1839185787. Jeon G, Jeong J (2006) IEEE Transactions on Consumer Electronics

    52(3):1013102088. Shen X, Mark J W, Ye J (2000) Wireless Networks 6:36337489. Tang W K S, Wong E W M, Chan S, Ko K -T (2004) IEEE Transactions on

    Broadcasting 50(1):1625 D.90. Bertsekas D, Gallager R (1992) Data Networks. Prentice-Hall, New York, p. 17991. Fonseca C M, Fleming P J (1998) IEEE Tranactions on Systems, Man, and

    Cybernetics Part A: Systems and Humans 28(1):263792. Ali Z, Lee C S G, Ghafoor A (2000) IEEE International Conference on Fuzzy

    Systems 9 (1):510515

  • Part II

    Computational Intelligence in 3D Multimedia Virtual Environment and Video Games

  • A Synthetic 3D Multimedia Environment

    Ronald Genswaider1, Helmut Berger1, Michael Dittenbach1, Andreas Pesenhofer1, Dieter Merkl2, Andreas Rauber1,2, and Thomas Lidy2

    1 E-Commerce Competence Center EC3, iSpaces Research Group, Donau-City-Strasse 1, A-1220 Wien, Austria
      ronald.genswaider@ec3.at, helmut.berger@ec3.at, michael.dittenbach@ec3.at, andreas.pesenhofer@ec3.at

    2 Department of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11/188, A-1040 Vienna, Austria
      dieter.merkl@ec.tuwien.ac.at, rauber@ifs.tuwien.ac.at, lidy@ifs.tuwien.ac.at

    Summary. In this chapter we present The MediaSquare, a synthetic 3D multimedia environment we are currently developing. The MediaSquare enables users, impersonated as avatars, to browse and experience multimedia content by literally walking through it. Users may engage in conversations with other users, exchange experiences, and collectively explore and enjoy the featured content. The combination of algorithms from the area of artificial intelligence with state-of-the-art 3D virtual environments creates an intuitive interface that provides access to manually as well as automatically structured multimedia data while allowing users to take advantage of spatial metaphors.

    1 Introduction

    Millions of users interact, collaborate, socialize and form relationships with each other through avatars in online environments such as Massively Multi-User Online Role-Playing Games (MMORPGs) [4,39,40]. While the predominant motivation to participate in MMORPGs is still playing, an increasing number of users are spending a significant amount of time in 3D virtual worlds without following a predefined quest. Generating, publishing and, most importantly, experiencing content in 3D virtual spaces is an emerging trend on the Internet, with Second Life1 being the most prominent representative at the time of writing. On the one hand, such 3D virtual worlds address the aspect of social interaction by providing instruments to interact and to exchange experiences with other users that go beyond the possibilities of conventional text-based chat rooms. In particular, one's inherent presence in space and the awareness of others facilitate the initiation of social contacts. On the other hand, using 3D virtual worlds has the advantage of communicating via commonly accepted spatial metaphors [13]. Similarity of objects can be expressed by spatial relations, i.e. the more similar two objects are, the closer together they are placed. Furthermore, users can interpret each other's interests by how close they are to one another and to the objects in space. Having a common point of reference and orientation within the virtual space, as well as being aware that other users can see one's actions and objects in the same way, are important features regarding communication between users about particular locations. Consequently, users are supported in building a mental model of the information space, in understanding its characteristics, and in grasping which information is present and how the respective items relate to each other.

    1 http://secondlife.com.

    R. Genswaider et al.: A Synthetic 3D Multimedia Environment, Studies in Computational Intelligence (SCI) 96, 79–98 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    The MediaSquare, a synthetic 3D multimedia environment, takes advantage of these spatial metaphors and allows users to explore multimedia information that is structured and organized within space (cf. Fig. 1). The information is organized either based on the actual content or by transforming a branch of a directory into architectural structures. Currently, The MediaSquare implements the following scenarios. The first scenario, S1, is a 3D Music Showroom that enables users to browse and listen to songs within the virtual environment. To this end, acoustic characteristics are extracted from music tracks by applying methods from digital signal processing and psycho-acoustics. The features describe the stylistic facets of the music, e.g. beat, presence of voice, timbre, etc., and are used for the training of a self-organizing map that arranges similar music tracks in spatially adjacent regions. More precisely, the self-organizing map is an unsupervised neural network model that provides a topology-preserving mapping from a high-dimensional input space onto a 2D output space [18]. A second scenario, S2, aims at the implementation of a 3D Video and Image Showroom that allows users to experience content such as images or videos. To this end, characteristic features are extracted from the respective images or videos. The training of a self-organizing map is based on these features and, in analogy to the first scenario, the resulting 2D map identifies the actual position of each particular image or video source within the 3D Video and Image Showroom. This particular scenario will be fully integrated in the final version of The MediaSquare. In the third scenario, S3, a 3D Scientific Library has been implemented. This library enables users to explore scientific documents such as posters or papers in this immersive 3D environment. On the one hand, a directory structure is used to create a room layout in which the content is presented. On the other hand, characteristic text features are extracted from documents and are used for the training of a self-organizing map. Again, the resulting 2D map defines the actual position of each document in the 3D representation.

    Fig. 1. The MediaSquare implements the following scenarios: S1, 3D Music Showroom; S2, 3D Image and Video Showroom; S3, 3D Scientific Library

    In a nutshell, the main contribution of The MediaSquare is the realization of an impressive showcase for combining state-of-the-art multimedia feature extraction approaches and unsupervised neural networks assembled in an immersive 3D multimedia content presentation environment.

    The remainder of this chapter is organized as follows. In Sect. 2, an overview of document clustering and digital libraries as well as music information retrieval approaches and applications of 3D virtual environments is given. Self-organizing maps and the required feature extraction techniques are outlined in Sect. 3. The system architecture of The MediaSquare is detailed in Sect. 4, followed by a description of the actual showcase in Sect. 5. Finally, in Sect. 6, some conclusions are given and an outlook on further research and development activities is provided.

    2 Related Work

    The design of user interfaces allowing the user to understand the contents of a document archive as well as the results of a query plays a key role in many digital library projects and has produced a number of different approaches [13,14]. However, most designs rely on the existence of a descriptive title of a document to allow the user to understand the contents of the library, or use manual assignment of keywords to describe the topics of the collection, as in the WEBSOM project [15], where units were labeled with the newsgroup that a majority of articles on a specific node came from. The LabelSOM method makes it possible to automatically label the various areas of the library map with keywords describing the topical sections based on the training results. This provides the user with a clear overview of the contents of a SOM library map, similar to the maps provided at the entrance of conventional libraries [26].


    The necessity to visualize information and the results of searches in digital libraries has gained interest. A number of visualization techniques for information retrieval and information representation purposes were developed at Xerox PARC as part of the Information Visualization Project [33]. Information is presented in a 3D space with the focus laid on the amount of information being visible at one time and an easily understandable way of moving through large information spaces. As one of the first examples of metaphor graphics for digital library visualization, the Bookhouse project [29] may be mentioned, where the concept of a digital library is visualized using the representation of a library building with several rooms containing various sub-collections and icons representing a variety of search strategies. At the CNAM library, a virtual reality system was designed for the visualization of the antiquarian Sartiaux Collection [8, 9], where the binding of each book is scanned and mapped into a virtual 3D library to allow the user to experience the library as realistically as possible. The Intelligent Digital Library [7] integrates a web-based visual environment for improving user-library interaction. Another graphical, web-based tool for document classification visualization is presented in [21]. While these methods address one or the other aspect of document, library and information space visualization, none of them provides the wealth of information presented by a physical object in a library, be it a hardcover book, a paperback or a video tape, with all the information that can be intuitively told from its very looks. Furthermore, many of the approaches described above require special-purpose hardware, limiting their applicability as interfaces to digital libraries. The libViewer provides a flexible way of visualizing information on the documents in a digital library by representing metadata in an intuitive way [31].

    In the context of digital libraries only a few projects report on the application of collaborative multiuser 3D environments. Christoffel and Schmitt developed a virtual representation of the University Library Karlsruhe employing the game engine used for the realization of Quake II [6]. In order to provide users with a familiar environment, the 3D representation of the library was modeled very similar to the real-world counterpart. Especially young people were attracted by this environment since they seemed to follow their play instinct.

    When shifting focus towards music, we witness the establishment of large music libraries, which has been supported by the emergence of powerful compression algorithms along with huge storage capabilities. These large music libraries require sophisticated search functionality that goes beyond simple matching of metadata such as artist, title, and genre. The query-by-humming approach introduced in the mid-1990s allows users to query songs by singing or humming melodies [12, 25]. Today, this technique has reached a mature state and has been implemented in the commercial online archive midomi.com.2 Other algorithms addressing melodic structures are regular expression-style queries [10] or query-by-example techniques to find cover versions of a music track [37].

    2 http://www.midomi.com.

    Other applications allow users to explore areas of related music instead of querying titles they already know. Torrens proposed three different visual representations for private music collections using genres to create sub-sections and the date of the tracks for sorting them [36]. Other works analyze the sound data for characteristic features and use a SOM to represent acoustically similar tracks. The PlaySOM and the PocketSOMPlayer [28], designed for small devices such as palmtops and mobile phones, allow users to generate playlists by marking areas on a map of music. Knees transformed the landscape into a 3D view and enriched the units of the SOM with images related to the music found on the Internet [16]. Besides SonicSOM, which follows the former principle, Lubbers proposed SonicRadar, a graphical interface comparable to a radar screen. The center of this screen is the actual viewpoint of the listener [22]. By turning around, users can hear multiple neighboring music titles; panning and loudness of the sounds describe their position relative to the user. Tzanetakis and Cook introduced Marsyas3D, an audio browser and editor for collaborative work on large sound collections [38]. A large-scale multiuser screen offers several 2D as well as 3D interfaces to browse for sound files, which are grouped by different sound characteristics. The MUSICtable provides a collaborative interface on a tabletop display that invites all participants to select music tracks in a playful manner [35].

    3 Spatial Content Organization

    3.1 Self-Organizing Map

    The self-organizing map (SOM) is a general unsupervised tool for ordering high-dimensional data in such a way that similar instances are grouped spatially close to one another [17, 18]. The model consists of a number of neural processing elements, i.e. units. These units are arranged according to some topology, where the most common choice is a 2D grid. Each of the units $i$ is assigned an $n$-dimensional weight vector $m_i$, $m_i \in \mathbb{R}^n$. It is important to note that the weight vectors have the same dimensionality as the instances.

    The training process of self-organizing maps may be described in terms of instance presentation and weight vector adaptation. Each training iteration $t$ starts with the random selection of one instance $x$, $x \in X$ and $X \subseteq \mathbb{R}^n$. This instance is presented to the self-organizing map and each unit determines its activation. Usually, the Euclidean distance between the weight vector and the instance is used to calculate a unit's activation. In this particular case, the unit with the lowest activation is referred to as the winner, $c$. Finally, the weight vector of the winner as well as the weight vectors of selected units in the vicinity of the winner are adapted. This adaptation is implemented as a gradual reduction of the difference between corresponding components of the instance and the weight vector, as shown in (1). Note that we use a discrete-time notation with $t$ denoting the current training iteration.

    $$m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t) - m_i(t)] \qquad (1)$$

    The weight vectors of the adapted units are moved slightly towards the instance. The amount of weight vector movement is guided by the learning rate, $\alpha$, which decreases over time. The number of units that are affected by adaptation, as well as the strength of adaptation depending on a unit's distance from the winner, is determined by the neighborhood function, $h_{ci}$. This number of units also decreases over time such that towards the end of the training process only the winner is adapted. The neighborhood function is unimodal, symmetric and monotonically decreasing with increasing distance to the winner, e.g. Gaussian.

    The movement of weight vectors has the consequence that the Euclidean distance between instances and weight vectors decreases. So, the weight vectors become more similar to the instance. Hence, the respective unit is more likely to win at future presentations of this instance. Adapting not only the winner but also a number of units in the neighborhood of the winner leads to a spatial clustering of similar instances in neighboring parts of the self-organizing map. Existing similarities between instances in the $n$-dimensional input space are reflected within the 2D output space of the self-organizing map. In other words, the training process of the self-organizing map describes a topology-preserving mapping from a high-dimensional input space onto a 2D output space. Such a mapping ensures that instances which are similar in terms of the input space are represented in spatially adjacent regions of the output space.
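
    The training procedure described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in The MediaSquare: the linear decay of the learning rate (called alpha here) and of the Gaussian neighborhood radius (sigma), the random initialization, and the function names train_som and winner are all choices made for this example.

```python
import math
import random


def winner(weights, x):
    """Return the grid position (row, col) of the unit closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, m in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(x, m))  # squared Euclidean
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(instances, rows, cols, iterations=2000, alpha0=0.5, seed=0):
    """Train a rows x cols map; weights[r][c] is the weight vector of a unit."""
    rng = random.Random(seed)
    n = len(instances[0])
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    weights = [[[rng.random() for _ in range(n)] for _ in range(cols)]
               for _ in range(rows)]
    for t in range(iterations):
        frac = t / iterations
        alpha = alpha0 * (1.0 - frac)             # learning rate decreases over time
        sigma = max(sigma0 * (1.0 - frac), 0.01)  # neighborhood shrinks over time
        x = rng.choice(instances)                 # random instance selection
        cr, cc = winner(weights, x)
        for r in range(rows):
            for c in range(cols):
                grid_d2 = (r - cr) ** 2 + (c - cc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                m = weights[r][c]
                for k in range(n):
                    m[k] += alpha * h * (x[k] - m[k])        # update rule (1)
    return weights
```

    After training, calling winner(weights, x) for every instance yields the 2D grid position at which that instance would be placed; similar instances tend to end up on the same or on neighboring units.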

    3.2 Text Feature Extraction

    In order to use the SOM for organizing documents based on their topics, a vector-based description of the content of the documents needs to be created. While manually or semi-automatically extracted content descriptors may be used, research results have shown that a rather simple word-frequency-based description is sufficient to provide the necessary information in a very stable way [5,19,27,31]. For this word-frequency-based representation, a vector structure is created consisting of all words appearing in the document collection. Stop words, i.e. words that do not contribute to content representation and topic discrimination between documents, are usually removed from this list of words. Again, while manually crafted stop word lists may be used, simple statistics allow the removal of most stop words in a very convenient and language- and subject-independent way. On the one hand, words appearing in too many documents, say, in more than half of all documents, can be removed without the risk of losing content information, as the content conveyed by these words is too general. On the other hand, words appearing in only a small number of documents can be omitted for content-based classification, as the resulting sub-topic granularity would be too small to form a topical cluster in its own right. Note that the situation is different in the information retrieval domain, where rather specific terms need to be indexed to facilitate retrieval of a very specific subset of documents. In this respect, content-based organization and browsing of documents constitutes a conceptually different approach to accessing document archives and interacting with them by browsing topical hierarchies. This obviously has to be supplemented by various searching facilities, including information retrieval capabilities as they are currently realized in many systems.
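
    The statistical stop word removal described above amounts to filtering the vocabulary by document frequency. The following sketch illustrates the idea; the function name prune_vocabulary and the lower bound min_df=2 are assumptions made for this example (the text only speaks of "a small number of documents"), while the upper bound of half of all documents follows the example given above.

```python
def prune_vocabulary(docs, max_df_ratio=0.5, min_df=2):
    """Keep terms that are neither too common nor too rare.

    docs is a list of token lists, one per document.
    """
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):          # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    upper = max_df_ratio * len(docs)      # e.g. half of all documents
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= upper)
```

    The returned list defines the dimensions of the feature space; no language-specific stop word list is required.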

    The documents are described by the words they are made up of within the resulting feature space, usually consisting of thousands of dimensions, i.e. distinct terms. While basic binary indexing may be used to describe the content of a document by simply stating whether or not a word appears in the document, more sophisticated schemes such as tf × idf, i.e. term frequency times inverse document frequency [34], provide a better content representation. This weighting scheme assigns higher values to terms that appear frequently within a document, i.e. have a high term frequency, yet rarely within the complete collection, i.e. have a low document frequency. Usually, the document vectors are normalized to unit length to make up for length differences of the various documents.
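
    As an illustration of this weighting scheme, the sketch below computes tf × idf vectors and normalizes them to unit length. Several variants of the inverse document frequency exist; the logarithmic form log(N / df) used here is an assumption for this example, as are the function and parameter names.

```python
import math


def tfidf_vectors(docs, vocabulary):
    """Return one unit-length tf x idf vector per document.

    docs is a list of token lists; vocabulary fixes the vector dimensions.
    """
    n_docs = len(docs)
    doc_freq = {t: sum(1 for tokens in docs if t in tokens) for t in vocabulary}
    vectors = []
    for tokens in docs:
        vec = []
        for term in vocabulary:
            tf = tokens.count(term)                       # term frequency
            df = doc_freq[term]
            idf = math.log(n_docs / df) if df else 0.0    # assumed idf variant
            vec.append(tf * idf)
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm > 0 else vec)
    return vectors
```

    The resulting vectors can be presented directly to a self-organizing map, since all of them share the dimensionality given by the vocabulary.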

    3.3 Audio Feature Extraction

    Content-based access to audio files, particularly music, requires the development of feature extraction techniques that capture the acoustic characteristics of the signal and that allow the computation of a similarity between pieces of music that resembles the acoustic similarity perceived by a listener. A feature set suitable for describing acoustic characteristics in music is the Rhythm Pattern. The algorithm was first introduced in [30] and later improved by the inclusion of psycho-acoustic transformations in [32]. The feature set has proven to be applicable to both classification of music into genres [20] and automatic clustering of music archives according to the perceived sound similarity [23]. A Rhythm Pattern describes fluctuations on critical frequency bands of the human auditory range and thus reflects the rhythmical structure of a piece of music. The algorithm for extracting a Rhythm Pattern is a two-stage process: First, from the spectral data the specific loudness sensation in Sone is computed for 24 critical frequency bands. Second, this Sonogram is transformed into a time-invariant domain, resulting in a representation of modulation amplitudes per modulation frequency.

    In more detail, in the first part, a Short Time Fourier Transform (STFT) is applied to compute a Spectrogram, whose frequency bands are then grouped according to the Bark scale into 24 psycho-acoustically motivated critical bands. Subsequently, the Bark-scale Spectrogram is transformed into the decibel, Phon and Sone scales, resulting in a power spectrum that reflects human loudness sensation. In the second part, by applying a Fast Fourier Transform (FFT), the spectrum is transformed into a time-invariant representation showing the magnitude of amplitude modulations for different modulation frequencies on the 24 critical bands. These amplitude modulations have different effects on the human hearing sensation depending on their frequency. The most significant is referred to as fluctuation strength, which is most intense at 4 Hz and decreases towards 15 Hz. Consequently, a fluctuation strength weighting curve is applied, followed by a gradient filter and Gaussian smoothing, to improve resemblance between two Rhythm Patterns.

    A Rhythm Pattern is typically computed for every third segment of 6 s length in a song, and the feature set for a song is computed by taking the median of multiple Rhythm Patterns. A Rhythm Pattern constitutes a comparable representation of a song, which may be used to compute the similarity between two songs using a distance measure such as the Euclidean distance or any other metric. Thus, the RP may be used as input for a self-organizing map in order to automatically compute a similarity-based organization of a music collection.
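
    The aggregation step described in this paragraph can be sketched as follows. The psycho-acoustic extraction of a single Rhythm Pattern is not reproduced here; the sketch assumes that one feature vector per 6 s segment is already available. The function names, and the choice to start with the first segment when taking every third one, are assumptions made for this example.

```python
from statistics import median


def split_segments(samples, sample_rate, seconds=6):
    """Cut a sample sequence into consecutive, non-overlapping 6 s segments."""
    size = sample_rate * seconds
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def song_feature(segment_patterns, step=3):
    """Component-wise median over every third per-segment feature vector."""
    selected = segment_patterns[::step]
    return [median(column) for column in zip(*selected)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

    Two songs can then be compared by calling euclidean on their aggregated feature vectors, and the same vectors can serve as training instances for a self-organizing map.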

    4 The System Architecture

    The goal of The MediaSquare is to provide a 3D virtual environment allowing multiple users to explore large multimedia repositories such as music or text collections as well as video or image galleries. The underlying system architecture is depicted in Fig. 2, and the technological building blocks are described in the following. The core of the system is the Torque3 game engine, which is designed according to a strict client-server architecture. When the Torque game engine is executed on a single computer, both client and server are handled by the same machine. Communication between client and server is enabled by means of a very robust networking protocol which allows accurate update rates even over low-bandwidth Internet connections. The Torque server is responsible for the execution of the virtual environment. This includes the instantiation of the environment on startup and the coordination of objects and users. The Torque client is mainly responsible for the user interface and the audiovisual representation of the virtual environment. The game logic is written in TorqueScript, which is compiled into byte-code before processing. Additionally, the source code of the engine can be adapted and extended to realize more complex and time-critical tasks. 3D objects, textures and sound files are stored in a special folder that resides on both the server and the client. Whenever a client connects to the server, the version of each file is checked and, if necessary, updated files are automatically downloaded to the client.

    3 http://www.garagegames.com.


    Fig. 2. The system architecture

    The Torque game engine runs on all major operating systems. It provides a comprehensive set of design and development tools including a World Editor, a GUI Editor and a Terrain Editor, which assist during the creation of arbitrary games. Moreover, it offers multi-player network code, seamless indoor/outdoor rendering engines, state-of-the-art skeletal animation, drag-and-drop GUI creation, and a C-like scripting language. For smooth execution, Torque requires an Intel Pentium 4 processor and 128 MB RAM with an OpenGL-compatible 3D graphics accelerator card. In addition, and unlike most commercial game engines, the source code of the engine is distributed as part of the low-cost royalty-free licensing policy, which facilitates the creation of the 3D multimedia environment The MediaSquare.

    Fig. 3. Directory-based mapping of a slide show

    The system is designed to enable access to various media types, namely audio, image and video, as well as text. To this end, the system offers directory-based mapping and automatic content-based organization of multimedia data. In order to integrate the respective data items into the environment, a number of preprocessing steps need to be applied. In the case of directory-based mapping, the data is manually organized according to a predefined directory structure as depicted in Fig. 3. The first-level elements in the directory structure are folders grouping related data. On the second level in the directory structure, i.e. in sub-folders, the actual data is stored. Consider, for example, three slide shows consisting of five, six and four slides respectively (cf. Fig. 3). In this case, the slides of each particular slide show are converted into a set of separate image files, which are, in turn, used as textures of presentation screens in the virtual environment. Every folder, regardless of its hierarchy level, contains a metadata information file describing the content of the folder in terms of title, author, date and the number of sub-elements.
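    To make the two-level directory convention more concrete, the following sketch walks such a structure and collects the image files of every second-level folder. It is only an illustration under stated assumptions: the chapter does not name the metadata file or its format, so the file name `info.txt` and the function `read_directory_mapping` are hypothetical.

```python
from pathlib import Path


def read_directory_mapping(root, info_name="info.txt"):
    """Collect first-level folders and the image files in their sub-folders."""
    mapping = {}
    for group in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        shows = {}
        for show in sorted(p for p in group.iterdir() if p.is_dir()):
            # Everything except the (hypothetical) metadata file counts as a slide.
            shows[show.name] = sorted(f.name for f in show.iterdir()
                                      if f.is_file() and f.name != info_name)
        mapping[group.name] = shows
    return mapping
```

    Each entry of the returned dictionary corresponds to one first-level folder, and each inner entry lists the slide images that would serve as textures for one presentation screen.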

    In case of content-based organization, self-organizing maps are employed to automatically structure media data. Depending on the media type, the corresponding feature extraction technique is selected, i.e. when text documents are processed, the term-based feature extraction approach is used, and in case of music files, the audio feature extraction approach is employed. These features are used as input for the self-organizing map algorithm. The resulting 2D map is described in terms of a Unit file, which determines the final position of each data item along with the dimensions of the map.
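    The following sketch illustrates how a small self-organizing map turns feature vectors into 2D unit positions. It is a generic textbook-style SOM, not the training procedure used by the authors; the decay schedules, the parameter values and all function names are assumptions, and the returned dictionary merely stands in for the information a Unit file would carry.

```python
import math
import random


def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM with a shrinking Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per map unit, addressed as weights[(row, col)].
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                    # decays from 1 towards 0
        rate = 0.5 * frac + 0.01                       # learning rate (assumed schedule)
        radius = max(rows, cols) / 2.0 * frac + 0.5    # neighbourhood width
        for vec in data:
            bmu = best_matching_unit(weights, vec)
            for unit, w in weights.items():
                grid_dist2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (vec[i] - w[i])
    return weights


def best_matching_unit(weights, vec):
    """Return the (row, col) of the unit whose weights are closest to vec."""
    return min(weights, key=lambda unit: math.dist(weights[unit], vec))


def unit_assignments(weights, items):
    """Map every item name to the grid position of its best-matching unit."""
    return {name: best_matching_unit(weights, vec) for name, vec in items.items()}


# Toy feature vectors (hypothetical): two loose groups of documents.
items = {"doc_a": [0.0, 0.1], "doc_b": [0.1, 0.0],
         "doc_c": [0.9, 1.0], "doc_d": [1.0, 0.9]}
som = train_som(list(items.values()), rows=2, cols=2)
print(unit_assignments(som, items))
```

    The map dimensions together with the item-to-unit assignments are the two pieces of information that the text attributes to the Unit file.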

    The Torque Mission file is used to specify the characteristics of the virtual environment. This includes properties such as the topology of the terrain, the positions of static objects such as buildings and interiors, and environmental entities such as the sun or the sky. In order to enable access to the media within the virtual world, Marker areas are created, which describe designated places in the virtual world and specify the position of the media. These Marker areas are rectangular objects that are invisible during runtime. Additionally, these objects contain properties that specify the underlying SOM and which specific parts thereof are selected for a particular area in the environment. On the one hand, this allows for the representation of more than one SOM in the virtual environment and, on the other hand, it makes it possible to segment a single SOM. Figure 4 depicts the segmentation of a SOM by means of Marker areas. In particular, a 4 × 4 SOM is divided into two segments, each forming a 2 × 4 SOM. Each segment is mapped onto one room in the virtual environment, i.e. Segment 1 is mapped onto Room 1 and Segment 2 is mapped onto Room 2. In the context of directory-based mapping, Marker areas indicate the position where the automatic layout process of the corresponding objects starts. The properties of the Marker area object specify the URI of the associated directory and the algorithm used to lay out the graphical representations of the directory. Three different algorithms to lay out the directory structure are provided. The linear layout algorithm aligns objects, such as buildings, along a straight path. The generated layout is comparable to a residential area with detached houses. When using the circular layout algorithm, objects are arranged along a circle with radius r. In the case of the matrix-style layout, buildings are arranged like a checkerboard, and the distance between the buildings can be freely defined.

    Fig. 4. Segmentation of a SOM by means of Marker areas
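    The SOM segmentation and the three layout strategies can be sketched in a few lines. The code is a simplified 2D illustration under assumptions: the function names, the half-open index ranges and the orientation of the 4 × 4 split are hypothetical, and rotation as well as the third coordinate of the virtual world are left out.

```python
import math


def segment_units(rows, cols, row_range, col_range):
    """Select the SOM units covered by one Marker area (half-open ranges)."""
    return [(r, c)
            for r in range(*row_range) for c in range(*col_range)
            if 0 <= r < rows and 0 <= c < cols]


def linear_layout(n, spacing, origin=(0.0, 0.0)):
    """Place n objects along a straight path starting at origin."""
    x0, y0 = origin
    return [(x0 + i * spacing, y0) for i in range(n)]


def circular_layout(n, radius, centre=(0.0, 0.0)):
    """Place n objects evenly on a circle of the given radius."""
    cx, cy = centre
    return [(cx + radius * math.cos(2 * math.pi * i / n),
             cy + radius * math.sin(2 * math.pi * i / n)) for i in range(n)]


def matrix_layout(n, columns, spacing, origin=(0.0, 0.0)):
    """Place n objects row by row on a grid with a freely chosen spacing."""
    x0, y0 = origin
    return [(x0 + (i % columns) * spacing, y0 + (i // columns) * spacing)
            for i in range(n)]


# A 4 x 4 SOM split into two segments of eight units each (cf. Fig. 4);
# whether the split runs along rows or columns is an assumption here.
segment_1 = segment_units(4, 4, (0, 2), (0, 4))
segment_2 = segment_units(4, 4, (2, 4), (0, 4))
print(len(segment_1), len(segment_2))
print(matrix_layout(5, columns=2, spacing=3.0))
```

    In all three layouts the Marker area would supply the starting point (`origin` or `centre`), while the number of objects follows from the directory contents.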

    The Repository contains templates of objects that can be used to visually represent the media in the virtual environment. More precisely, a template for a SOM unit consists of interiors, a label describing the media and an object that graphically represents the media. In case of directory-based mapping, the template is a container for multiple media objects including labels for the container and the media itself. These templates are created with the Torque World Editor and stored in the Repository.

    A Wrapper application processes the Unit files generated by the SOM as well as the manually compiled directory. The Wrapper scans the Mission file for Marker areas and imports the associated templates from the Repository. Then it loads the SOM-generated Unit files as well as the manually generated directory hierarchies into its internal object structure. It calculates the position and rotation of each media object. Subsequently, the Wrapper writes the information about all representative objects into an Objects file and creates the Playlists for the Icecast⁴ multimedia streaming server. Since Torque does not provide network-based streaming of audio files, it was necessary to adapt the audio emitter class of the game engine. The new implementation takes advantage of FMOD,⁵ a very flexible sound API that offers native support for HTTP streaming. Objects instantiating this audio emitter class may be positioned in the virtual environment and enable broadcasting of MP3 audio streams as spatial sound. On the client side, an internal audio manager continuously determines the distances between the user's avatar and all audio sources in its vicinity. Only those audio sources that are closer than a certain distance to the user actually stream music.
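    The distance test performed by the client-side audio manager can be illustrated as follows. This is a sketch of the idea only, not the adapted Torque/FMOD code: the `AudioSource` class, the function name and the threshold value are assumptions introduced for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioSource:
    name: str
    position: tuple  # (x, y, z) in world coordinates
    streaming: bool = False


def update_streaming(avatar_position, sources, max_distance):
    """Switch streaming on only for sources within max_distance of the avatar."""
    for source in sources:
        distance = math.dist(avatar_position, source.position)
        source.streaming = distance < max_distance
    return [s.name for s in sources if s.streaming]


# Hypothetical scene: three emitters, avatar near two of them.
sources = [AudioSource("jazz", (0.0, 0.0, 0.0)),
           AudioSource("rock", (30.0, 0.0, 0.0)),
           AudioSource("punk", (5.0, 5.0, 0.0))]
print(update_streaming((1.0, 1.0, 0.0), sources, max_distance=10.0))
```

    Calling such a function whenever the avatar moves keeps the number of open streams proportional to the sources nearby rather than to the size of the whole collection.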

    Automatically generated textures are used to label the media. In case of directory-based mapping, the information files are used to generate labels describing the contents of each directory. For the content-based organization, the information stored in the SOM Unit files is used to determine the labels' descriptions. When the featured media is music, each corresponding audio emitter object is labeled with a playlist that is automatically created by extracting the ID3 tags from the music files.

    When the Torque server starts up, it loads the Mission file and creates the 3D environment. Then it processes the Objects file and dynamically adds all media objects. In parallel, the Icecast server is started, enabling audio streaming in the virtual environment. After that, the system is up and running and ready to accept connections from Torque clients.

    5 The MediaSquare

    The current implementation of The MediaSquare covers two scenarios of multimedia content presentation. First, a 3D Music Showroom was implemented, providing access to a music collection that was automatically organized based on the audio content. Second, a 3D Scientific Library was realized. This library enables users to explore scientific documents such as posters or papers in this immersive 3D environment. On the one hand, a directory structure was used to create a room structure in which the content is presented. On the other hand, characteristic text features were extracted from documents and were used for the training of a self-organizing map.

    4 http://www.icecast.org.
    5 http://www.fmod.de.

    In order to participate in The MediaSquare, the client application⁶ needs to be downloaded and installed. On the start screen, users can either change the display settings or proceed to the virtual world. When clicking on the start button, the user can select her favorite avatar and enter a name. After that, the avatar is placed right in the center of The MediaSquare. The avatar is navigated by means of the keyboard and its viewpoint is controlled via the mouse. On pressing F4, a little chat window appears. When pressing "c", a conversation with other users can be started. Everything that is said will be heard by others in the vicinity.

    In the 3D Music Showroom users can listen to streamed music originating from loudspeakers on coffee tables (cf. Fig. 5). The music collection used in The MediaSquare is from Magnatune,⁷ which is distributed under the Creative Commons license for non-commercial use. This particular collection contains about 1,500 MP3 files featuring the genres classical, electronic, jazz, blues, pop, rock, metal, punk and world music. Since the Icecast multimedia streaming server broadcasts music like a radio station, it is ensured that all users are listening to the same music track when at the same location. Depending on the user's position relative to the audio sources, one or more spatialized audio streams are audible. When the user's avatar is close to an audio source, a head-up display (HUD) shows the currently playing track as well as the corresponding playlist. This HUD can be toggled with the F6 key. On the lower left of the screen, detailed information about the currently playing track is displayed. When clicking the left mouse button, the audio stream of the respective source skips to the next track and all users close to this particular source will notice the change.

    Fig. 5. 3D Music Showroom with head-up display

    6 Available for download at http://mediasquare.ec3.at.
    7 http://magnatune.com.

    When reflecting on the above, alternative implementations of the same concept can be considered. For example, a collection of music that is automatically organized by means of sound similarity could be visualized as a music store. Replacing coffee tables with shelves and playlist menus with CDs results in a visual representation similar to its real-life counterpart. So, the same principles of content organization can be employed, whereas the resulting visualization, user interaction and means for product consumption differ completely from the original implementation.

    The 3D Scientific Library of The MediaSquare provides access to the scientific results of the EU FP6 Network of Excellence on Multimedia Understanding through Semantics, Computation, and Learning (MUSCLE⁸). In this case, a directory structure is used to create a layout of rooms in which the content is presented. The presentations are grouped according to different scientific meetings and each directory contains several presentations that have been given there. The information file associated with each meeting describes the location and the date it was held. In case of a presentation, the information file contains the title and the authors' names. This directory structure is mapped into the 3D virtual environment according to the circular layout algorithm. As a result, the environment contains several buildings, each of which represents a particular meeting, as shown in Fig. 6. Labels describing the meetings' locations and dates are placed at the corresponding entrances. These buildings feature presentation screens that are attached to the walls and are used for data visualization. Labels containing the metadata are attached exactly opposite each presentation screen. The textures of the screens change at predefined time intervals.

    Fig. 6. Circular mapping of a directory structure in the 3D Scientific Library

    8 http://www.muscle-noe.org.

    Fig. 7. Floor plan of the building (left) and enlargement of main hall (right) with SOM unit positions and topic labels

    In another building, automatically organized scientific posters are presented. In particular, the poster contributions to the WWW conference in 2006 are on display. Figure 7 shows the floor plan of the building. It consists of a main hall and a gallery that can be reached via stairs. The poster contributions are arranged according to the mapping of a rectangular SOM consisting of 4 × 3 units (main hall) and a U-shaped Mnemonic SOM [24] that fits the gallery. The enlargement of the floor plan shows the layout of the poster topics. As an example, posters dealing with ontologies, the Semantic Web and related technologies and standards are located in the top row. In the middle row, we find well-separated clusters dealing with Semantic Web applications for museums, social Internet as well as more technical contributions in the area of information retrieval and Web crawling. In the bottom row, the topics of association rule mining, clustering, news feed analysis and link analysis are located. A screenshot depicting a slightly elevated view from the bottom left corner of the room is presented in Fig. 8. The different sizes of the poster stands are automatically determined and depend on the number of posters assigned to the respective SOM units.


    Fig. 8. View from the bottom left corner of the main hall of the 3D Scientific Library

    6 Conclusions

    In this chapter, we have described The MediaSquare, a synthetic 3D multimedia environment that allows multiple users to collectively explore multimedia data and interact with each other. The data is organized within the 3D virtual world either based on content similarity, or by mapping a given structure (e.g. a branch of a file system hierarchy) into a room structure. With this system it is possible to take advantage of spatial metaphors such as relations between items in space, proximity and action, common reference and orientation, as well as reciprocity. In this context it is essential to refer to Friedman [11]. In his seminal book on the phenomenon of the flattening of the globe, he emphasizes that the world is effectively becoming smaller. This shrinking is caused by the lightning-swift advances in technology and communications which put people all over the globe in touch as never before. Environments such as The MediaSquare support this trend by allowing geographically separated individuals to immerse themselves in a collaborative virtual environment, interact with each other and collectively experience the featured content. In a nutshell, The MediaSquare presents a showcase for combining state-of-the-art multimedia feature extraction approaches and unsupervised neural networks assembled in an immersive 3D multimedia content presentation environment.

    Current approaches (cf. Sect. 2) for the visualization of and interaction with large data collections mainly focus on single media or document types and their arrangement within space. The MediaSquare, however, provides access to several media types in one integrated environment. As of now, the showcase comprises audio, images, and text documents arranged by means of an automatic content-based clustering algorithm. Additionally, The MediaSquare goes beyond other approaches by providing an environment that fosters social interaction among users while they collaboratively experience the content on display.

    Future work includes improved user interface capabilities, a tighter integration of the individual components of the system as well as the integration of additional feature extraction modules for other media types. Moreover, the second scenario, S2, aiming at the realization of a 3D Video and Image Showroom, will be implemented. To this end, methods for extracting characteristic features from the respective images or videos need to be included. The training of a self-organizing map will be based on these features and, in analogy to the first scenario, the resulting 2D map will identify the actual position of each particular image or video source within the 3D Video and Image Showroom.


    Acknowledgments

    This work was partially funded by the Austrian Federal Ministry of Economics and Labour under the kind research program and the MUSCLE Network of Excellence (project reference: 507752).


    References

    1. H. Ahonen, O. Heinonen, M. Klemettinen, and A.I. Verkamo. Applying data mining techniques for descriptive phrase extraction in digital documents. In Proceedings of the Advances in Digital Libraries Conference (ADL'98), page 2, Santa Barbara, CA, 1998. IEEE Computer Society.

    2. R.B. Allen, P. Obry, and M. Littman. An interface for navigating clustered document sets returned by queries. In Proceedings of the Conference on Organizational Computing Systems (COCS'93), pages 166–171, Milpitas, CA, 1993. ACM.

    3. M.Q. Baldonado and T. Winograd. Sensemaker: An information-exploration interface supporting the contextual evolution of a user's interests. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 11–18, Atlanta, GA, 1997. ACM.

    4. E. Castronova. Synthetic Worlds: The Business and Culture of Online Games. University of Chicago Press, Chicago, IL, 2005.

    5. H. Chen, C. Schuffels, and R. Orwig. Internet categorization and search: A self-organizing approach. Journal of Visual Communication and Image Representation, 7(1):88–102, 1996.

    6. M. Christoffel and B. Schmitt. Accessing libraries as easy as a game. In JCDL 2002 Workshop: Visual Interfaces to Digital Libraries, pages 25–38, London, UK, 2002. Springer.

    7. M.F. Costabile, F. Esposito, G. Semeraro, N. Fanizzi, and S. Ferilli. Interacting with idl: The adaptive visual interface. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 515–534, Heraklion, Greece, 1998. Springer.


    8. P. Cubaud, J. Dupire, and A. Topol. Fluid interaction for the document in context. In Proceedings of the 2007 Conference on Digital Libraries (JCDL'07), page 504, Vancouver, Canada, 2007. ACM.

    9. P. Cubaud, C. Thiria, and A. Topol. Experimenting a 3D interface for the access to a digital library. In Proceedings of the Third ACM Conference on Digital Libraries (DL'98), pages 281–382, Pittsburgh, PA, 1998. ACM.

    10. M.J. Dovey. A technique for regular expression style searching in polyphonic music. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2001), 2001.

    11. T.L. Friedman. The World is Flat: A Brief History of the Twenty-First Century. Farrar, Straus and Giroux, New York, 2005.

    12. A. Ghias, J. Logan, D. Chamberlin, and B.C. Smith. Query by humming: musical information retrieval in an audio database. In Proceedings of the Third ACM International Conference on Multimedia, pages 231–236, New York, NY, USA, 1995. ACM.

    13. S. Greenberg and M. Roseman. Using a room metaphor to ease transitions in groupware. In M. Ackerman, V. Pipek, and V. Wulf, editors, Sharing Expertise: Beyond Knowledge Management, pages 203–256. MIT, Cambridge, MA, January 2003.

    14. M. Hearst and C. Karadi. Cat-a-cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. SIGIR Forum, 31(SI):246–255, 1997.

    15. T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WEBSOM – Self-organizing maps of document collections. In Proceedings of the Workshop on Self-Organizing Maps (WSOM'97), Espoo, Finland, 1997.

    16. P. Knees, M. Schedl, T. Pohle, and G. Widmer. An innovative three-dimensional user interface for exploring music collections enriched with meta-information from the web. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 17–24. ACM, 2006.

    17. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.

    18. T. Kohonen. Self-Organizing Maps. Springer, Berlin Heidelberg New York, 1995.

    19. T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela. Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000.

    20. T. Lidy and A. Rauber. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 34–41, London, UK, September 11–15, 2005.

    21. Y. Liu, P. Dantzig, M. Sachs, J. Corey, M. Hinnebusch, T. Sullivan, M. Damashek, and J. Cohen. Visualizing document classification: A search aid for the digital library. In Proceedings of the European Conference on Digital Libraries, Heraklion, Greece, 1998.

    22. D. Lubbers. SoniXplorer: Combining visualization and auralization for content-based exploration of music collections. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 590–593, 2005.

    23. R. Mayer, T. Lidy, and A. Rauber. The map of mozart. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Victoria, Canada, October 8–12, 2006.


    24. R. Mayer, D. Merkl, and A. Rauber. Mnemonic SOMs: Recognizable shapes for self-organizing maps. In M. Cottrell, editor, Proceedings of the Fifth Workshop on Self-Organizing Maps (WSOM'05), pages 131–138, Paris, France, September 5–8, 2005.

    25. R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, and S.J. Cunningham. Towards the digital music library: tune retrieval from acoustic input. In Proceedings of the First ACM International Conference on Digital Libraries (DL'96), pages 11–18, New York, NY, USA, 1996. ACM.

    26. D. Merkl and A. Rauber. Automatic labeling of self-organizing maps for information retrieval. In Proceedings of the International Conference on Neural Information Processing (ICONIP'99), Perth, WA, 1999.

    27. D. Merkl and A. Rauber. Document classification with unsupervised neural networks. In F. Crestani and G. Pasi, editors, Soft Computing in Information Retrieval, pages 102–121. Physica, 2000.

    28. R. Neumayer, M. Dittenbach, and A. Rauber. PlaySOM and PocketSOMPlayer, Alternative interfaces to large music collections. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 618–623, 2005.

    29. A. Pejtersen. A library system for information retrieval based on cognitive task analysis and supported by an icon-based interface. In Proceedings of the Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'89), 1989.

    30. A. Rauber and M. Fruhwirth. Automatically analyzing and organizing music archives. In Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2001), Springer Lecture Notes in Computer Science, Darmstadt, Germany, September 4–8, 2001. Springer.

    31. A. Rauber and D. Merkl. Text mining in the SOMLib digital library system: The representation of topics and genres. Applied Intelligence, 18(3):271–293, 2003.

    32. A. Rauber, E. Pampalk, and D. Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In Proceedings of the 3rd International Symposium on Music Information Retrieval (ISMIR 2002), pages 71–80, Paris, France, October 13–17, 2002.

    33. G. Robertson, S. Card, and J. Mackinlay. Information visualization using 3D interactive animation. Communications of the ACM, 36(4):57–71, 1993.

    34. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

    35. I. Stavness, J. Gluck, L. Vilhan, and S. Fels. The MUSICtable: A map-based ubiquitous system for social interaction with a digital music collection. In Proceedings of the 4th International Conference on Entertainment Computing (ICEC 2005), 2005.

    36. M. Torrens, P. Hertzog, and J.-L. Arcos. Visualizing and exploring personal music libraries. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), 2004.

    37. W.-H. Tsai. A query-by-example technique for retrieving cover versions of popular songs with similar melodies. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 183–190, 2005.

    38. G. Tzanetakis and P. Cook. Marsyas3D: A prototype audio browser-editor using a large scale immersive visual and audio display. In Proceedings of the International Conference on Auditory Display, 2001.


    39. B.S. Woodcock. An analysis of mmog subscription growth. http://www.mmogchart.com/.

    40. N. Yee. The psychology of massively multi-user online role-playing games: Emotional investment, motivations, relationship formation, and problematic usage. In R. Schroeder and A. Axelsson, editors, Avatars at Work and Play: Collaboration and Interaction in Shared Virtual Environments, volume 34 of Computer Supported Cooperative Work. Springer, Heidelberg, Germany, 2005.

  • Robotics and Virtual Reality: A Marriage of Two Diverse Streams of Science

    Tauseef Gulrez¹, Manolya Kavakli¹, and Alessandro Tognetti²

    ¹ Virtual Interactive Simulations of Reality (VISOR) Research Group, Department of Computing, Division of Information and Communication Sciences, Macquarie University, Sydney, NSW 2109, Australia. Corresponding author: gtauseef@ics.mq.edu.au

    ² Interdepartmental Research Center "E. Piaggio", Faculty of Engineering, University of Pisa, Italy

    Summary. In an immersive computationally intelligent virtual reality (VR) environment, humans can interact with a virtual 3D scene and navigate a robotic device. The non-destructive nature of VR makes it an ideal testbed for many applications and a prime candidate for use in rehabilitation robotics simulation and patient training. We have developed a testbed for robot-mediated neurorehabilitation therapy that combines the use of robotics, computationally intelligent virtual reality and haptic interfaces. We have employed the theories of neuroscience and rehabilitation to develop methods for the treatment of neurological injuries such as stroke, spinal cord injury, and traumatic brain injury. As sensor input we have used two state-of-the-art technologies, representing two different approaches to solving the mobility loss problem. In our first experiment we have used a shirt laden with 52 piezoresistive sensors as an input device to capture the residual signals arising from the patient's body. In our second experiment, we have used a precision position tracking (PPT) system to capture the same signals from the patient's upper body movement. The key challenge in both of these experiments was to accurately localise the movement of the object in reality and map its corresponding position in 3D VR. In this book chapter, we describe the basic theory of the development phase and of the operation of the complete system. We also present some preliminary results obtained from subjects using upper body postures to control the simulated wheelchair.

    1 Introduction

    T. Gulrez et al.: Robotics and Virtual Reality: A Marriage of Two Diverse Streams of Science, Studies in Computational Intelligence (SCI) 96, 99–118 (2008)
    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    Vision is considered the most effective sensor of the human body. In school we learnt that 70% of human body sensing relies solely upon human vision (eyes). In our view, human perception is heavily correlated with the human experiences built over time. It would not be wrong to say that humans build a repository of their life experiences over time, similar to the program database of a computer. This repository of information is also known as Internal Models of the Human Mind [16, 21, 32, 40, 41, 44, 58, 63]. Life experiences are temporarily stored as information in the hippocampus area of the human brain and later shifted to the neocortex area of the brain. Obviously not all experience information is shifted to the neocortex, but the most significant information from our experiences is stored as a function file. Later, whenever human beings want to evaluate a new experience, they recall those repository functions and estimate their predictions based upon the new experiences. Upon getting the results they correct their estimates with the innovation gains just like Bayes' Rule [8], i.e.

    \[ \text{Posterior} \propto \text{Likelihood} \times \text{Internal Models} \tag{1} \]

    This idea of relating Bayes' Rule to the human mind has been shown by many researchers [7, 23, 29, 32, 61, 63].
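    Relation (1) can be illustrated with a small discrete example in which the internal model plays the role of the prior. The sketch below is only meant to show the arithmetic of the update; the hypotheses, the probabilities and the function name are illustrative assumptions and are not taken from the cited studies.

```python
def bayes_update(prior, likelihood):
    """Return the normalised posterior over the hypotheses in prior."""
    unnormalised = {h: likelihood[h] * p for h, p in prior.items()}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("the observation is impossible under every hypothesis")
    return {h: value / total for h, value in unnormalised.items()}


# Internal model (prior): which kind of place am I probably in?
prior = {"garden": 0.5, "office": 0.5}
# Likelihood of smelling flowers in each place (illustrative numbers).
likelihood = {"garden": 0.75, "office": 0.25}
posterior = bayes_update(prior, likelihood)
print(posterior)
```

    The normalisation step is what turns the proportionality in (1) into a proper probability distribution.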

    The question that arises from the above discussion is whether or not it is possible to deceive the human central nervous system [46] by showing artificial surrounding (virtual immersive) scenes, generated through sophisticated computerised projectors, which can show a context similar to the repository information of the human mind (internal models). The idea is to make the human mind feel that the computerised projection of the virtual environment is exactly the same as what is experienced every day, e.g. a simulated office environment or house that human beings are familiar with. Although 70% of human sensing comes from vision, there is another 30% composed of other senses such as touch, smell and feel and, most importantly, the human vestibular system. The human vestibular system enables us to orientate ourselves, and it also gives us a feel for acceleration and velocity whilst sitting in a moving car or train, for example. With virtual reality technology, we can make a human being feel that he is immersed in an artificial reality environment, but it is difficult to deceive the other 30% of sensory receptors in the same environment. For example, in a virtually projected garden with pathways and flower beds, the lack of fragrance and a cool breeze will eliminate the illusion of being in a real garden, since human beings have never encountered a garden without the smell of flowers, etc. The human internal model has to be altered in order for humans to feel the sense of the fabricated virtual environment, in this case a garden. Presently, the only option available to recreate the virtual environment is through a fully immersive virtual reality environment; in addition, we can incorporate sensing devices like data gloves or motion tracking systems to make the user interactive in that environment.

    The rest of the book chapter is organised as follows. In the next few paragraphs we describe the advances in the field of computationally intelligent VR and robotics, followed by disability around the world and an overview of human brain anatomy. In the second section we describe the novel sensor shirt and the precision position tracking system interfaced with the computationally intelligent VR systems. In the third section we describe the prototype testbed for stroke and spinal cord injured patients, followed by the concluding remarks.


    Advances in Virtual Reality

    Virtual reality has been instrumental in bringing research into the real world. VR has been conceived as a tool to liberate consciousness, a digital mandala for the cyberian age. VR refers to computer-generated, interactive, 3D environments into which people are immersed. It provides a way for people to visualize, manipulate, and interact with simulated environments through the use of computers and extremely complex data. Although still a developing technology, VR has already been successfully integrated into several aspects of medicine and psychology. For example, VR is being used in the training of surgical procedures [57], the education of patients and medical students [38] and the treatment of psychological dysfunction including phobias [56], post-traumatic stress disorder [55], and eating and body image disorders [54]. Pain management methods integrating VR to distract patients' attention from uncomfortable procedures, such as dental work, chemotherapy, and burn wound care, have also produced encouraging results [11, 26, 45]. Additionally, a number of researchers have integrated VR into the assessment and rehabilitation of cognitive processes, such as visual perception and executive functions [53], and into training for fundamental activities of daily life, such as using public transport [10] and meal preparation tasks [14]. The most important feature of VR is its non-destructive testing nature, which enables us to conduct complicated and dangerous experiments which in reality could lead to fatal human injuries. Similarly, in order to understand the functionality of the brain, we need a device that can help perform human motor tasks and build an environment which can help neuroscientists find solutions to research problems. VR makes it possible to rapidly present various rehabilitation tasks with no setup and breakdown time, and provides many more important possibilities that are not available with real-world applications, such as distortions of reality. The properties of objects can be changed in an instant, and this element of surprise is critical for studying how the sensorimotor system reacts and adapts to new situations.

    Disability at a Glance

    According to the World Health Organisation's (WHO) 2006 World Report on Disability and Rehabilitation [6]:

    An estimated 10% of the world's population, approximately 600 million people, of which 200 million are children, experience some form of physical, mental, or intellectual disability. The disabled population is growing as a result of factors such as population growth, ageing and medical advances that are prolonging life. These trends are creating considerable demands for health and rehabilitation services and require environmental and attitudinal changes.

    In Australia, the most recent National Survey of Disability, Ageing and Carers (SDAC) [1], conducted in 1998 by the Australian Bureau of Statistics (ABS), estimated that 3.6 million Australians (19%) had some form of disability [22]. Of these, 2.8 million (78%) had a core activity restriction in self-care, mobility or communication caused by their disability [1].

    The Australian Institute of Health and Welfare [1, 3] has recently presented projections of the numbers of people with either profound or severe core activity restrictions for the years 2000–2031, based on the prevalence of disability in the SDAC [2, 3]. Projections indicate [2] a 70% increase in the number of older people with profound disability over the next 30 years. The main conditions associated with profound or severe core activity restriction in older Australians are musculoskeletal, nervous system, circulatory and respiratory conditions and stroke. There is a clear need for a capable testbed for scientific study of upper-extremity motion. Robotic devices, designed to interface with humans, have already led to great advances in both fundamental and clinical research on the sensory motor system.

    Robotics Technology for Rehabilitation

    Advancement in technology has always brought a light of hope for the disabled population. In particular, new state-of-the-art robotic devices have always played a vital role in the development of more effective rehabilitation devices. Robotics is considered an angel of survival in the disabled community [46]. The degree of impairment differs for each disability, including stroke, cerebral palsy, tetraplegia, paraplegia, amputation, arthritis, osteoarthritis and heart disease. Devices are modified according to the disabled person's ability, allowing him or her to operate the device in accordance with their degree of movement. Sip-and-puff or chin-controlled electrically powered wheelchairs are a good example of modified robotic rehabilitation devices. Operating a robotic rehabilitation device usually requires training to learn how to control it. As long as the patient operates a rehabilitation device under the supervision of an occupational therapist or clinician, they have low to no risk of harming themselves if they mishandle the device. In the worst case, the lack of appropriate training to control a rehabilitation device can lead to a fatal injury, and the technology as a life saviour can become a killer. Spinal cord injured patients [52], who generally have poor control of their upper body, are at greater risk of encountering difficulties and accidents when operating a powered wheelchair.

    The use of robots for providing physiotherapy is a relatively new discipline within the area of medical robotics. It emerged from the idea of using robots to assist people with disabilities. The adaptation of robotic devices to assist in neurorehabilitation was first identified by Hogan at MIT [27]. There is currently a high rate of expansion in the field of neurorehabilitation. This rapid growth can be attributed to several factors, the first being the emergence of hardware for haptics and advanced robotics that could be made to operate safely within a human workspace. The dramatic drop in the cost of computing, along with the emergence of software to support real-time control, further reduces the cost of producing research prototypes and commercial products. This technological shift has been coupled with better knowledge of the rehabilitation process and the social need to provide high-quality treatment for an ageing population.

    Neuroplasticity: Learning in Humans and Animals

    Human behaviour is not constant. It can change over time as a result of experience [33, 40, 42]. Similarly, synaptic transmission is not constant. Synaptic transmission can change over time as a result of activity and other events in the central nervous system (CNS). The speculation is that there are two different memory stores (neocortex and hippocampus) in the CNS that have complementary properties (Fig. 1). The neocortex is thought to be limitless. There is virtually no detectable memory storage limit and storage appears to be permanent, in the way that childhood memories can last for a lifetime. However, it is understood that the neocortex is a slow learning system, meaning it takes many repetitions for the neocortical system to learn something. By contrast, the hippocampus is thought to be a very rapid learning system where things are learnt after a single learning trial. However, it is believed that the hippocampus has a smaller capacity than the neocortex. It also has a temporary role: memory stored in the hippocampus [9, 20, 31, 37] exists for no longer than a week or two. During that time memories are consolidated in the hippocampus and the neocortex performs the rehearsal process. Once the neocortex completes the rehearsal, the hippocampus forgets that memory. This suggests that the hippocampus plays a time-dependent role in handling memory. Similar results come from animal studies. If an animal is taught a behavioural task that involves the hippocampus, a lesion will form in the hippocampus in direct response to that new memory. After one or two weeks the lesion will disappear and the memory will be lost. However, the memory will be transferred to the neocortex at that point for long-term recall. This evidence arises from the idea that memory has stages:

    Fig. 1. Cross-sectional view of human brain and hippocampus area. Anatomy of human brain, highlighting hippocampal and cortex regions. Picture is courtesy of Cerepus AS®, Norway

    1. Initial encoding stage; followed by
    2. Period of consolidation (i.e. period of lesion)

    This conclusion shows that the transfer of activity is both intra- and inter-modal and that, where there is a need for the brain to reorganise to adapt to new circumstances, this reorganisation is not necessarily confined to the understood maps of the homunculus brain [47]. The fact that this reorganisation occurs even in mature adult humans is a primary justification for neurorehabilitation following a disability [33].

    2 Need for Engineering Smart Body-Machine Interfaces

    The human body is capable of learning even after stroke or injury [41, 43, 44]. Presently, wheelchair users have to practise how to control their device before perfecting their technique [18, 19, 51]. In this case the responsibility of learning resides solely with the patient. This situation is difficult for a patient and contrary to the marvels of advances in robotics and neuroscience [15, 43]. The purpose of this book chapter is to modify the current situation by introducing the novel idea of smart body-machine interfaces [17, 34, 44, 50], capable of learning and understanding patients' residual degrees of freedom and control. The idea behind the research is to take machines towards the patients, rather than the patients towards the machines. In order to engineer a smart body-machine interface, we exploit two closely coupled concepts: the residual degrees of freedom of spinal cord injured patients to control the assistive device, and the ability of the brain to reorganise movements after disability.

    2.1 3D Immersive Virtual Reality System

    3D Virtual Reality System

    VR is inherently multidimensional [28]. As well as freedom of translation and rotation, in VR we can travel in scale and time [30]. Thus, the mental model of the environment we perceive changes as we travel in VR. We have developed a robotic VR training system, RIMS (Robotics Interactive Multisensory Simulation), for training stroke and spinal cord injured patients, using an immersive semi-cylindrical projection system (VISOR: Virtual and Interactive Simulation of Reality) in our Virtual Reality Systems (VRS) Laboratory. The system consists of three projectors which display the virtual world onto a 6-m wide semi-cylindrical screen canvas. The user is positioned slightly off centre towards the canvas to allow a 160° field of view (FOV).

    Precision Position Tracker (PPT)

    The WorldViz™ PPT system has been used as an alternative to the sensor shirt, although both sensing systems are connected to the VR system. The PPT consisted of four CCD cameras (Fig. 2). Two cameras were mounted on the projectors, as shown in Fig. 2, and two were mounted on top of the projection canvas screen. These CCD cameras are capable of tracking infrared emitting diodes (IREDs), as shown in Fig. 3. The displacements of the IREDs in Euclidean space were mapped to the two control signals of the robotic wheelchair. Accordingly, an IRED was attached to each shoulder of the participant and, by displacing the shoulders, the participant was able to control the 3D robotic wheelchair in VR.

    Fig. 2. The 160° spanned virtual reality projection laboratory: (a) view of the virtual reality lab in operation; (b) the 160° spanned 3D fully immersive virtual reality projection system; (c) precision position tracking system CCD cameras mounted over the projectors

    Fig. 3. A precision position tracking system for virtual reality environments: (a) battery-operated IRED sensors; (b) a graphical user interface to track the IRED sensors

    Fig. 4. Virtual robotics wheelchair

    3D Virtual Robotics Wheelchair

    For interoperability, extensibility, maintenance and reusability purposes, a modular design approach was taken, where each component had separate roles and responsibilities and well-defined interfaces to allow other components to access its functionality. The modular design approach provides a sustainable design where we could (re)use existing third-party components and swap components as required. A robotic wheelchair model (Fig. 4) was created using 3D Studio Max and integrated with a hospital-type environment built with the Coin3D libraries to generate corridors and pathways. The whole system was projected on the screen using the WorldViz™ Vizard interactive software.

    3D Virtual Robotic Wheelchair's Kinematics

    Fig. 5. (a) Virtual wheelchair kinematics model based upon unicycle robot. (b) Virtual wheelchair's 3D model created using Coin3D [4] libraries

    Fig. 6. Virtual wheelchair's position update

    A non-holonomic mobile robotic kinematics study was used [59] to simulate the robotic wheelchair's motion model. The wheelchair is considered as a unicycle-type robot [12] having an egocentric axis of rotation, i.e. its centre of gravity is taken at the point on the rear differential wheels, as shown in Figs. 5 and 6. The motion kinematic model is as follows: the wheelchair is modelled as a simple two-wheel robot [12, 59], as shown in Fig. 6. The kinematic equations of the wheelchair are

    $$
    \begin{aligned}
    \dot{x}(t) &= v(t)\cos(\theta(t)) \\
    \dot{y}(t) &= v(t)\sin(\theta(t)) \\
    \dot{\theta}(t) &= \omega(t).
    \end{aligned}
    \tag{2}
    $$

    The kinematic model of the wheelchair has two control inputs: the forward speed $v$ and the angular velocity $\omega$. Therefore, in discrete time, the laws of motion of the wheelchair are

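    $$
    \begin{aligned}
    x_{k+1} &= x_k + v_k\cos(\theta_k)\,\Delta t \\
    y_{k+1} &= y_k + v_k\sin(\theta_k)\,\Delta t \\
    \theta_{k+1} &= \theta_k + \omega_k\,\Delta t.
    \end{aligned}
    \tag{3}
    $$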
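    The discrete update (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `step` and `simulate`, the frame interval and the control values in the example are assumptions introduced here.

```python
import math


def step(x, y, theta, v, omega, dt):
    """Advance the unicycle pose by one frame using the discrete update (3)."""
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    theta_next = theta + omega * dt
    return x_next, y_next, theta_next


def simulate(pose, controls, dt):
    """Apply a sequence of (v, omega) pairs and return every visited pose."""
    poses = [pose]
    for v, omega in controls:
        pose = step(*pose, v, omega, dt)
        poses.append(pose)
    return poses


if __name__ == "__main__":
    # Hypothetical inputs: drive along x for ten frames, then turn on the spot.
    controls = [(1.0, 0.0)] * 10 + [(0.0, 0.5)] * 10
    path = simulate((0.0, 0.0, 0.0), controls, dt=0.1)
    print(path[-1])
```

    Because the heading is only updated after the position, each frame moves the chair along the heading it had at the start of that frame, which is the forward Euler discretisation of the continuous model.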

    The two control inputs are generated by processing algorithms applied to the shirt signals. The rotational and translational components of the speed are obtained by scaling two values, $V_r$ and $V_f$, derived from the subject's body motions. Accordingly, the virtual wheelchair moves from the actual point $(x_k, y_k)$ to $(x_{k+1}, y_{k+1})$ as represented in Fig. 5, in which

    $$
    \Delta S = v_k\,\Delta t = k_1 V_{f,k}\,\Delta t \tag{4}
    $$

    $$
    \Delta\theta = \omega_k\,\Delta t = k_2 V_{r,k}\,\Delta t, \tag{5}
    $$

    where $k_1$, $k_2$ are proportionality constants and $\Delta t$ is the time interval between two consecutive frames of the virtual reality.
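    As a hedged illustration of Eqs. (4) and (5), the sketch below scales two body-derived values into the per-frame travelled distance and heading change. The function names and all numerical values are hypothetical; the chapter does not state the values of the proportionality constants.

```python
def controls_from_body_signals(v_f, v_r, k1, k2):
    """Scale the body-derived values V_f and V_r into (v, omega)."""
    return k1 * v_f, k2 * v_r


def frame_increments(v_f, v_r, k1, k2, dt):
    """Return (delta_s, delta_theta) for one frame, as in Eqs. (4) and (5)."""
    v, omega = controls_from_body_signals(v_f, v_r, k1, k2)
    return v * dt, omega * dt


if __name__ == "__main__":
    # Hypothetical gains and frame interval, chosen only for illustration.
    delta_s, delta_theta = frame_increments(v_f=0.5, v_r=-0.2, k1=2.0, k2=1.0, dt=0.1)
    print(delta_s, delta_theta)
```

    Keeping the gains as explicit parameters makes it straightforward to retune the sensitivity of the chair for a particular user without touching the kinematic update.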

    2.2 Next Generation Sensor Shirt: Smart Garment

    To capture the residual mobility of the disabled patient, we have used a next generation smart garment, the Sensor Shirt. The sensor shirt (Fig. 7) has been realized by directly printing a conductive elastomer (CE) material (a commercial product provided by Wacker LTD [5]) on a lycra/cotton fabric previously covered by an adhesive mask. The mask adopted to realise the sensor shirt is shown in Fig. 8a and it is designed according to the desired sensor and connection topology. CE composites show piezoresistive [60, 64] properties when a deformation is applied [49]. CE materials can be applied to the fabric or to other flexible substrates. They can be employed as strain sensors [35, 36] and they represent an excellent trade-off between transduction properties and the possibility of integration in textiles. Quasi-statistical and dynamic sensor characterisation has been done in [35]. Dynamic CE sensors present peculiar characteristics, such as non-linearity in resistance to length transduction and large relaxation times [48, 64], which should be taken into account in the control formulation.

    Fig. 7. Sensor shirt: (a) front view of the wearable shirt laden with 52 sensors; (b) back view of the sensor-laden shirt

    Sensor Shirt Layout

    The sensor shirt is divided into six sections, as shown in Fig. 7 and Table 1. In each shirt section, sensors are connected in series and they are represented by the wider lines of Fig. 8. Connections between the sensors and the electronic acquisition unit are represented by the thinner lines of Fig. 8. Since connections are realised with the same material adopted for the sensors, they undergo an unknown and unpredictable change in electrical resistance when the user moves. For this reason the acquisition unit front-end has been designed to compensate for the connection resistance variations. The sensor series is supplied with a constant current I and the voltage drops across consecutive connections are acquired using high input impedance amplifiers (instrumentation amplifiers), following the methodology of [62]. Let us consider the example of sensor Sll_3 (the prototype electrical model and the acquisition strategy are shown in Fig. 8). This sensor is placed in the left wrist region of the shirt and it is represented by the light-blue line in Fig. 8. The connections to this sensor are represented in Fig. 8 by the two green lines. If the amplifier is connected between Cll_3 and Cll_4, only a small amount of current flows through the interconnections compared with the current that flows through Sll_3. In this way, if the current I is well dimensioned, the voltage read by the amplifier is almost equal to the voltage drop on the sensor, which is proportional to the sample resistance. In conclusion, a generic sensor consists of a segment of the bold track between two consecutive thin track intersections.

    Table 1. Sensor shirt layout

    Body part             Left side                      Right side
    Front shoulder (fs)   6 sensors, Slfs_1 to Slfs_6    6 sensors, Srfs_1 to Srfs_6
    Back shoulder (bs)    8 sensors, Slbs_1 to Slbs_8    8 sensors, Srbs_1 to Srbs_8
    Limb (l)              12 sensors, Sll_1 to Sll_12    12 sensors, Srl_1 to Srl_12
    Total sensors         26                             26

    Fig. 8. (a) The mask used for the sensor shirt realization, with sensors such as Srfs_6 and Slfs_6 labelled. The sensor Sll_3 placed on the left wrist (light blue line) and its connections (green lines) are pointed out. (b) Prototype (limb) electric model and acquisition strategy
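    The acquisition principle described in this subsection can be illustrated with a short sketch. It assumes an ideal amplifier that draws no current through the connection tracks, so that the voltage difference between two consecutive connections equals the drop across the sensor between them and Ohm's law gives the sensor resistance. The function name and the example values are hypothetical and are not taken from the chapter.

```python
def sensor_resistances(tap_voltages, current):
    """Estimate each series sensor's resistance from consecutive tap voltages.

    tap_voltages[i] is the potential read at connection i; sensor i lies
    between connections i and i + 1. An ideal amplifier is assumed, so the
    connection tracks carry no current and add no voltage drop.
    """
    if current <= 0:
        raise ValueError("supply current must be positive")
    return [
        (upper - lower) / current
        for upper, lower in zip(tap_voltages, tap_voltages[1:])
    ]


if __name__ == "__main__":
    # Hypothetical readings: 1 mA supply and four connection potentials in volts.
    print(sensor_resistances([3.0, 2.0, 0.5, 0.0], current=1e-3))
```

    Under this idealisation the unknown resistance of the connection tracks never enters the result, which is the reason for reading the voltages with high input impedance amplifiers.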

    Signal Acquisition

    Two customised electronic acquisition units (one for the left side and the other for the right side) were designed to acquire signals from the sensor shirt. Each unit consists of three signal generators (needed to supply the sensor series with the constant current), 32 instrumentation amplifiers (needed to read voltages across sensors) and a final stage for low-pass filtering of the signals. The analog signals acquired from the two units are digitised using a general purpose 64-channel acquisition card and processed in real time using a personal computer. Real-time signal processing has been performed using the xPC Target® toolbox of MATLAB®. The output of the signal processing stage, i.e. the wheelchair controls, is sent to the virtual wheelchair described in Sect. 2.1 over a User Datagram Protocol (UDP) connection.
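    To make the last step concrete, the following sketch shows one way a pair of wheelchair controls could be sent over UDP using Python's standard library. The chapter does not specify the packet layout, host or port, so the layout below (two little-endian 64-bit floats), the address and the function names are all assumptions made for illustration.

```python
import socket
import struct

# Hypothetical packet layout: two little-endian 64-bit floats (v, omega).
PACKET_FORMAT = "<dd"


def encode_controls(v, omega):
    """Pack the two control values into a 16-byte datagram payload."""
    return struct.pack(PACKET_FORMAT, v, omega)


def decode_controls(payload):
    """Unpack a payload produced by encode_controls back into (v, omega)."""
    v, omega = struct.unpack(PACKET_FORMAT, payload)
    return v, omega


def send_controls(sock, address, v, omega):
    """Send one control sample to the simulator; returns the bytes sent."""
    return sock.sendto(encode_controls(v, omega), address)


if __name__ == "__main__":
    # Hypothetical simulator address; UDP needs no connection set-up.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        send_controls(sock, ("127.0.0.1", 5005), 0.5, -0.1)
```

    UDP delivers each datagram independently and without retransmission, so a lost sample is simply superseded by the next one, which suits a control stream that is refreshed every frame.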

    3 A Virtual Reality Rehabilitation Testbed for Spinal Cord Injured Patients

    A testbed for rehabilitation purposes, especially for spinal cord injured (SCI) patients, has been designed with the help of a VR interactive system. The novelty of the testbed lies in the fact that we mapped [39] the redundant body signals, i.e. the leftover mobility of the SCI patients, to control the translational velocity ($v$) and angular velocity ($\omega$) of the wheelchair. Likewise it is also possible to derive the translational acceleration ($\dot{v}$) and angular acceleration ($\dot{\omega}$) from residual mobility to drive the virtual wheelchair inside the virtual reality scene. Two different approaches have been used to test the efficacy of the system and in both cases encouraging results were obtained.

    Virtual Navigation with Precision Position Tracking (PPT) System

    In PPT system navigation, we attached the sensors to the shoulders of the participant, as shown in Fig. 9. The participant was asked to move his shoulders forward and backward to calibrate the forward-backward and left-right movement of the virtual wheelchair. Once the sensors were calibrated, the participant was immersed in a virtual scene consisting of pathways, corridors and rooms. A dark line was painted on the floor for the participant to follow in the virtual scene. The participant was able to navigate in the environment reasonably well using arm and shoulder movements with minimal practice.
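    The chapter does not give the calibration formula, so the sketch below is only one plausible reading of the procedure: each shoulder's forward displacement is normalised between a recorded rest position and a recorded forward extreme, the average of the two drives the chair forward and their difference steers it. The function names, the dictionary keys and the sign convention for turning are assumptions.

```python
def normalise(value, rest, extreme):
    """Map a displacement to [-1, 1] relative to rest and a calibrated extreme."""
    span = extreme - rest
    if span == 0:
        raise ValueError("calibration range is empty")
    return max(-1.0, min(1.0, (value - rest) / span))


def shoulder_commands(left, right, calib):
    """Return (V_f, V_r) from the forward displacement of each shoulder marker."""
    l = normalise(left, calib["left_rest"], calib["left_max"])
    r = normalise(right, calib["right_rest"], calib["right_max"])
    v_f = (l + r) / 2.0  # both shoulders forward: drive forward
    v_r = (r - l) / 2.0  # right shoulder ahead of left: turn (assumed sign)
    return v_f, v_r


if __name__ == "__main__":
    # Hypothetical calibration values in centimetres.
    calib = {"left_rest": 0.0, "left_max": 10.0, "right_rest": 0.0, "right_max": 10.0}
    print(shoulder_commands(5.0, 5.0, calib))
```

    Normalising against each participant's own range means that a small comfortable movement by one user and a large movement by another can produce the same command.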

    Fig. 9. Navigating a virtual wheelchair simulator in 3D virtual reality with precision position tracking system: (a) PPT IRED sensors attached to the participant's shoulder; (b) participant controlling the robotic wheelchair through shoulder movements; (c) the control strategy of a real wheelchair is adopted to control the virtual wheelchair in 3D VR; (d) trajectory obtained from the participant's data while training in virtual reality

    Virtual Navigation with Novel Sensor Shirt

    The sensor shirt, shown in Fig. 10, was worn by the participant. The leftover body mobility signals were captured by the acquisition system. The participant was asked to make comfortable body movements using their available mobility range. These comfortable body postures were then mapped into the control signals of the wheelchair, i.e. forward-backward and left-right positions. Again the participant was fully immersed in a 3D VR scene similar to the one described in the first experiment. After minimal practice, the participant was able to comfortably control the wheelchair. Figures 10 and 11 show the path made by the participant, after little practice, over the marked line of the virtual reality scene.

    Fig. 10. Navigation in virtual reality while wearing sensor shirt: (a) remapping of mobility and control strategy (rest position at t = 0, with left elbow, right shoulder, left shoulder and right elbow movements at later times); (b) sensor shirt control in virtual reality; (c) redundant body signals of the two limbs (right limb and left limb, plotted against time in seconds)

    Fig. 11. Navigating a virtual wheelchair simulator in 3D virtual reality with precision position tracking system: (a) virtual reality environment; (b) top view of the trajectory in virtual reality; (c) trajectory obtained from the participant's data in virtual reality in the first few trials (subject trajectory and path, mean error = 4.7981); (d) trajectory data of the participant after the learning phase (subject trajectory and path, mean error = 1.847)
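    Figure 11 reports a mean error between the subject's trajectory and the marked path, but the chapter does not define how that error is computed. The sketch below shows one common choice, assumed here purely for illustration: the average distance from each trajectory sample to the nearest point on a piecewise-linear path. The function names and sample data are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment joining a and b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def mean_path_error(trajectory, path):
    """Average distance from each trajectory sample to the nearest path segment."""
    if not trajectory or len(path) < 2:
        raise ValueError("need at least one sample and two path vertices")
    total = 0.0
    for p in trajectory:
        total += min(point_segment_distance(p, a, b) for a, b in zip(path, path[1:]))
    return total / len(trajectory)


if __name__ == "__main__":
    # Hypothetical straight path and three trajectory samples.
    print(mean_path_error([(0.0, 1.0), (5.0, 2.0), (10.0, 3.0)], [(0.0, 0.0), (10.0, 0.0)]))
```

    A metric of this kind decreases as the driven trajectory hugs the marked line more closely, which is the trend the two reported values suggest between the early trials and the trials after the learning phase.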

    4 Conclusion

    The amalgamation of robotics technology, intelligent interfaces and 3D immersive virtual reality may lead to the development of a whole new approach to the design of assistive devices. This approach is based on the key concept that the burden of learning to control such devices should not fall entirely on the patient. The field of multimedia and machine learning has been developing rapidly in the recent decade and is now sufficiently mature to design interfaces that are capable of learning the user as the user is learning to operate the device. In this case, "learning the user" means learning the degrees of freedom that the patient is capable of moving most efficiently and mapping these degrees of freedom to wheelchair movements. We should stress that such a mapping cannot be static, because in some cases the patients will eventually improve with practice. In other, more unfortunate cases, a disability may progressively degenerate and the patient's mobility may deteriorate as a result. We have applied and tested the rehabilitation process in virtual reality via on-board and off-board sensing. The mapping of body movements in virtual reality in an opportunistic way has been shown in [13, 24, 25] and we intend to apply the same techniques in our future experiments. Our approach takes technology towards the patient rather than the patient towards the technology.


    This research has been approved by the ethical committee of Macquarie University, Sydney, Australia, under the humans research ethical act of New South Wales, Australia, in approval letter No. HE23FEB2007-D05008 titled "Personal Augmented Reality and Immersive System based Body machine Interface (PARIS based BMI)".


    References

    1. Australian Bureau of Statistics (ABS). 1998 disability, ageing and carers, Australia: Confidentialised unit record file. Technical paper. Canberra: ABS, 1999.

    2. Australian Bureau of Statistics (ABS). Population projections Australia: 1999 to 2101. Canberra: ABS, 2000 (Catalogue No. 3222.0).

    3. Australian Institute of Health and Welfare (AIHW). Disability and ageing: Australian population patterns and implications. Canberra: AIHW, 2000 (AIHW Catalogue No. DIS 19).

    4. Coin 3d graphics library. www.coin3d.org.

    5. Elastosil lr3162. www.wacker.com.

    6. World Health Organisation. World report on disability and rehabilitation. Concept Paper, World Health Organisation, 2006.

    7. C. Baker, J.B. Tenenbaum, and R.R. Saxe. Bayesian models of human action understanding. Advances in Neural Information Processing Systems, 18, 2006.

    8. G.A. Barnard and Thomas Bayes. Studies in the history of probability and statistics: IX. Thomas Bayes's essay towards solving a problem in the doctrine of chances. Biometrika, 45:293–315, 1958.

    9. T.V.P. Bliss and G.L. Collingridge. A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361:31–39.

    10. D.J. Brown, S.J. Kerr, and V. Bayon. The development of the virtual city: A user centered approach. In 2nd European Conference on Disability, Virtual Reality and Associated Techniques, Mount Billingen, Skovde, Sweden, September 1998.

    11. A. Buckert-Donelson. Heads-up products: Virtual worlds ease dental patients. VR World, 3:9–16, 1995.

    12. K. ByungMoon and T. Panagiotis. Controllers for unicycle-type wheeled robots: Some theoretical results and experimental validation. IEEE Transactions on Robotics and Automation, 18(3):294–307, 2002.

    13. S. Challa, T. Gulrez, Z. Chazcko, and T. Paranesha. Opportunistic information fusion: A new paradigm for next generation networked sensing systems. In 8th IEEE International Conference on Information Fusion, Philadelphia, USA, 2005.

    14. C. Christiansen, B. Abreu, K. Ottenbacher, K. Huffman, B. Masel, and R. Culpepper. Task performance in virtual environments used for cognitive rehabilitation after traumatic brain injury. Archives of Physical Medicine and Rehabilitation, 79:888–892, 1998.

    15. M.E. Clynes and N.S. Kline. Cyborgs and space. Astronautics, American Rocket Society, 14:26–27, 1960.

    16. M.A. Conditt and F.A. Mussa-Ivaldi. Central representation of time during motor learning. Philosophical Transactions of Royal Society of London, 96:11625–11630, 1999.

    17. J.P. Donoghue. Connecting cortex to machines: Recent advances in brain interfaces. Nature Neuroscience Reviews, 5:1085–1088, 2002.

    18. L. Fehr, W.E. Langbein, and S.B. Skaar. Adequacy of power wheelchair control interfaces for persons with severe disabilities: A clinical survey. Journal of Rehabilitation Research and Development, 37:353–360, 2000.

    19. C.C. Flynn and C.M. Clark. Rehabilitation technology: Assessment practices in vocational agencies. Assistive Technology, 7:111–118, 1995.

    20. T.F. Freund and G. Buzsaki. Interneurons of the hippocampus. Hippocampus, 6:347–470, 1958.

    21. F. Gandolfo, F.A. Mussa-Ivaldi, and E. Bizzi. Motor learning by field approximation. Proceedings of National Academy of Sciences USA, 93:3843–3846, 1996.

    22. C.L. Giles, D. Cameron, and M. Crotty. Disability in older Australians: Projections for 2006–2031. Medical Journal of Australia, 179:130–133, 2003.

    23. T.L. Griffiths and J.B. Tenenbaum. Statistics and the Bayesian mind. Significance, 3(3):130–133, 2006.

    24. T. Gulrez and S. Challa. Sensor relevance validation for autonomous mobile robot navigation. In IEEE Conference on Robotics Automation and Mechatronics (RAM), Bangkok, Thailand, June 7–9, 2006.

    25. T. Gulrez, S. Challa, T. Yaqub, and J. Katupitiya. Relevant opportunistic information extraction scheduling in heterogeneous sensor networks. In 1st IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Mexico-City, 2005.

    26. H.G. Hoffman, J.N. Doctor, D.R. Patterson, G.J. Carrougher, and T.A.I. Furness. Use of virtual reality for adjunctive treatment of adolescent burn pain during wound care: A case report. Pain, 85:305–309, 2000.

    27. N. Hogan, H. Krebs, J. Charnnarong, P. Srikrishna, and A. Sharon. MIT-MANUS: A workstation for manual therapy and training II. In SPIE Conf. Telemanipulator Technologies, pages 28–34, 1992.

    28. A. Johnson, D. Sandin, G. Dawe, Z. Qiu, and D. Plepys. Developing the PARIS: Using the CAVE to prototype a new VR display. In Proceedings of IPT 2000, Ames, Iowa, USA, June 2000.

    29. K. Kording and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.

    30. M. Kavakli and M. Lloyd. Spaceengine: A seamless simulation system for virtual presence in space. In Innovations in Intelligent Systems and Applications, IEEE Computational Intelligence Society, pages 231–233, Yildiz Technical University, Istanbul, Turkey, 2005.

    31. J. O'Keefe and L. Nadel. The Hippocampus as a Cognitive Map. Oxford University Press, New York, 1978.

    32. K. Kording and D. Wolpert. Bayesian decision theory in sensorimotor control. Review, Trends in Cognitive Science, 10:319–326, 2006.

    33. J.W. Krakauer and R. Shadmehr. Consolidation of motor memory. Review, Trends in Neuroscience, 29:58–64, 2006.

    34. A. Kubler. Brain computer communication: Unlocking the locked. Psychology Bulletin, 127:358–375, 2001.

    35. F. Lorussi, W. Rocchia, E.P. Scilingo, A. Tognetti, and D. De Rossi. Wearable redundant fabric-based sensors arrays for reconstruction of body segment posture. IEEE Sensors Journal, 4(6):807–818, 2004.

    36. F. Lorussi, E.P. Scilingo, M. Tesconi, A. Tognetti, and D. De Rossi. Strain sensing fabric for hand posture and gesture monitoring. IEEE Transactions on Information Technology in Biomedicine, 9(3):372–381, 2005.

    37. J.L. McClelland, B.L. McNaughton, and R.C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102:419–457, 1995.

    38. Medical, Readiness, and Trainer-Team. Immersive virtual reality platform for medical training: A killer-application. In Medicine Meets Virtual Reality 2000, pages 207–213, Burke, Virginia, USA, 2000.

    39. C. Mercier, K. Reilly, C. Vargas, A. Aballea, and A. Sirigu. Mapping phantom movement representations in the motor cortex of amputees. Brain, 129:2202–2210, 2006.

    40. F.A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combination of primitives. Philosophical Transactions of Royal Society of London, 355:1755–1769, 2000.

    41. F.A. Mussa-Ivaldi, A. Fishbach, T. Gulrez, A. Tognetti, and D. De Rossi. Remapping the residual motor space of spinal-cord injured patients for the control of assistive devices. In Neuroscience 2006, Atlanta, GA, USA, October 14–18, 2006.

    42. F.A. Mussa-Ivaldi, N. Hogan, and E. Bizzi. Neural, mechanical, and geometric factors subserving arm posture in humans. Journal of Neuroscience, 5:2732–2743, 1985.

    43. F.A. Mussa-Ivaldi and L.E. Miller. Brain machine interfaces: Computational demands and clinical needs meet basic neuroscience. Review, Trends in Neuroscience, 26:329–334, 2003.

    44. F.A. Mussa-Ivaldi and S. Solla. Neural primitives for motion control. IEEE Journal of Oceanic Engineering, 29:640–650, 2004.

    45. M. Oshuga, F. Tatsuno, K. Shimono, K. Hirasawa, H. Oyama, and H. Okamura. Development of a bedside wellness system. Cyberpsychology and Behavior, 1:105–111, 1998.

    46. J. Patton and F. Mussa-Ivaldi. Robotic teaching by exploiting the nervous system's adaptive mechanisms. In 7th International Conference on Rehabilitation Robotics (ICORR), Evry, France, 2001.

    47. W. Penfield and T. Rasmussen. The Cerebral Cortex of Man: A Clinical Study of Localisation of Function. Macmillan, New York, 1950.

    48. Wang Peng, Xu Feng, Ding Tianhuai, and Qin Yuanzhen. Time dependence of electrical resistivity under uniaxial pressures for carbon black/polymer composites. Journal of Materials Science, 39(15), 2004.

    49. Wang Peng, Ding Tianhuai, Xu Feng, and Qin Yuanzhen. Piezoresistivity of conductive composites filled by carbon black particles. Acta Materiae Compositae Sinica, 21(6):34–38, 2004.

    50. B.E. Pfingst. Neural Prostheses for Restoration of Sensory and Motor Function, J.K. Chapin, K.A. Moxon (eds.). CRC, Boca Raton, 2000.

    51. R.G. Platts and M.H. Fraser. Assistive technology in the rehabilitation of patients with high spinal cord injury lesions. Paraplegia, 31:280–287, 1993.

    52. M.W. Post, F.W. van Asbeck, A.J. van Dijk, and A.J. Schrijvers. Spinal cord injury rehabilitation: 3 functional outcomes. Archives of Physical Medicine and Rehabilitation, 87:59–64, 1997.

    53. L. Pugnetti, L. Mendozzi, E. Barbieri, F. Rose, and E. Attree. Nervous system correlates of virtual reality experience. In First European Conference on Disability, Virtual Reality and Associated Technology, pages 239–246, Maidenhead, UK: The University of Reading, July 1996.

    54. G. Riva and L. Melis. Virtual reality for the treatment of body image disturbance. In Virtual Reality in Neuro-Psycho-Physiology, Amsterdam, 1997.

    55. B.O. Rothbaum, L.F. Hodges, R. Alarcon, D. Ready, F. Shahar, K. Graap, J. Pair, P. Hebert, D. Gotz, B. Wills, and D. Baltzell. Virtual reality exposure therapy for PTSD Vietnam veterans: A case study. Journal of Traumatic Stress, 12:263–272, 1999.

    56. B.O. Rothbaum, L.F. Hodges, and R. Kooper. Virtual reality exposure therapy. Journal of Psychotherapy Practice and Research, 6:291–296, 1997.

    57. R.M. Satava. Medical virtual reality: The current status of the future. In The Medicine Meets Virtual Reality 4th Conference, pages 100–106, Berlin, Germany, Sept 1996.

    58. R. Shadmehr and F.A. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14(5):3208–3224, 1994.

    59. S. Takezawa, T. Gulrez, C.D. Herath, and W.M. Dissanayake. Environmental recognition for autonomous robot using SLAM. Real time path planning with dynamical localised Voronoi division. International Journal of Japan Society of Mechanical Engineering (JSME), 3:904–911, 2005.

    60. M. Taya, W.J. Kim, and K. Ono. Piezoresistivity of a short fiber/elastomer matrix composite. Mechanics of Materials, 28(3):53–59, 1998.

    61. J.B. Tenenbaum and T.L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.

    62. A. Tognetti, F. Lorussi, R. Bartalesi, S. Quaglini, M. Tesconi, G. Zupone, and D. De Rossi. Wearable kinesthetic system for capturing and classifying upper limb gesture in post-stroke rehabilitation. Journal of NeuroEngineering and Rehabilitation, 2(8), 2005.


    63. D.M. Wolpert, Z. Ghahramani, and M.I. Jordan. An internal model for sensorimotor integration. Science, 269:1880-1882, 1995.

    64. XiangWu Zhang, Yi Pan, Qiang Zheng, and XiaoSu Yi. Time dependence of piezoresistance for the conductor-filled polymer composites. Journal of Polymer Science, 38(21), 2000.

  • Modelling Interactive Non-Linear Stories

    Fabio Zambetta

    School of CS&IT, RMIT University, GPO Box 2476V, Melbourne VIC 3001, Australia
    fabio.zambetta@rmit.edu.au

    Summary. A CRPG (Computer Role Playing Game) is a video game where participants assume the roles of fictional characters and collaboratively create or immerse themselves in stories. Designers of such games usually create complex epic plots for their players to experience, but formal modelling techniques to shape and navigate interactive stories are not currently adopted. In this chapter we build the case for a story-driven approach to the design of CRPGs exploiting a mathematical model of political balance and conflict, and scripting based on fuzzy logic. Our model differs from a standard HCP (Hybrid Control Process) in its use of fuzzy logic (or fuzzy state machines) to handle events, while an ODE (Ordinary Differential Equation) is needed to generate a continuous level of conflict over time. Ultimately, using this approach not only can game designers express gameplay properties formally using quasi-natural language, but they can also propose a diverse role-playing experience to their players. The interactive game stories designed with this methodology can change under the pressure of a variable political balance, and propose a different and innovative gameplay style.

    1 Introduction

    Computer Role Playing Games allow participants to assume the roles of fictional characters and collaboratively create or unravel complex storylines. Game developers usually face a complex task, which consists in providing players with a compelling and coherent storyline that exists in an interactive virtual environment. Unfortunately, storylines have traditionally been both linear and deterministic in media such as books and movies, whereas interactive environments are intrinsically non-linear and non-deterministic. This fundamental oxymoron has fuelled considerable interest, leading researchers and practitioners in the videogames area to coin the term interactive storytelling [6]. Formal modelling techniques to shape and design interactive stories are not currently widespread, which also partially contributes to the exorbitant budgets required to build a CRPG.

    F. Zambetta: Modelling Interactive Non-Linear Stories, Studies in Computational Intelligence (SCI) 96, 119-138 (2008)
    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008


    The objective of our research lies on one hand in exploiting dynamical models to lead to a more formal design process for (story-driven) games [13], and on the other in improving the current approaches to interactive storytelling. It is generally assumed that stories and drama are generated by conflict, as detailed by Aristotle a very long time ago [11]. Therefore, we envisage an extension to story-driven games, where not only can players influence the game story, but also the story itself can change under the pressure of political balance. Our work is rooted in Richardson's dynamical model of Arms Race [17], devised to analyze the causes of international conflicts and initially applied by Richardson to the World War I scenario. Our modification of Richardson's model brings numerous improvements over the standard faction models currently used in RPGs.

    First and foremost, such an approach makes more options available to RPG designers, enabling the creation of different types of stories that integrate political considerations in the plot itself and extend the usual story-driven approaches. By simply varying the basic parameters of the core model, many scenarios can be created that correspond to different political status quos (e.g., tense relationships, truce, initial friendly relations, etc.), which can in turn evolve over time into different types of equilibria. Secondly, players' choices will impact the in-game political balance, but at the same time the plot will evolve under the pressure of political events, giving rise to a novel gameplay style. The scenario we are working on has been dubbed Two Families. Players take the side of one of two influential families in the fight for supremacy in a fictional city, and decide whether they want to further their faction's political agenda or act as a maverick, thus contributing to altering the political balance.

    The remainder of the chapter is organized as follows: Sect. 2 describes some related work; Sect. 3 introduces dynamical models, and describes the modified Richardson's model used to compute a political balance among factions. Section 4 details the most relevant scenarios of use for the model; Sect. 5 introduces our prototype and the results obtained so far, while Sect. 6 outlines our future work.

    2 Related Work

    The games industry has been quite resistant to the use of advanced intelligent techniques, relegating the game AI to efficient but non-scalable computing devices such as FSMs (Finite State Machines) or RBSs (Rule Based Systems). The reasons for this are essentially twofold: On one hand, before the introduction of GPUs (Graphics Processing Units), most of the CPU time in an application was devoted to graphics and rendering. On the other hand, a common misconception in the games industry has been that learning and adaptation of characters to their worlds could lead to chaotic and unpredictable behaviours [9]. Fortunately, with the introduction of GPUs and multi-core CPUs the game AI can play a bigger role in the game development process. Moreover, a few game development and research teams have started to demonstrate that the latter argument is unfounded.

    The work of the Synthetic Characters Lab at MIT [5] represents a fundamental step forward towards a practical and stable approach to real-time learning for virtual characters modelling animals, especially dogs and sheep. In their formulation, characters' action selection is driven by a set of rules called ActionTuples, which are comprised of different parts and tend to generalize well-known approaches to the solution of the reinforcement learning problem such as TD($\lambda$) or Q-learning [20]. The TriggerContext indicates external conditions that must be met in order for the ActionTuple to be activated; the Action represents what the creature should do if the ActionTuple is active; the ObjectContext describes on what objects the Action can be applied; the doUntilContext describes the conditions that cause the ActionTuple to deactivate; the Results slot contains Predictors, which try to estimate (within a confidence level) what event will occur next; the Intrinsic Value is a multi-dimensional value describing the ActionTuple's perceived effect on the creature's internal drives. Although the ActionTuple can create some interesting behaviour for a synthetic animal and provide a basic form of adaptiveness, it cannot learn how to satisfy high-level goals. For example, a synthetic dog cannot learn the shepherd's intention to move the sheep south.

    Lionhead Studios' Black & White [9], a so-called god-game, features intelligent learning creatures as an integral part of the storyline to be unravelled in the game. The player takes the role of a divinity who can train his creature, his emissary in the world, to perform tasks on his behalf and ultimately expand his community of worshippers. The creatures use a classic BDI (Belief, Desire, Intention) architecture [4], augmented in an ingenious way: The creature can have Opinions about what objects are most suitable for satisfying different desires. These Opinions are implemented as decision trees, whereas desires are implemented via perceptrons, each with a number of different desire sources. Creatures then deliberate about the most important goal and the most important type of object to act on. The crucial innovation introduced by Black & White is the use of heterogeneous sources of learning: Not only do creatures employ reinforcement learning by means of direct feedback (the player rewards or punishes them), but they also learn by example. Creatures first observe the action performed by the player, make a guess at his goals, and then construct a belief about the object he was acting on: This way the creature can escape local minima it would probably incur otherwise (i.e., it can avoid a rigid behavioural routine that would lead to stagnation or premature standardization). There are two limitations to this creature architecture: Creatures can plan at the goal (high) level, but once a goal has been chosen along with a suitable object, the appropriate action for satisfying that goal is found in a precomputed plan library and no dynamic planning can be exploited to fulfil the goal in an alternative way. On top of that, only 40 desires are available to the agent and no mechanism is provided to construct new ones.


    A considerable number of commercial video games have started to use fuzzy logic instead, most notably best-selling titles such as The Sims or the strategic game Civilization: Its main use is the real-time control of autonomous agents (The Sims) or strategic decision making (Civilization and other similar titles). The appeal of fuzzy logic mainly stems from its human-readable form, which makes it a perfect candidate for manipulation by non-technical folk such as game designers. Moreover, fuzzy logic may be combined with an FSM, whose use is ubiquitous in the games industry: FuFSMs, or Fuzzy Finite State Machines [12], have in fact been used in the already cited The Sims.

    Story-driven games [7] (the focus of our contribution) have not risen to the challenge of integrating intelligent techniques, the main reason being the largely adopted gameplay style. These games tend to convey experiences to players that depend on very deep (but linear) pieces of digital narrative: In such conditions most of the interaction can be engineered via scripting functions implemented in a scripting language of choice. However, such solutions are not scalable and lead to extremely high production costs: A switch to advanced intelligent techniques will benefit the games industry as a whole. The most relevant example in this area is Facade, an experiment in interactive drama [14]. The player plays the character of a friend of Grace and Trip, an attractive and successful couple in their thirties. At an evening get-together at their apartment, the player will witness the high-conflict dissolution of Grace and Trip's marriage. Players may contribute their own actions to change the course of the couple's lives and how the whole drama unfolds. Facade shares the same motivations as our work, i.e., finding a middle ground between very structured and constrained game narrative, and sandbox (or strategic) games where agency is of paramount importance but the emerging narrative can seldom qualify as such (e.g., it is not dramatic). Their means of achieving this goal differ quite substantially from ours, as does the scenario chosen to showcase their technical infrastructure. Because their scenario is primarily based around conversational agents, they implemented ABL (A Behaviour Language) and NLU (Natural Language Understanding), which are languages for language planning and understanding. The most interesting component in their architecture is a drama manager which rearranges elementary components of narrative named beats (dubbed scenes in our approach) to achieve a dramatic effect. Our approach relies on the use of fuzzy rule sets to manage conflict, which is intrinsically tied to drama [11]. Ultimately, both approaches aim at recombining chunks of narrative in a way that preserves the dramatic significance of the storyline, but the means to achieve this are different. However, our use of fuzzy logic may render the game designer's job easier due to the use of quasi-natural language clauses; also, combining an HCP with a fuzzy rule base allows us to include more gameplay open-endedness. The use of an HCP can in fact be generalized and used to manage a potentially infinite number of gameplay features.


    3 Improving Richardson's Model of Arms Race

    Richardson's Arms Race model was developed by Lewis Fry Richardson to predict whether an arms race between two alliances was to become a prelude to a conflict. The original model consists of a system of two linear differential equations, but it can be easily generalized to a multi-dimensional case [17]. Richardson's assumptions about the model are given below:

    - Arms tend to accumulate because of mutual fear.
    - A society will generally oppose a constant increase in arms expenditures.
    - There are factors independent of expenditures which conduce to the proliferation of arms.

    The actual equations describing this intended behaviour are given as

    \[
    \begin{aligned}
    \dot{x} &= ky - ax + g, \\
    \dot{y} &= lx - by + h.
    \end{aligned}
    \tag{1}
    \]

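    The values of x and y indicate the accumulation of arms for each nation. Clearly, we can also rewrite the equations in matrix form, yielding, with proper substitutions:

    \[
    \dot{z} = Az + r, \tag{2}
    \]

    \[
    A = \begin{pmatrix} -a & k \\ l & -b \end{pmatrix}, \quad
    z = \begin{pmatrix} x \\ y \end{pmatrix}, \quad \text{and} \quad
    r = \begin{pmatrix} g \\ h \end{pmatrix}. \tag{3}
    \]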

    The solutions of the system of linear ODEs (Ordinary Differential Equations) [2] do not depend much on the values of the constants, but rather on their relative magnitude and on the signs of g and h, which represent in Richardson's view the grievance terms. The constants k and l are named fear constants (mutual fear), a and b are the restraint constants (internal opposition against arms expenditures), and as already mentioned, g and h are the grievance terms (independent factors, which can be interpreted as grievance against rivals). Note that only g and h are allowed to assume negative values. When analyzing the model, one will need to take into account the optimal lines (where the first derivatives of x and y equal 0), the equilibrium point P* = (x*, y*) where the optimal lines intersect, and the dividing line L* for cases where equilibrium depends on the starting point. Trajectories heading towards positive infinity are said to be going towards an unlimited armament or a runaway arms race, whereas the ones going towards negative infinity are said to be going towards disarmament. There are two general cases that can occur in practice, under the general assumption that $\det A \neq 0$:

    - All trajectories approach a stable point (stable equilibrium, see Fig. 1a).
    - Trajectories depend on the initial point: They can either drift towards positive/negative infinity or approach a stable point if they start on the dividing line (unstable equilibrium, see Fig. 1b).

    Fig. 1. Possible equilibria for the system. (a) The system trajectories converge to an equilibrium point. (b) The system trajectories depend on the initial point, and can lead to different outcomes. The dividing line is also depicted.

    If ab > kl, we will achieve a stable equilibrium: An equilibrium point is considered stable (for the sake of simplicity we will consider asymptotic stability only) if the system always returns to it after small disturbances. If ab < kl, we will achieve an unstable equilibrium: The system moves away from the equilibrium after small disturbances. We will show that a modified version of the model can produce alternating phases of stability and instability, yielding variable and quantifiable results: This can give rise to a richer simulation of faction dynamics, as alliances can be broken and conflict can cease temporarily, or war can even be declared on a permanent basis.
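    The equilibrium point and the stability condition discussed above can be checked numerically. The following is a minimal sketch rather than the chapter's implementation: the function names `derivatives`, `equilibrium` and `is_stable` and the sample parameter values are assumptions introduced for illustration. The closed form for P* comes from solving Az + r = 0 in (2), which gives x* = (bg + kh)/(ab - kl) and y* = (lg + ah)/(ab - kl).

```python
def derivatives(x, y, k, l, a, b, g, h):
    """Right-hand side of Richardson's system (1)."""
    return k * y - a * x + g, l * x - b * y + h


def equilibrium(k, l, a, b, g, h):
    """Equilibrium point P* = (x*, y*), i.e. the solution of Az + r = 0."""
    det = a * b - k * l  # determinant of A
    if det == 0:
        raise ValueError("det A = 0: the optimal lines do not meet in one point")
    return (b * g + k * h) / det, (l * g + a * h) / det


def is_stable(k, l, a, b):
    """Stable equilibrium when ab > kl, unstable when ab < kl."""
    return a * b > k * l


# Illustrative (hypothetical) parameters: restraint outweighs mutual fear.
params = dict(k=1.0, l=1.0, a=2.0, b=2.0, g=3.0, h=3.0)
x_star, y_star = equilibrium(**params)
print(x_star, y_star, is_stable(1.0, 1.0, 2.0, 2.0))
```

    Evaluating `derivatives` at the returned point should give zero for both components, which is a quick sanity check on any parameter set a designer picks.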

    Our investigation is aimed at refining Richardson's model for use in a CRPG, and has involved three steps: Reinterpreting the model semantics to fit our intended game context, modifying the model to produce a satisfactory representation of interaction among factions, and finally converting the model output to the input used by a classic CRPG faction system (in our case the Neverwinter Nights 1 or 2 faction system).

    3.1 Reinterpreting Richardson's Model Semantics

    Even though the model created by Richardson is a viable approach to controlling overall faction behaviour in games, the model was designed with a very coarse level of granularity in mind. Whilst Richardson was interested in the very high level picture of the reasons behind a conflict, our goal is to give designers the freedom to change a game's story over time. Hence, we started our analysis by naming two factions X and Y, and by reinterpreting x and y as the (non-negative) levels of cooperation of factions X and Y respectively. We also reinterpreted the parameters of the model as listed in Table 1. The meaning of the parameters is not very different in our version of the model, but increasing values will lead to cooperation instead of conflict.

    This change aligns the system with the convention used by the NWN 2 faction system. The level of cooperation of each faction will lead either to a stable equilibrium point P* that yields a steady state of neutrality, or to an unstable equilibrium that will drive the system towards increasing levels of competition/cooperation (decreasing cooperation indicates competition). Without loss of generality, we will concentrate on the restricted context of unstable equilibrium: Richardson's model will be modified in order to obtain rich behaviour, and at the same time cater for the interactive scenarios found in modern videogames. Also, we will assume that g and h are negative (indicating that the two factions harbour resentment towards each other).

    Table 1. The reinterpreted parameters' semantics

    Parameter   Semantics
    k           Faction X belligerence factor
    l           Faction Y belligerence factor
    a           Faction X pacifism factor
    b           Faction Y pacifism factor
    g           Friendliness of X towards Y
    h           Friendliness of Y towards X

    3.2 Modifying Richardson's Model

    The standard formulation of Richardson's model in the unstable equilibrium case implies that the final state of the system will be dictated by the initial conditions of the system. The initial condition of the system, a point P in the cooperation plane depicted in Fig. 1a,b, will be such that:

    - If P lies in the half-plane above the dividing line L*, then the system will be driven towards infinite cooperation.
    - If P lies in the half-plane below the dividing line L*, then the system will be driven towards infinite competition.
    - If P lies on the dividing line L*, then the system will be driven towards a stable condition of neutrality.

    The problem with this model is that it is uninteresting in an interactive scenario, even though it apparently contains all the main ingredients required to produce rich behaviour: Once an application starts approximating the solution of the model from its initial condition via an ODE solver [2], the solution will be stubbornly uniform and lead to a single outcome in any given run (one of the three listed above, depending on the initial position of P). To cater for scenarios where PCs (Playing Characters) and NPCs (Non-Playing Characters) interact with each other in the game world, we developed a stop-and-go version of Richardson's model: The solution of the system will be initially computed by our ODE solver until an external event is generated in-game. When that happens, the parameters of the model listed in Table 1 are recomputed, leading to a possible change in the equilibrium of the system: The way parameters are changed allows for the possibility of moving the dividing line L*, thus altering the direction of motion of the current system trajectory. Recalling (3) we have

    \[
    A_{new} = \lambda A_{old}, \quad \lambda > 0 . \tag{4}
    \]

    Now we want to see how scaling A will influence the equilibrium of the system. To do so, let's first compute the equation of L*, which is the locus of points where both the derivatives in our system will go to zero. The equation of L* will result in

    \[
    \begin{aligned}
    \dot{x} + \dot{y} &= (ky - ax + g) + (lx - by + h) \\
                      &= (l - a)x + (k - b)y + (g + h) \\
                      &= 0 .
    \end{aligned}
    \tag{5}
    \]

    The effect of scaling A by $\lambda$ will yield

    \[
    \dot{x} + \dot{y} = \lambda (l - a)x + \lambda (k - b)y + (g + h) = 0 . \tag{6}
    \]

    Thus, we will finally have

    \[
    (l - a)x + (k - b)y + \frac{g + h}{\lambda} = 0 .
    \]


    Fig. 2. Effect of scaling A

    Three distinct cases will be possible then:

    - $0 < \lambda < 1$: L* is moved into its original upper half-plane, giving rise to a possible decrease in cooperation.
    - $\lambda = 1$: The scale factor does not change A (there is no practical use for this case, though).
    - $\lambda > 1$: L* is moved into its original lower half-plane, giving rise to a possible increase in cooperation.
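    As a worked illustration of the first case, take the hypothetical values k = l = 2, a = b = 1 and g = h = -1 (unstable, since ab < kl), together with a scale factor of 1/2. These numbers are assumptions chosen only to keep the arithmetic simple.

```latex
\[
\begin{aligned}
\text{before scaling:}\quad & (l-a)x + (k-b)y + (g+h) = x + y - 2 = 0, \\
\text{after scaling:}\quad  & (l-a)x + (k-b)y + \frac{g+h}{\lambda} = x + y - 4 = 0 .
\end{aligned}
\]
```

    A point such as (1.5, 1.5) satisfies x + y - 2 > 0 but x + y - 4 < 0: it lay above the old dividing line and lies below the new one, so cooperation stops increasing and starts to decrease.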

    To test these claims, the reader needs only to take a look at Fig. 2, where the case $0 < \lambda < 1$ is depicted. The dividing line is initially L1, and the point describing the trajectory of the system is P: The ODE solver generates increasing values of cooperation, stopping at P1 because an external event has just occurred. At this stage, A gets scaled and as a result the new dividing line becomes L2: The new dividing line brings P1 into the lower half-plane, leading to decreasing values of cooperation (increasing competition). Generalizing the considerations inferred from this last example, suppose that initially $L_1 \cdot P > 0$ (increasing cooperation) and that $0 < \lambda < 1$. Then we will have three alternatives when an external event occurs:

    - $L_2 \cdot P_1 > 0$: The level of cooperation keeps on increasing.
    - $L_2 \cdot P_1 < 0$: The level of cooperation starts to decrease.
    - $L_2 \cdot P_1 = 0$: The level of cooperation will move towards a stable value.

    Clearly, if $L_1 \cdot P > 0$ and $\lambda > 1$ then $L_2 \cdot P_1 > 0$. Similar conclusions can be drawn in the case $L_1 \cdot P < 0$.

    Hence, any application using our model will need to provide a set (or a hierarchy) of events, along with a relevance level $\lambda_j$, $j \in \{1 \ldots M\}$, that could be either precomputed in a lookup table or generated at runtime ($\lambda$-values). Obviously, all the events having $\lambda_j > 1$ will correspond to events that facilitate cooperation, whereas events having $0 < \lambda_j < 1$ will exacerbate competition.
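    The stop-and-go scheme described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's solver: it uses a plain explicit Euler step in place of whichever ODE solver the prototype relies on, the names `scale_matrix`, `cooperation_trend`, `euler_step` and `simulate` are hypothetical, and in-game events are represented simply as a dictionary mapping a step index to a relevance level.

```python
def scale_matrix(params, lam):
    """Apply A_new = lam * A_old: k, l, a, b are scaled, g and h are not."""
    if lam <= 0:
        raise ValueError("the relevance level must be positive")
    k, l, a, b, g, h = params
    return (lam * k, lam * l, lam * a, lam * b, g, h)


def cooperation_trend(state, params):
    """Value of x' + y' at `state`; its sign tells on which side of L* we are."""
    x, y = state
    k, l, a, b, g, h = params
    return (l - a) * x + (k - b) * y + (g + h)


def euler_step(state, params, dt):
    """One explicit Euler step of Richardson's system."""
    x, y = state
    k, l, a, b, g, h = params
    return (x + dt * (k * y - a * x + g), y + dt * (l * x - b * y + h))


def simulate(state, params, dt, n_steps, events):
    """Integrate step by step; `events` maps a step index to a relevance level."""
    trajectory = [state]
    for i in range(n_steps):
        if i in events:  # an in-game event fired: rescale A before going on
            params = scale_matrix(params, events[i])
        state = euler_step(state, params, dt)
        trajectory.append(state)
    return trajectory


# Hypothetical unstable setup (ab < kl) with negative grievance terms.
params = (2.0, 2.0, 1.0, 1.0, -1.0, -1.0)  # k, l, a, b, g, h
start = (1.5, 1.5)
undisturbed = simulate(start, params, 0.01, 100, events={})
disturbed = simulate(start, params, 0.01, 100, events={0: 0.5})
print(undisturbed[-1], disturbed[-1])
```

    With these sample values the undisturbed run drifts towards higher cooperation, while an event with relevance level 0.5 at the first step puts the same starting point on the other side of the dividing line, so cooperation declines.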


    The effect of the $\lambda$-scaling is to change the partitioning of the first quadrant, giving rise from time to time either to a bigger half-plane for cooperation or to a bigger one for competition.

    Finally, the improved Richardson's model presented here can be characterized in terms of an HCP (Hybrid Control Problem) [3]. We will not get into much detail to avoid losing the focus of our investigation, but suffice it to say that an HCP is a system involving both continuous dynamics (usually modelled via an ODE) and controls (often modelled via a Finite State Machine). The system possesses memory affecting the vector field, which changes discontinuously in response to external control commands or to hitting specific boundaries: Therefore, it is a natural fit to treat in-game events like control commands.

    3.3 Converting to the Neverwinter Nights 2 Faction System

    Converting the model output to the NWN 2 faction system is straightforward once the proper values of cooperation have been computed.

    A few function calls are available in NWN Script to adjust the reputation of a single NPC (e.g., AdjustReputation) or of an entire faction (e.g., ga_faction_rep). In NWN 2, faction standings assume a value in the [0, 100] range for each faction: Values in [0, 10] indicate competition (in NWN 2, hostility), whereas values in [90, 100] represent cooperation (in NWN 2, friendship).

    The most straightforward conversion possible would simply use x and y as the faction standings for each faction: x would indicate the way NPCs in faction X feel about people in faction Y and vice versa, clamping the values outside the [0, 100] range. Also, a scaling factor that represents the relative importance of each NPC in a faction can be introduced: It is reasonable to expect that more hostility or friendship would be aroused by people in command positions. Hence, if we split a faction (say X for explanatory purposes) into N different ranks, then we will have some coefficients $\epsilon_i$, with $i \in \{1 \ldots N\}$, such that

    \[
    x_{NWN} = x \, \epsilon_i . \tag{7}
    \]
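    A minimal sketch of this conversion follows, assuming the cooperation level is already expressed on the same [0, 100] scale as NWN 2 standings. The function names, the rank coefficients and the "neutral" label for the middle band are hypothetical; applying the resulting value in game would go through the NWN scripting calls mentioned above, which are not reproduced here.

```python
def to_nwn_standing(cooperation, rank_coefficient=1.0):
    """Scale a cooperation level by an NPC rank coefficient and clamp to [0, 100]."""
    return max(0.0, min(100.0, cooperation * rank_coefficient))


def nwn_attitude(standing):
    """Label a standing using the [0, 10] and [90, 100] bands quoted in the text."""
    if standing <= 10:
        return "hostile"
    if standing >= 90:
        return "friendly"
    return "neutral"  # assumed label for everything in between


# Hypothetical rank coefficients for faction X: commanders react more strongly.
rank_coefficients = {"recruit": 0.5, "soldier": 1.0, "commander": 1.5}
x = 70.0  # cooperation of faction X produced by the model
for rank, coeff in rank_coefficients.items():
    standing = to_nwn_standing(x, coeff)
    print(rank, standing, nwn_attitude(standing))
```

    Clamping after scaling, rather than before, keeps a high-ranking NPC from ever reporting a standing outside the range the faction system accepts.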

    4 Scenarios of Use

    The conceptual framework our model is based on is illustrated in Fig. 3. The level of cooperation (competition) generated by our model is influenced by players' actions in game, but the model will in turn alter the game world perceived by players, as in a feedback loop. The longer term applications of our model, and the main drivers for our efforts, have been the navigation and generation of non-linear gameplay. Besides achieving these more complex goals, though, we also wish to apply our model to the generation of random encounters in a CRPG like Neverwinter Nights.


    Fig. 3. Our model's conceptual framework

    Fig. 4. Representing a game's non-linear plot

    4.1 Navigating Non-Linear Game Narrative

    If a game has narrative content arranged in a non-linear story or short episode, we can visualize its structure as a collection of game scenes (see Fig. 4). Each circle represents either a scene of the game where choices lead to multiple paths, or a scene which will just move the storyline along. Also, a start and an end scene will be included.
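    Such a structure maps naturally onto a directed graph. The sketch below is only an illustration: the scene names are invented, and the helper functions `storylines` and `choice_nodes` are hypothetical. It assumes the plot graph is acyclic, so that every storyline from the start scene to the end scene can be enumerated.

```python
# Hypothetical plot: each scene maps to the scenes reachable from it.
plot = {
    "start": ["ambush", "parley"],   # choice node: two alternative paths
    "ambush": ["aftermath"],         # linear node: just moves the story along
    "parley": ["aftermath"],
    "aftermath": ["end"],
    "end": [],
}


def storylines(plot, scene="start", goal="end"):
    """Enumerate every path from `scene` to `goal` in an acyclic plot graph."""
    if scene == goal:
        return [[scene]]
    paths = []
    for nxt in plot[scene]:
        for tail in storylines(plot, nxt, goal):
            paths.append([scene] + tail)
    return paths


def choice_nodes(plot):
    """Scenes with more than one outgoing edge, i.e. where the story can branch."""
    return [scene for scene, successors in plot.items() if len(successors) > 1]


for path in storylines(plot):
    print(" -> ".join(path))
```

    Counting the enumerated storylines gives a rough measure of how quickly content requirements grow as more choice nodes are added.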

    Fig. 5. Membership functions to model fuzzy cooperation predicates

    We envision attaching scripting logic to each of the nodes where a choice is possible, so that alternative paths are taken based on the current level of competition. Thus, our players will be able to experience different subplots as a result of their own actions and strategies. From a pragmatic point of view, exponential growth of non-linear structures has to be kept under control due to resource implications: A widespread game structure used to preserve non-linear design without leading to unbearable resource consumption is a convexity [10]. Each of the nodes containing scripting logic will incorporate fuzzy rules [21], describing which specific actions should be executed based on the value of fuzzy predicates. We could theoretically use classic logic to express these conditions, but fuzzy logic is very good at expressing formal properties using quasi-natural language. For instance, we might have some scripting logic like below:

    IF cooperationX IS LOW THEN Action1


    IF cooperation IS AVERAGE THEN Action2

    Clearly, suitable fuzzy membership functions are needed, and their current setup is depicted in Fig. 5.

    The net result will be scripting logic that game designers will be able to use and understand without too much hassle, and which will resemble natural language to some extent.
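    Rules of this kind could be evaluated along the following lines. This is a sketch under explicit assumptions: cooperation is taken to lie on a [0, 100] scale, the trapezoidal membership functions are placeholders rather than the ones shown in Fig. 5, and the rule that fires most strongly is the one whose action is selected. All names are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Membership that rises on [a, b], equals 1 on [b, c] and falls on [c, d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Assumed fuzzy sets on a [0, 100] cooperation scale (not the ones in Fig. 5).
FUZZY_SETS = {
    "LOW": (0, 0, 20, 40),
    "AVERAGE": (20, 40, 60, 80),
    "HIGH": (60, 80, 100, 100),
}

# Each rule reads: IF cooperationX IS <label> THEN <action>.
RULES = [("LOW", "Action1"), ("AVERAGE", "Action2")]


def degree(label, cooperation):
    """Degree to which `cooperation` belongs to the fuzzy set `label`."""
    return trapezoid(cooperation, *FUZZY_SETS[label])


def select_action(cooperation):
    """Return the action of the rule that fires most strongly, or None."""
    label, action = max(RULES, key=lambda rule: degree(rule[0], cooperation))
    return action if degree(label, cooperation) > 0 else None


print(select_action(10), select_action(50), select_action(95))
```

    Picking the strongest rule is only one possible resolution strategy; a designer could equally blend the actions of all rules that fire, weighted by their degrees.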

    In practice, conditions are very likely to contain both fuzzy cooperation predicates and crisp conditions relating to common in-game events such as quest completion, item retrieval, etc. in order to trigger scene transitions. Ultimately, the goal we have in mind is to render a new game genre viable, i.e., RPS (Role-Playing Strategic). The best of both worlds, Role-Playing Games and Real-Time Strategy games, is pursued here as a blending of the classic story-driven approach familiar to RPG players with strategic gameplay elements.

    4.2 Generating Random Encounters in Neverwinter Nights

    Random encounters are commonplace in RPGs, for example to attenuate the monotony of traversing very large game areas. Their main potential flaw is that attentive players will not suspend their disbelief, because creatures could be spawned without any apparent rationale at times. Our model can generate values of cooperation/competition over time, and these can be used as cues for the application to inform the random encounters generation process.

    Supposing we are in a scenario where players joined faction X, their actions will cause specific in-game events able to influence the equilibrium of the system. Now, the higher the level of competition of X towards Y, the harder and the more frequent the encounters will be. Also, players will encounter NPCs willing to negotiate truces and/or alliances in case the level of cooperation is sufficiently high, in order to render the interaction more believable and immersive. This improved process for random encounters generation can be designed using fuzzy rules, describing which class of encounters should be spawned based on the level of cooperation.

    Possible rules will resemble this form:




    Such a mechanism could be used to deter players from using a pure hack-and-slash strategy, forcing them to solve puzzles and concentrate on the storyline narrated in game.

    It should be noted that NWN 2 already provides five classes of standard encounters (very easy, easy, normal, hard, very hard), but they all implicitly assume players can only take part in hostile encounters. Ultimately, we envision extending the existing set of encounters with another five classes of encounters tailored to negotiation. Moreover, the grain of the classes is coarse, and a proper defuzzification mechanism could use some of the parameters included in the classes (e.g., number of monsters spawned, etc.) to render it finer. As dictated by our conceptual framework, not only will players be able to influence the level of competition in-game, but they will also experience first-hand the effect of the model on the random encounters in the game world.
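    One way such a defuzzification step might look is sketched below. Everything in it is an assumption made for illustration: the three triangular membership functions, the mapping of cooperation levels to encounter classes, the monster counts, and the "negotiation" class itself, which stands in for the proposed non-hostile encounters. The monster count is obtained as a weighted average of the counts attached to each rule.

```python
def low(x):
    """Cooperation is LOW: 1 at 0, falling linearly to 0 at 50."""
    return max(0.0, min(1.0, (50.0 - x) / 50.0))


def average(x):
    """Cooperation is AVERAGE: triangle peaking at 50."""
    return max(0.0, 1.0 - abs(x - 50.0) / 50.0)


def high(x):
    """Cooperation is HIGH: 0 at 50, rising linearly to 1 at 100."""
    return max(0.0, min(1.0, (x - 50.0) / 50.0))


# Hypothetical rule base: (membership, encounter class, monsters spawned).
ENCOUNTER_RULES = [
    (low, "hard", 6),
    (average, "normal", 3),
    (high, "negotiation", 0),
]


def plan_encounter(cooperation):
    """Return the dominant encounter class and a defuzzified monster count."""
    cooperation = max(0.0, min(100.0, cooperation))  # keep input on the scale
    fired = [(mf(cooperation), cls, n) for mf, cls, n in ENCOUNTER_RULES]
    total = sum(weight for weight, _, _ in fired)
    monsters = sum(weight * n for weight, _, n in fired) / total
    dominant = max(fired, key=lambda rule: rule[0])[1]
    return dominant, monsters


print(plan_encounter(0), plan_encounter(25), plan_encounter(100))
```

    Because the monster count varies continuously with cooperation, encounters between two named classes become possible, which is one reading of rendering the grain of the classes finer.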

    4.3 A Tool to Create Non-Linear Stories

    A tool to create non-linear stories would allow game designers both to interactively script the game structure and to make changes to the structure itself. In order to restructure the game narrative, it is foreseen that a more complex language will be needed that will be able not only to describe the choices occurring in the storyline, but also to script more generic game events. The simplest (and probably most effective) idea we have been considering would see the fuzzy rule systems incorporated through an API exposed by a more generic games-friendly scripting language (e.g., Python, Lua, Javascript, etc.).

    An example of a language used to script narrative content is given by ABL, a reactive-planning language used to script the beats (dramatic units) in the interactive drama Facade [14]. Even though ABL did a good job of scripting Facade's dramatic content, it clearly falls short in terms of complexity of the scriptable actions: All in all, Facade is a piece of interactive drama with a quite sketchy 2D interface, and not a real game (which is what we are really interested in).

    Also, researchers at the University of Alberta proposed an approach based on software patterns to help game designers in story building [8]: ScriptEase, the tool they produced, can be used to automate to some extent the scripting of typical narrative and interaction patterns in Neverwinter Nights. The concept of a formal structure underpinning a story is not new at all, as it was first analyzed at length by Propp in relation to traditional Russian folktales [16]. Despite some criticism of Propp's work, it is our intention to incorporate the core of his arguments so as to recombine essential story elements in multiple ways. This could lead to the generation of new storylines, which can then be manually refined by game designers and writers with less effort. Ideal candidates for this task are evolutionary algorithms, whose power of recombination, driven by an automatic or semi-automatic fitness procedure, has been applied to music [15], graphics [19], and animation [18]. Of course, building a tool to forge non-linear stories is a far-reaching goal outside the scope of our current research, but it is an intention for our future work.

    5 Experimental Results and Discussion

    We have not yet built an entire scenario integrating all the features of our model; hence, we are going to present some results obtained by simulating in-game external events via random number generators. We will analyze the solutions generated by the ODE when selecting specific parameter sets. We will examine the cases listed below:

    1. The strong impact on the system of the parameter set of Richardson's model.
    2. The marginal relevance of different starting points.
    3. The role of the events' probability distribution, and the correlation with -


    Moreover, we will provide an example of interaction between fuzzy rules and the solution computed by the system in a specific scenario: the players are approaching an NPC, and its attitude towards them depends on the current level of competition between their respective factions (and on the fuzzy rules). However, before illustrating our results we will provide some necessary clarifications on the experimental data.

    Firstly, the system trajectories are constrained to a subset of the first quadrant ($I = [0, 100] \times [0, 100]$). Positive values are needed for both x and y as they represent levels of cooperation. Besides, NWN 2 accepts reputation values in the range [0, 100], with lower values indicating a tendency to conflict and antagonism. Secondly, we assumed that if the cooperation value of any faction falls outside the prescribed range it will first be clamped and, after a certain amount of time, reset to random coordinates representing neutrality. This assumption makes sense as we do not want to keep the system in a deadlock for too long. The formulas currently used for resetting the system trajectory are

    $$x = 50 + 25\,(0.5 - r), \qquad y = 50 + 25\,(0.5 - r). \tag{8}$$

    Here r is a random number in the [0, 1] range. Clearly, other formulas could be used, but this method produces interesting and robust results. Our ODE solver, implemented using a second-order Runge-Kutta (midpoint) method, has been hooked to the OnHeartbeat event in NWN 2 (invoked every 6 s). The state of the system was sampled over 5000 iterations, resulting in a time span of around 8.3 hours of real time.
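    The solver step and the reset rule can be sketched as follows. This is a minimal illustration, not the NWN 2 script used in the experiments: it assumes the reset formula reads 50 + 25(0.5 - r), draws r independently for x and y, resets immediately instead of clamping first and waiting, and uses a hypothetical linear right-hand side (`toy_rhs`) in place of the modified Richardson model.

```python
import random

LOW, HIGH = 0.0, 100.0  # reputation range accepted by NWN 2


def midpoint_step(f, state, t, h):
    """One step of the midpoint (second-order Runge-Kutta) method."""
    x, y = state
    kx, ky = f(t, x, y)  # slope at the start of the step
    mx, my = f(t + h / 2, x + h / 2 * kx, y + h / 2 * ky)  # slope at the midpoint
    return x + h * mx, y + h * my


def reset_coordinate(r):
    """Reset formula (8): maps r in [0, 1] to a value in [37.5, 62.5]."""
    return 50 + 25 * (0.5 - r)


def advance(f, state, t, h, rng):
    """Step the system; if it leaves [0, 100] x [0, 100], reset it near neutrality.

    Simplification: the chapter clamps first and resets after a delay,
    whereas this sketch resets immediately.
    """
    x, y = midpoint_step(f, state, t, h)
    if LOW <= x <= HIGH and LOW <= y <= HIGH:
        return x, y
    return reset_coordinate(rng.random()), reset_coordinate(rng.random())


def toy_rhs(t, x, y):
    """Hypothetical linear right-hand side, used only to exercise the solver."""
    return 0.02 * y - 0.01 * x, 0.02 * x - 0.01 * y


if __name__ == "__main__":
    rng = random.Random(0)
    state, h = (50.0, 50.0), 0.1
    for step in range(5000):  # same number of samples as in the experiments
        state = advance(toy_rhs, state, step * h, h, rng)
    print(state)
```

    The step size `h` and the coefficients in `toy_rhs` are placeholders; in the game the step would be driven by the OnHeartbeat event.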

    5.1 ODE Solutions

    Changing the fundamental parameters of the model gives rise to the situation depicted in Fig. 6a-c. Increasing the magnitude of the parameters causes the system trajectory to bounce more often off the borders and to be randomly reset to a new position. In practice, the smaller the coefficients, the more deterministic the system will be. This allows game designers to fine-tune the parameter values to obtain different political scenarios in their storylines, while still being able to predict the average behaviour of the system.

    The marginal role played by starting points in the long-term behaviour of the system is no surprise. Given the random nature of the system (induced by external events and the reset mechanism), the starting point becomes a small factor in the whole picture.

    On the other hand, the events' probability distribution plays a very important role in the system behaviour. We examine a case where only three possible events are allowed: one intensifying the cooperation level, another weakening it, and a last one corresponding to a null event. The effect of this probability distribution is shown in Fig. 7. If we increase the probability of one event over the other, we will witness the system trajectories gathering either around the origin (uttermost competition) or around the opposite corner (total cooperation). This conclusion is true in a probabilistic sense only, because the system can still go through alternating phases. By adjusting the probability distribution, a game designer can adjust the likelihood of a scenario leaning towards cooperation or competition.
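    A three-event generator of this kind can be sketched in a few lines. The event labels and the assignment of probabilities to events below are assumptions made for illustration only; the chapter does not state which entry of P belongs to which event.

```python
import random

# Hypothetical labels for the three event classes described in the text.
EVENTS = ("intensify", "weaken", "null")


def sample_events(probabilities, n, seed=None):
    """Draw n external events according to the given probability distribution."""
    rng = random.Random(seed)
    return rng.choices(EVENTS, weights=probabilities, k=n)


def frequencies(events):
    """Empirical frequency of each event class in a sample."""
    return {name: events.count(name) / len(events) for name in EVENTS}


if __name__ == "__main__":
    # P = {0.05, 0.25, 0.7}; the mapping onto EVENTS is assumed, not given.
    sample = sample_events((0.05, 0.25, 0.7), 5000, seed=42)
    print(frequencies(sample))
```

    Shifting weight from one class to another changes how often the trajectory is pushed towards cooperation or competition, which is the lever described above.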

    Finally, the values of for each coefficient play a role similar to that of the probability distribution (see Fig. 8). Intuitively, the probability distribution acts as a set of weights for the -values, even though a formal proof of this argument still needs to be provided.

    Fig. 6. Increasing the magnitude of the parameters causes the system trajectory to bounce more often off the borders: (a) a very simple system trajectory; (b) a more complex trajectory; (c) a very complex and non-deterministic system trajectory

    5.2 The Role of Fuzzy Rules

    In Sect. 4.1 we have described an approach to navigating non-linear narrative. We present here a scenario based on such ideas that can shed light on the use of fuzzy rules in our system. We will suppose our ODE is computing a solution over time using a specific parameter set determined using the guidelines given in the previous subsection. Fuzzy rules are created to provide control over the game story progression. The level of competition in the game will be influenced by the events generated by PCs and NPCs, and this in turn will cause the story to be channelled to specific branches whose logic is controlled by the rules (see Fig. 9).

    Fig. 7. The effect of a probability distribution P = {0.05, 0.25, 0.7}

    Fig. 8. The effect of = {0.025, 1.05}

    Fig. 9. Different branches of the story are taken because of different levels of cooperation

    For instance, suppose a specific scene of the game revolves around the relationship between the PC and an influential NPC. This character will tend to approach hostile and violent PCs with a servile disposition while reacting with hostility to friendly players, perceiving them as weak. Neutral players will be treated with neutral distrust. The rules used in this case are:




    Clearly, coopX is a predicate describing the PC faction's predisposition towards the NPC, and vice versa for coopY. The fuzzy membership functions used are portrayed in Fig. 5. This simple setup is sufficient to generate distinct outputs that result in different routes for the storyline, and hedge operators were not necessary in this specific situation. Figure 10a,b shows the output surface of the fuzzy inference and an evaluation example.

    (a) The output surface.

    (b) An evaluation of the fuzzy inference.

    Fig. 10. Our fuzzy rules in action


    5.3 Discussion

    We plan to analyze the output of the ODE in more depth: more classes of events or more complex probability distributions may lead to more interesting behaviour, but possibly at the expense of too much complexity. The interaction between the ODE and the fuzzy rules presented here will be further tested and refined. Ultimately, the approach seems to offer very compelling features that may lead to its adoption in real-world projects:

    1. The ODE output produces variable but stable behaviour that can betweaked at will by game designers and programmers.

    2. The fuzzy rules needed to navigate game storylines tend to be simple, and they are easily modified even by game designers because of their expressive power.

    3. Fuzzy rules also allow for smooth control over the different routes available in a game story.

    6 Conclusions and Future Work

    We introduced our modified version of Richardson's model which, based on a stop-and-go variant, provides game designers with a tool to introduce political scenarios in their story-driven games and game mods [1]. We have discussed the formal properties of the model (which can be more formally regarded as a Hybrid Control Problem), and analyzed some stochastic patterns that are likely to be generated by the factions' behaviour. We also analyzed how such patterns can interact with a scripting model based on fuzzy rules.

    The next step in our work will entail the production of Two Families, a Neverwinter Nights 2 module designed to showcase the properties of our model. Two Families will incorporate both random encounters and a non-linear story as described in this chapter. Clearly, the interaction between the ODE and the fuzzy rules will be further refined and improved to cater for this real-world scenario. Finally, a validation of the whole framework from a user interaction perspective will be conducted.


    References

    1. Definition of a game mod. http://www.answers.com/topic/mod-computer-gaming.

    2. W. Boyce and R. DiPrima. Elementary Differential Equations and Boundary Value Problems. Wiley, Hoboken, 2004.

    3. M. S. Branicky. General hybrid dynamical systems: Modeling, analysis, and control. In Hybrid Systems, pages 186–200, 1995.

    4. M. E. Bratman. Intentions, Plans, and Practical Reason. Harvard University Press, Cambridge, MA, 1987.

    5. R. Burke and B. Blumberg. Using an ethologically-inspired model to learn apparent temporal causality for planning in synthetic creatures. In First International Joint Conference on Autonomous Agents and Multiagent Systems, pages 326–333, 2002.

    6. C. Crawford. Chris Crawford on Interactive Storytelling. New Riders, Berkeley, 2003.

    7. C. Crawford. Chris Crawford on Game Design. New Riders, Berkeley, 2004.

    8. M. Cutumisu, C. Onuczko, D. Szafron, J. Schaeffer, M. McNaughton, T. Roy, J. Siegel, and M. Carbonaro. Evaluating pattern catalogs: the computer games experience. In Proceedings of the 28th International Conference on Software Engineering (ICSE '06), pages 132–141, 2006.

    9. R. Evans. AI Game Programming Wisdom, chapter Varieties of Learning, pages 567–578. Charles River Media, Hingham, 2002.

    10. N. Falstein. Introduction to Game Development, chapter Understanding Fun: The Theory of Natural Funativity, pages 71–97. Charles River Media, Hingham, 2005.

    11. G. Freytag. Freytag's Technique of the Drama. Griggs, Boston, 1995.

    12. D. Fu and R. Houlette. AI Game Programming Wisdom 2, chapter The ultimate guide to FSMs in games, pages 283–302. Charles River Media, Hingham, 2004.

    13. R. Hunicke, M. LeBlanc, and R. Zubek. MDA: A formal approach to game design and game research. In Proceedings of the AAAI-04 Workshop on Challenges in Game AI, pages 1–5, 2004. Available online at http://www.cs.northwestern.edu/hunicke/pubs/MDA.pdf.

    14. M. Mateas and A. Stern. Structuring content in the Facade interactive drama architecture. In AIIDE, pages 93–98, 2005.

    15. E. R. Miranda and A. Biles, editors. Evolutionary Computer Music. Springer, New York, 2007.

    16. V. Propp. Morphology of the Folktale. University of Texas Press, Austin, 1968.

    17. L. Richardson. Arms and Insecurity. Boxwood, Pittsburgh, 1960.

    18. K. Sims. Artificial evolution for computer graphics. In Proceedings of the SIGGRAPH Conference, pages 319–328, 1991.

    19. K. Sims. Evolving virtual creatures. In Proceedings of the SIGGRAPH Conference, pages 15–22, 1994.

    20. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

    21. L. Zadeh. Outline of a new approach to the analysis of complex systems. IEEE Transactions on Man, Systems and Cybernetics, 3:28–44, 1973.

  • A Time Interval String Model for Annotating and Searching Linear Continuous Media

    Ken Nakayama¹, Kazunori Yamaguchi², Theodorus Eric Setiadi³, Yoshitake Kobayashi³, Mamoru Maekawa³, Yoshihisa Nitta⁴, and Akihiko Ohsuga³

    ¹ Institute for Mathematics and Computer Science, Tsuda College, 2-1-1 Tsuda-cho, Kodaira-shi, Tokyo 187-8577, Japan, ken@ohsuga.is.uec.ac.jp

    ² Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan, yamaguch@graco.c.u-tokyo.ac.jp

    ³ Graduate School of Information Systems, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan, eric@ohsuga.is.uec.ac.jp, yoshi-k@ohsuga.is.uec.ac.jp, maekawa@ohsuga.is.uec.ac.jp, akihiko@ohsuga.is.uec.ac.jp

    ⁴ Department of Computer Science, Tsuda College, 2-1-1 Tsuda-cho, Kodaira-shi, Tokyo 187-8577, Japan, nitta@tsuda.ac.jp

    Summary. Time flow is the distinctive structure of various kinds of data, such as multimedia movies, electrocardiograms, and stock price quotes. To make good use of these data, locating a desired instant or interval along the time axis is indispensable. In addition to domain specific methods like automatic TV program segmentation, there should be a common means of searching these data according to the changes along the time flow.

    In this chapter, the I-string and I-regular expression framework is presented together with some examples and a matching algorithm. I-string is a symbolic, string-like annotation model for continuous media which has a virtual, continuous, branchless time flow. I-regular expression is a pattern language over I-strings, which is an extension of the conventional regular expression for text search. Although continuous media are often treated as a sequence of time-sliced data in practice, the framework adopts a continuous time flow. This abstraction allows the annotation and the search query to be independent of low-level implementation details such as the frame rate.

    1 Introduction

    When processing data with time flow, such as movies and continuously observed data from sensors, the order of what happened and the time length of each occurrence are the most important characteristics. A common model and tools depicting these characteristics will be a good basis for managing such data, allowing users to concentrate on the domain specific analysis of the data.

    K. Nakayama et al.: A Time Interval String Model for Annotating and Searching Linear Continuous Media, Studies in Computational Intelligence (SCI) 96, 139–163 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    In this chapter, the I-string and I-regular expression framework is presented: a symbolic, string-like annotation model for continuous media and a pattern language over that annotation, respectively.

    1.1 Linear Continuous Media

    Continuous time flow is one of the most prominent and universal structures of our world. Linear continuous media are data that have a continuous, linear (not branching) time flow as their structure, such as multimedia streams and scientific monitoring data. There are various continuous time media. Video and audio streams are used as carriers for a wide range of contents such as news, drama, and music. There also exist domain specific continuous media such as earthquake waveforms in seismology, electrocardiograms, and financial stock quotes. State-of-the-art technology for capturing, mass storage, and broadband communication makes a considerable amount of such media available.

    The real value, beyond basic availability as a mere collection of media, will be in the capability of accumulating knowledge about those media as annotation. This enables searching through the media in response to a variety of users' requests. The demand is now for an effective way of searching the archives for a portion of current interest. To use such bulky archives effectively, a concise way of searching and editing continuous media is indispensable.

    1.2 Motivation

    The most important characteristics of continuous media are the order of what happened and the time length of each occurrence. It is very common to process continuous media depending on such conditions. For example, one may characterize the scoring scene of a videotaped soccer game by "a large white rectangle (the goal) appearing in the scene for longer than 1.2 s, followed by slowly changing frames for longer than 2.3 s (slow replay)". One may characterize one's own "buy signal" for a stock by "the time when the price goes down or up by less than 0.2% for longer than five business days".

    The process can be divided into domain specific analysis and order-and-time related analysis. In the above examples, "recognize a white rectangle", "recognize slowly changing frames", and "recognize a daily stock price change of less than 0.2% up" are domain specific. On the other hand, "recognize the time that is longer than..." is order-and-time related analysis. If provided with a good framework and tools for order-and-time related processing, the user can easily define, modify, and combine these conditions. This allows the user to concentrate on domain specific analysis. This overview is shown in Fig. 1.

    In computers, continuous media is usually treated as a sequence of discrete data. For example, a movie is a sequence of frames. Explicit discrete treatment introduces an undesirable, non-essential, artificial quantization into the time flow. This prohibits a clear separation between abstract editing semantics and low-level implementation details such as the frame rate. We do not impose such a restriction on the model.

    Fig. 1. The overview of the scheme. Raw data from the media domains (stock price quote, video stream, trajectory of a pedestrian) are modeled as continuous media by abstracting the frame rate; a domain specific analyzer produces the annotation as an I-string over a continuous time flow (domain specific analysis), and an I-regular expression is used as a pattern against the I-string to obtain the matches (order-and-time analysis)

    Continuous media is often edited: it is cut at some point, parts are extracted, the order is changed, and pieces are concatenated. Basically, annotations associated with the media should be retained throughout these operations. We would like the model to realize this naturally.

    1.3 Design of the Framework

    As an intuitively natural form of annotation, we have adopted a linear string that has a clear correspondence to the original continuous media. When cutting and concatenating a continuous media, annotations are retained, too, by performing operations on them in parallel with the original ones on the media.

    To make the annotation independent of the low-level frame rate, the annotation string should be virtually continuous, that is, it should be possible to divide the annotation at any time position. By abstracting the low-level quantized representation of the time flow, operations defined on the virtual continuous time become applicable to media regardless of their frame rate.

    Based on the observation that annotation is necessary both for an interval and for a specific point of continuous media, we identify two types of attribute symbols: a symbol for a finite time interval, and a symbol for an instant. For example, when a domain specific recognizer locates a time position at which to cut the media, annotation on that instant of time is necessary. On the other hand, a recognizer may identify some interval of time in which the media satisfies some condition, for example, "the temperature is below the dew point" for the record of a thermal sensor. Thus, annotation for an interval is necessary.

    As a pattern language for the annotation string, the conventional regular expression is extended. The conventional regular expression [6] is a commonly used tool for searching character strings. This makes the pattern language easier to understand and to learn.

    This chapter presents a framework in which users can express their intentions easily and naturally. This work provides a complete matching algorithm by extending our previous work [4]. Emphasis is put on logical expressiveness. We discuss neither a fancy user interface nor a physical access method in this chapter.

    2 Annotation and Search Model for Continuous Media

    The proposed framework for annotating and manipulating linear continuous media consists of two models: (1) the I-string annotation model and (2) the I-regular expression search model based on the I-string annotation. The framework provides a concise way of searching and editing the media. An I-string represents the content of a continuous media, reflecting the purpose of the search done on it. I-regular expression is a pattern language for I-strings. Since the syntax and intuitive semantics of I-regular expressions are similar to those of conventional regular expressions, they should be easy to use.

    Annotation should be able to refer to some temporal part of a media, since linear continuous media is characterized by its time coordinate. I-string annotation is in the form of a string which consists of two types of annotation symbols: one for attributes at a specific moment and the other for a time interval on a media. In the framework, a continuous media is annotated with descriptive information, or attributes, reflecting its content and purpose. These attributes may be given manually or by automatic indexing [9]. The way of expressing raw continuous media data as an I-string is beyond the scope of this work. Attributes may be extracted from the raw data either manually or automatically by image processing, or may be elaborated by a specialized editor as additional value. So, the assumption that we are going to work on the attributes is not unrealistic, and most existing systems rely on it.

    2.1 I-String: D-Symbol and I-Symbol

    An I-string is a string of two types of symbols, namely I-symbols and D-symbols. An I-symbol, for example $v^{2.6}$, has a positive time duration and depicts the attribute for that time interval, while a D-symbol, for example $\dot{g}$, represents the attribute of an instant. Without D-symbols, we would have to assign an artificially small amount of time to an instantaneous event. So, these two types of attributes are mandatory for modeling a continuous media naturally.

    The suffix of an I-symbol is called its time duration; it represents the time duration of the event and should be a positive real number. For example, the time duration of $v^{2.6}$ is 2.6. Within the order-and-time related analysis, each of the symbols like $v$ and $\dot{g}$ is treated as just a symbol to which any meaning (attribute) can be associated by domain specific analyzers, say "the content is drama" for a TV program, or "the position is in a specific area" for the trajectory of a moving robot. The I-string annotation model is illustrated in Fig. 2.

    Now, consider video data in which a (commercial) ad. lasts 2 min, drama lasts 10 min, ad. lasts 6 min, drama lasts 12 min, ad. lasts 5 min, drama lasts 14 min, and ad. lasts 6 min, as illustrated in Fig. 3a. Let $v$ be the symbol for the ad. attribute and $w$ for the drama attribute. This is as if a machine with two states $v$ and $w$ were recorded onto a 55-min movie, with the state changing as the time advances. Using I-symbols $v^{+}$ and $w^{+}$, the annotation would be the I-string $v^{2}\,w^{10}\,v^{6}\,w^{12}\,v^{5}\,w^{14}\,v^{6}$.
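    As an illustration only, this annotation can be held in a plain list of (attribute, duration) pairs; the representation and the function names below are assumptions of this sketch, not part of the framework. Summing the durations confirms that the seven segments add up to the 55-min movie.

```python
# Each I-symbol is an (attribute, duration) pair; durations are in minutes.
# "v" stands for the ad. attribute and "w" for the drama attribute.
programme = [("v", 2), ("w", 10), ("v", 6), ("w", 12), ("v", 5), ("w", 14), ("v", 6)]


def i_length(i_string):
    """Total time duration covered by the I-string."""
    return sum(duration for _, duration in i_string)


def time_per_attribute(i_string):
    """Accumulated duration of each attribute."""
    totals = {}
    for name, duration in i_string:
        totals[name] = totals.get(name, 0) + duration
    return totals


if __name__ == "__main__":
    print(i_length(programme))            # total running time in minutes
    print(time_per_attribute(programme))  # minutes of ad. and of drama
```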

    Fig. 2. I-string annotation model. A video stream over a continuous time flow is annotated as an I-string; an I-symbol carries the attribute for an interval together with its time duration, and a D-symbol carries the attribute for an instant

    Fig. 3. Symbolic representation of a continuous media: (a) $v^{2}\,w^{10}\,v^{6}\,w^{12}\,v^{5}\,w^{14}\,v^{6}$; (b) $v^{2}\,w^{10}\,v^{6}\,w^{12}\,v^{5}\,w^{9}\,\dot{e}\,w^{5}\,v^{6}$

    In addition to the state changes, an event occurring at a specific moment can be represented with D-symbols. For example, we may use an attribute $\dot{e}$ for marking the climax (Fig. 3b). If the climax event $\dot{e}$ comes 5 min before the end of the 14-min drama fragment, the I-symbol $w^{14}$ is split into $w^{9}$ and $w^{5}$, and the D-symbol $\dot{e}$ is placed between them, giving $w^{9}\,\dot{e}\,w^{5}$. Notice that the time duration of $w^{9}\,\dot{e}\,w^{5}$ remains 14. In this way, we can construct a string of attributes describing the continuous media. Here, we assume that each attribute is represented by a symbol.
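    The splitting step can be sketched as follows, again with a hypothetical list-of-pairs representation in which a duration of 0 marks a D-symbol. The helper `insert_instant` is an illustrative name, not an operation defined in the chapter.

```python
programme = [("v", 2), ("w", 10), ("v", 6), ("w", 12), ("v", 5), ("w", 14), ("v", 6)]


def insert_instant(i_string, at, symbol):
    """Insert a D-symbol at time offset `at`, splitting an I-symbol if needed."""
    total = sum(duration for _, duration in i_string)
    if not 0 <= at <= total:
        raise ValueError("instant lies outside the I-string")
    result, elapsed, placed = [], 0, False
    for name, duration in i_string:
        end = elapsed + duration
        if not placed and at <= elapsed:
            # The instant falls on a boundary: no split is required.
            result.append((symbol, 0))
            placed = True
        if not placed and elapsed < at < end:
            # The instant falls inside this I-symbol: split it in two.
            result += [(name, at - elapsed), (symbol, 0), (name, end - at)]
            placed = True
        else:
            result.append((name, duration))
        elapsed = end
    if not placed:
        result.append((symbol, 0))  # the instant is the very end of the media
    return result


if __name__ == "__main__":
    # The 14-min drama spans minutes 35 to 49, so the climax is at minute 44.
    print(insert_instant(programme, 44, "e"))
```

    The total duration is unchanged by the insertion, which mirrors the remark that $w^{9}\,\dot{e}\,w^{5}$ still lasts 14 min.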

    2.2 I-Regular Expression

    I-regular expression is a search pattern against I-strings, which provides a simple but expressive means of searching continuous media. Once continuous media are annotated with I-strings, a variety of searches can be done using I-regular expressions based on those attributes. Suppose that the progress and treatment of a patient are described as an I-string by encoding the patient's condition with $a^{+}$ and $b^{+}$, representing "normal" and "critical", respectively, and the giving of a tablet of two types of medication with $\dot{m}$ and $\dot{n}$, respectively. Then, an I-regular expression query

    $$b^{+}\,(\dot{m}\,\dot{m}^{*} \mid \dot{n}\,\dot{n}^{*})\,(a^{+} \mid b^{+})^{*\,(0,\,1.4]}_{\mathrm{long}}$$

    matches a part of the record where the patient is in the critical condition and is then given one or more tablets of one of the two types of medicine, but not mixed, together with at most 1.4 h of the progress after the treatment. The search model is illustrated in Fig. 4.

    An I-regular expression can specify (1) the order of occurrences of symbols, together with (2) constraints on the time duration of a specific portion. As the domain of time, we adopt the real numbers, and for an interval, we adopt an interval of real numbers. Some systems [8] implicitly limit the time domain to the integers. This may cause inconvenience when media with different frame rates are treated together.

    I-regular expression is an extension of the conventional regular expression [6]. The extensions are (1) symbols which match any I-length of an I-symbol, and (2) time interval constraints (I-constraints). D-symbols are equivalent to conventional symbols, or string characters. In other words, if you use only D-symbols and the constructs other than the I-constraint, the result is equivalent to the regular expressions with which you are familiar from text editors. Since the regular expression is commonly used for specifying patterns for character strings, the proposed search framework should be easy to understand for a wide range of users.
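    The D-symbol-only fragment can be tried directly with a conventional regular expression engine. The sketch below uses Python's standard `re` module and encodes each D-symbol as one character; it covers only the "one or more tablets of a single type" part of the query and says nothing about I-symbols or I-constraints, which a conventional engine cannot express.

```python
import re

# Each character stands for one instantaneous event (D-symbol):
# "m" and "n" are tablets of the two types of medication.
UNMIXED_DOSE = re.compile(r"mm*|nn*")


def is_unmixed_dose(events):
    """True if `events` is one or more tablets of a single type."""
    return UNMIXED_DOSE.fullmatch(events) is not None


if __name__ == "__main__":
    for sample in ("m", "mmm", "nn", "mn", ""):
        print(repr(sample), is_unmixed_dose(sample))
```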

    The pattern matcher, presented in the later sections, enumerates the possible matches of the given I-regular expression against the I-string, and extracts the corresponding sub-I-strings if an extraction directive is specified in the I-regular expression.

    Fig. 4. Search model using I-regular expression. In the pattern $b^{+}\,(\dot{m}\,\dot{m}^{*} \mid \dot{n}\,\dot{n}^{*})\,(a^{+} \mid b^{+})^{*\,(0,\,1.4]}_{\mathrm{long}}$, an I-symbol matches an arbitrary positive time duration (an I-symbol can be divided at any position); a D-symbol matches exactly one D-symbol; $*$ is repetition of 0 or more times; $\mid$ is choice; "long" means as long as possible; and $(0, 1.4]$ means longer than 0 and shorter than or equal to 1.4

    2.3 Related Work

    In OVID [5], an interval is expressed as a sequence of video frames, and the operations of the interval logic [1] are used for their manipulation. This method is simple, but for a user, to say "the drama lasts 30 min" is better than to say "the drama lasts from frame #2540 to frame #3439". So, the frame oriented time model is not suitable for a user's query description. In OVID, a kind of logical time can be expressed by an attribute "Year" and its value "1974"; however, the time oriented semantics of the attribute domain is not explicitly stated. In [8], logical time duration is used for an event calculus. The logical time duration is independent of the underlying sampling rate. This property is suitable for use in queries. Using the logical time, we encode the part in which an attribute $a$ lasts for $l$ min by $a^{l}$.

    The finite automaton [6] is a well-known machine model for the regular expression. The extension of the finite automata theory is shown in the last part of Sect. 4. The automaton shown in Sect. 4 is nondeterministic, and the standard determinization procedure known as the subset construction is not applicable because the alphabet in our model is not a finite set. We developed an effective depth-first search procedure for determining acceptance or rejection in Sect. 5. In the remainder of this chapter, we describe the framework rather formally, so that it serves as the foundation for further study.

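    3 I-String Annotation Model for Continuous Media

    3.1 I-String

    I-string is an annotation model for continuous media. There are two types of annotation symbols: one for an instant, and the other for an interval of time. Let $\dot{\Sigma}$ and $\Sigma^{+}$ be mutually disjoint ($\dot{\Sigma} \cap \Sigma^{+} = \emptyset$) finite sets of symbols. Each symbol in $\dot{\Sigma}$ is called a D-symbol (discrete), which denotes an event at a specific instant. Each symbol in $\Sigma^{+}$ is called an I-symbol (interval), which denotes a state that lasts for some time duration. For clarity, a D-symbol is marked with a dot, like $\dot{m}$, $\dot{n}$, while an I-symbol is written with $+$, like $v^{+}$, $w^{+}$.

    An I-string over alphabet $(\dot{\Sigma}, \Sigma^{+})$ is a sequence of symbols $\sigma_1 \sigma_2 \cdots \sigma_n$ ($\sigma_i \in \dot{\Sigma} \cup \Sigma^{+}$, $1 \le i \le n$) with associated time durations $d_1 d_2 \cdots d_n$. The time duration is also called the I-length. Since a symbol in $\dot{\Sigma}$ denotes an instant, $d_i = 0$ if $\sigma_i \in \dot{\Sigma}$, while $d_i > 0$ for $\sigma_i \in \Sigma^{+}$. Notice that zero and negative time durations are not allowed for symbols in $\Sigma^{+}$. The empty string is $\varepsilon$.

    As a simple notation for I-strings, we will omit the time duration 0 for $\sigma_i \in \dot{\Sigma}$ and write $d_i$ in place of $+$ for $\sigma_i \in \Sigma^{+}$. For example, when $\dot{\Sigma} = \{\dot{m}, \dot{n}\}$ and $\Sigma^{+} = \{v^{+}, w^{+}\}$, $v^{5.5}\,v^{0.2}\,\dot{m}\,\dot{n}\,w^{1.3}$ is a shorthand for the following I-string:

    $i$                                1          2          3          4          5
    Symbol $\sigma_i$                  $v^{+}$    $v^{+}$    $\dot{m}$  $\dot{n}$  $w^{+}$
    I-length (time duration) $d_i$     5.5        0.2        0          0          1.3

    Some examples of I-strings are as follows:

    $\varepsilon$, $\dot{m}$, $\dot{m}\dot{m}\dot{n}$, $v^{1}w^{2}$, $v^{5.5}v^{0.2}\dot{m}\dot{n}w^{1.3}$, $\dot{m}v^{3}\dot{m}$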

    If I-strings $\alpha_1$ and $\alpha_2$ are the same as a sequence of symbols, we denote this by $\alpha_1 \equiv \alpha_2$. Two kinds of lengths are defined for an I-string. For an I-string $\alpha$, its symbol length refers to the number of symbols, while its I-length $I(\alpha)$ is the sum of the time durations. For example, when $\alpha = u^{5.4}\,u^{3.5}\,\dot{g}\,v^{1.8}\,\dot{m}\,\dot{m}\,v^{5.9}$, its symbol length is 7 and $I(\alpha) = 5.4 + 3.5 + 1.8 + 5.9 = 16.6$. We assume that an I-string has finite symbol length and finite I-length.
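    The two lengths, and the constraint that D-symbols have duration 0 while I-symbols have positive duration, can be checked with a short sketch. The list-of-pairs representation, the alphabets `D_SYMBOLS` and `I_SYMBOLS`, and the function names are assumptions chosen for illustration; `Decimal` is used so that the sum of the durations is exact.

```python
from decimal import Decimal

# Hypothetical alphabets for the example: D-symbols and I-symbols are disjoint.
D_SYMBOLS = {"g", "m", "n"}
I_SYMBOLS = {"u", "v", "w"}

# The example I-string with symbol length 7 from the text.
alpha = [
    ("u", Decimal("5.4")), ("u", Decimal("3.5")), ("g", Decimal("0")),
    ("v", Decimal("1.8")), ("m", Decimal("0")), ("m", Decimal("0")),
    ("v", Decimal("5.9")),
]


def is_valid(i_string):
    """D-symbols must have duration 0 and I-symbols a positive duration."""
    return all(
        (name in D_SYMBOLS and duration == 0)
        or (name in I_SYMBOLS and duration > 0)
        for name, duration in i_string
    )


def symbol_length(i_string):
    """Number of symbols in the I-string."""
    return len(i_string)


def i_length(i_string):
    """Sum of the time durations of all symbols."""
    return sum(duration for _, duration in i_string)


if __name__ == "__main__":
    print(is_valid(alpha), symbol_length(alpha), i_length(alpha))
```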

    3.2 I-Normal Form of I-String

    Concatenation and substring extraction are among the basic operations necessary for editing I-strings. We would like to introduce intuitive naturalness into the interpretation of I-strings. Suppose that we have a 60-min long surveillance movie taken at a traffic crossing. If there is no traffic accident during those 60 min, its annotation would be $v^{60}$, denoting the no-accident situation with $v^{+}$. If you cut the movie into two short movies of 25 min and 35 min, their annotations should be $v^{25}$ and $v^{35}$, respectively. If you concatenate these two short movies, you would expect to get the original 60-min long one, whose annotation should be $v^{60}$. If $v^{3.2}$ is followed by $v^{3.8}$, we see that the attribute $v$ lasts for 7 units of time without interruption. So, we may identify $v^{3.2}v^{3.8}$ with $v^{7}$.

    Table 1. Equivalent I-strings with respect to =

    I-string                                                 Symbol length
    $v^{5}\,\dot{m}\,v^{1}$                                  3 (minimum), I-normal form
    $v^{2}\,v^{3}\,\dot{m}\,v^{1}$                           4
    $v^{1}\,v^{4}\,\dot{m}\,v^{1}$                           4
    $v^{4.5}\,v^{0.1}\,v^{0.4}\,\dot{m}\,v^{0.5}\,v^{0.5}$   6

    This suggests that an I-symbol, say $v^{7}$, should be able to be arbitrarily divided into $v^{3}v^{4}$ or $v^{1.8}v^{3.1}v^{2.1}$, or concatenated back into the original I-symbol, as long as its I-length remains the same. To reflect this, we introduce an equivalence relation = over I-strings. For any successive occurrences of the same I-symbol in an I-string, such as $v^{d_1}v^{d_2}\cdots v^{d_m}$ and $v^{e_1}v^{e_2}\cdots v^{e_n}$, they are equivalent to each other with respect to = iff their sums of I-lengths are the same:

    $$v^{d_1}v^{d_2}\cdots v^{d_m} = v^{e_1}v^{e_2}\cdots v^{e_n}, \quad \text{where } d_1 + \cdots + d_m = e_1 + \cdots + e_n. \tag{2}$$

    Among the equivalent I-strings with respect to =, there exists a unique I-string which has the minimum symbol length. We call such an I-string the I-normal form. You can get the I-normal form by merging all occurrences of the same I-symbol that appear adjacently. For example, the I-normal form of $v^{4.5}\,v^{0.1}\,v^{0.4}\,\dot{m}\,v^{0.5}\,v^{0.5}$ is $v^{5}\,\dot{m}\,v^{1}$ (Table 1).

    In contrast, no such relation is defined for successive D-symbols. D-symbols are not divisible; that is, the number of occurrences, for example 3 for mmm, is significant.
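    The merging rule can be sketched in a few lines of Python. The list-of-pairs representation (D-symbols carry the duration None) and the function name i_normal_form are assumptions chosen for this illustration.

```python
from fractions import Fraction

def i_normal_form(istring):
    """Merge adjacent identical I-symbols; leave D-symbols untouched."""
    result = []
    for symbol, duration in istring:
        is_i_symbol = duration is not None
        if (is_i_symbol and result
                and result[-1][0] == symbol and result[-1][1] is not None):
            # Same I-symbol directly after itself: add the durations.
            result[-1] = (symbol, result[-1][1] + duration)
        else:
            # D-symbols are never merged, so "mmm" keeps three entries.
            result.append((symbol, duration))
    return result

F = Fraction
sigma = [("v", F("4.5")), ("v", F("0.1")), ("v", F("0.4")), ("m", None),
         ("v", F("0.5")), ("v", F("0.5"))]
print(i_normal_form(sigma))  # the I-normal form v^5 m v^1 of Table 1
```

    Two I-strings are then equivalent with respect to = exactly when their normal forms computed this way are identical.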

    4 I-Regular Expression

    4.1 I-Regular Expression and Its Regular Language

    An I-regular expression is a pattern for I-strings. An I-regular expression r represents a set of I-strings, L(r), which is called the regular language defined by r. An I-regular expression over an alphabet $(\Sigma, \Gamma^{+})$ of D-symbols and I-symbols is defined recursively as shown in Table 2. The I-regular expression $\varepsilon$ matches the empty I-string $\varepsilon$, a D-symbol m as an I-regular expression matches exactly one occurrence of m in an I-string, and an I-symbol $v^{+}$ matches an arbitrary positive time duration of that symbol in an I-string. These primitive I-regular expressions can be combined recursively by the choice, concatenation, and repetition operators.

    The I-symbol and the I-constraint are the extensions; the rest is the same as in conventional regular expressions. An I-symbol as an I-regular expression matches an arbitrary I-length of that I-symbol in an I-string.

  • 148 K. Nakayama et al.

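    Table 2. Definition of I-regular expression over alphabet $(\Sigma, \Gamma^{+})$ (r, $r_1$, and $r_2$ are I-regular expressions)

    Construct | I-regular expression | Regular language (set of I-strings)
    Empty I-string | $\varepsilon$ | $L(\varepsilon) = \{\varepsilon\}$
    D-symbol | $m \in \Sigma$ | $L(m) = \{m\}$
    I-symbol | $v^{+} \in \Gamma^{+}$ | $L(v^{+}) = \{ v^{l} \mid 0 < l \}$
    Choice | $(r_1 \mid r_2)$ | $L((r_1 \mid r_2)) = L(r_1) \cup L(r_2)$
    Concatenation | $(r_1 r_2)$ | $L((r_1 r_2)) = L(r_1) L(r_2)$
    Repetition | $(r)^{*}$ | $L((r)^{*}) = L(\varepsilon) \cup L(r) \cup L(rr) \cup \cdots$
    I-constraint, for a non-negative continuous interval $\Lambda$ | $(r)_{\Lambda}$ | $L((r)_{\Lambda}) = \{ \sigma \mid \sigma \in L(r),\ I(\sigma) \in \Lambda \}$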

    An I-constraint restricts the I-length of the specified part (sub-I-regular expression) of an I-string. $\Lambda$ is a non-negative real interval such as $(0, 3.1]$ or $[22.9, \infty)$. Each end point of the interval may be either open or closed, independently. Parentheses may be omitted unless the expression becomes ambiguous. For example, the parentheses in $(rs)_{\Lambda}$ should not be removed, since the result could be confused with $r(s)_{\Lambda}$. If we want to disregard m and identify $v^{9}\, m\, v^{6}$ with $v^{15}$, we can use the following pattern:

    $$\left( (v^{+} \mid m)^{*} \right)_{[15,\,15]}. \tag{3}$$

    Some other examples of I-regular expressions and the corresponding regular languages are shown in Table 3. Notice that the I-strings in L(r) are compared based on the equality =.
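    The I-length condition of an I-constraint can be sketched as follows. The Interval class and the function satisfies_i_constraint are hypothetical names introduced here; the sketch checks only the interval condition on the I-length, not membership in L(r).

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float               # use math.inf for an unbounded upper end
    lo_closed: bool = True
    hi_closed: bool = True

    def contains(self, x):
        above = x >= self.lo if self.lo_closed else x > self.lo
        below = x <= self.hi if self.hi_closed else x < self.hi
        return above and below

def satisfies_i_constraint(istring, interval):
    """True if the I-length of the I-string lies in the interval."""
    total = sum(d for _, d in istring if d is not None)
    return interval.contains(total)

exactly_15 = Interval(15, 15)                       # the interval [15, 15]
print(satisfies_i_constraint([("v", 9), ("m", None), ("v", 6)], exactly_15))
print(satisfies_i_constraint([("v", 15)], exactly_15))
print(Interval(22.9, math.inf, hi_closed=False).contains(30))
```

    Both $v^{9}\, m\, v^{6}$ and $v^{15}$ pass the $[15, 15]$ check, which is why pattern (3) identifies them once m is disregarded.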

    4.2 Reference to Substring of Match Result

    The use of I-regular expressions is twofold: (1) Yes/No queries, and (2) substring reference to the matching result. We say that an I-string $\sigma$ matches the pattern r iff $\sigma \in L(r)$; otherwise $\sigma$ does not match r. The simplest type of query is "Does I-string $\sigma$ match the pattern r?" The answer would be Yes or No. Suppose that there is a movie of a car race between three cars a, b, and c. If we are interested in the change of the leading car during the race, the movie can be annotated as an I-string over alphabet $(\Sigma, \Gamma^{+}) = (\{\}, \{a^{+}, b^{+}, c^{+}\})$. For example, the annotation might be the I-string below:

    $$c^{8}\, a^{12}\, b^{4.5}\, c^{1.8}\, b^{0.5}\, a^{6.3}\, c^{11}\, b^{14.8}\, c^{2}. \tag{4}$$

    Does b win the race? The leader at the end of the race is the winner. If the I-string matches the following I-regular expression, the answer is Yes.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\, b^{+}. \tag{5}$$


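    Table 3. Examples of I-regular expression and its regular language

    I-regular expression r | Regular language L(r)
    $\varepsilon$ | $\{\varepsilon\}$
    $m$ | $\{m\}$
    $v^{+}$ | $\{v^{d} \mid 0 < d\}$
    $v^{+} m v^{+}$ | $\{v^{d_1} m v^{d_2} \mid 0 < d_1,\ 0 < d_2\}$
    $v^{+} \mid m$ | $\{v^{d}, m \mid 0 < d\}$
    $m^{*}$ | $\{\varepsilon, m, mm, mmm, \ldots\}$
    $v^{+}_{[5.7,\,5.7]}$ | $\{v^{5.7}\}$
    $v^{+}_{(0,\,2.93)}$ | $\{v^{d} \mid 0 < d < 2.93\}$
    $(v^{+}_{[0.77,\,0.77]})^{*}$ | $\{\varepsilon, v^{0.77}, v^{0.77}v^{0.77}, v^{0.77}v^{0.77}v^{0.77}, \ldots\} = \{\varepsilon, v^{0.77}, v^{1.54}, v^{2.31}, \ldots\}$
    $(v^{+}_{(7.18,\,\infty)})^{*}$ | $\{\varepsilon, v^{d_{11}}, v^{d_{21}}v^{d_{22}}, v^{d_{31}}v^{d_{32}}v^{d_{33}}, \ldots \mid 7.18 < d_{ij}\} = \{\varepsilon, v^{e} \mid 7.18 < e\}$
    $(v^{+}_{(6,\,8]})^{*}$ | $\{\varepsilon, v^{d_{11}}, v^{d_{21}}v^{d_{22}}, v^{d_{31}}v^{d_{32}}v^{d_{33}}, \ldots \mid 6 < d_{ij} \le 8\} = \{\varepsilon, v^{e_1}, v^{e_2}, v^{e_3} \mid 6 < e_1 \le 8,\ 12 < e_2 \le 16,\ 18 < e_3\}$
    $(v^{+}_{[2.51,\,2.51]}\, m)^{*}\, n$ | $\{n, v^{2.51} m n, v^{2.51} m v^{2.51} m n, \ldots\}$
    $(v^{+}_{[2,\,2]}\, m m\, v^{+}_{(0,\,1)})^{*}$ | $\{\varepsilon, v^{2} m m v^{e_1}, v^{2} m m v^{d_{21}} m m v^{e_2}, \ldots \mid 2 < d_{ij} < 3,\ 0 < e_i < 1\}$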

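    Show me the portion where c grabs the top from b, but b takes it back in less than 3 min.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*} \left( b^{+}\, c^{+}_{(0,\,3)}\, b^{+} \right) (a^{+} \mid b^{+} \mid c^{+})^{*}. \tag{6}$$

    Does c keep the lead for more than 10 min?

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; c^{+}_{(10,\,\infty)}\; (a^{+} \mid b^{+} \mid c^{+})^{*}. \tag{7}$$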
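    For comparison, the three race queries above can be answered by scanning the annotation directly. The sketch below is not an I-regular expression matcher; it hard-codes the meaning of patterns (5), (6), and (7), assumes the I-string is in I-normal form, and uses hypothetical function names.

```python
# I-string (4) as (leader, minutes) pairs; it is already in I-normal form.
race = [("c", 8), ("a", 12), ("b", 4.5), ("c", 1.8), ("b", 0.5),
        ("a", 6.3), ("c", 11), ("b", 14.8), ("c", 2)]

def wins(istring, car):
    """Pattern (5): the leader at the end of the race is the winner."""
    return istring[-1][0] == car

def brief_takeover(istring, holder, challenger, limit):
    """Pattern (6): challenger takes the lead, holder regains it within limit."""
    triples = zip(istring, istring[1:], istring[2:])
    return any(s1 == holder and s2 == challenger and s3 == holder
               and 0 < d2 < limit
               for (s1, _), (s2, d2), (s3, _) in triples)

def leads_longer_than(istring, car, minutes):
    """Pattern (7): some uninterrupted lead of `car` exceeds `minutes`."""
    return any(sym == car and dur > minutes for sym, dur in istring)

print(wins(race, "b"))                    # False: c leads at the end
print(brief_takeover(race, "b", "c", 3))  # True: b4.5 c1.8 b0.5
print(leads_longer_than(race, "c", 10))   # True: c leads for 11 min
```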

    For the above query, you might want to watch the scene in which c is the leader. For an I-string $\sigma \in L(r)$, r can be used to designate substrings of interest for extraction from $\sigma$. The substring reference and the matching directives are used for this purpose. To refer to a substring which matched a subpattern s, we use the reference

    $$s\langle X \rangle, \tag{8}$$


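    where X is an arbitrary name for this reference. For instance, after matching the following I-regular expression, the matched substring can be referred to by X:

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; c^{+}_{[10,\,\infty)}\langle X \rangle\; (a^{+} \mid b^{+} \mid c^{+})^{*}. \tag{9}$$

    Show me the heated battle of b and c.

    Show me the portion where c grabs the top from b.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; (b^{+} c^{+})\langle X \rangle\; (a^{+} \mid b^{+} \mid c^{+})^{*}. \tag{10}$$

    Show me the portion where b or c grabs the top from the other.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; (b^{+} c^{+} \mid c^{+} b^{+})\langle X \rangle\; (a^{+} \mid b^{+} \mid c^{+})^{*}. \tag{11}$$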

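    Show me the portion where b or c runs on top, and each keeps the top for less than 10 min.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; a^{+} \left( b^{+}_{(0,\,10)} \mid c^{+}_{(0,\,10)} \mid U \mid V \right)\langle X \rangle\; a^{+}\; (a^{+} \mid b^{+} \mid c^{+})^{*}, \tag{12}$$

    $$U \equiv b^{+}_{(0,\,10)}\, c^{+}_{(0,\,10)} \left( b^{+}_{(0,\,10)}\, c^{+}_{(0,\,10)} \right)^{*} \left( b^{+}_{(0,\,10)} \mid \varepsilon \right),$$

    $$V \equiv c^{+}_{(0,\,10)}\, b^{+}_{(0,\,10)} \left( c^{+}_{(0,\,10)}\, b^{+}_{(0,\,10)} \right)^{*} \left( c^{+}_{(0,\,10)} \mid \varepsilon \right).$$

    Intuitively, U represents alternating sequences starting with $b^{+}$:

    $$\{ b^{+}c^{+},\ b^{+}c^{+}b^{+},\ b^{+}c^{+}b^{+}c^{+},\ b^{+}c^{+}b^{+}c^{+}b^{+},\ \ldots \},$$

    and V represents similar ones starting with $c^{+}$.

    Show me the winner.

    $$(a^{+} \mid b^{+} \mid c^{+})^{*}\; (a^{+} \mid b^{+} \mid c^{+})\langle X \rangle. \tag{13}$$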

    4.3 Further Examples of I-Regular Expression

    Soccer Game

    The video of a soccer game can be encoded into an I-string. Let an I-symbol a denote that team A controls the ball, and b that team B controls it. If neither controls the ball, an I-symbol c is used. A D-symbol g is used to mark a goal. We assume that the team which controls the ball just before the goal gains the point. For example, the code might be

    $$a^{8}\, b^{4}\, g\, c^{1}\, a^{3}\, b^{2}\, c^{1}\, a^{7}\, g\, c^{1}\, a^{4}\, b^{3}\, a^{5}\, b^{5}\, c^{1}. \tag{14}$$

    Now, we show that various queries are expressible as I-regular expressions. We use $U \equiv (a^{+} \mid b^{+} \mid c^{+})^{*}$ to make the expressions easy to understand.

    Show me the first goal of the game.

    $$U\, g\langle X \rangle\, (g \mid U)^{*}. \tag{15}$$

    Show me the first goal of team A.

    $$U\, a^{+} g\langle X \rangle\, (g \mid U)^{*}. \tag{16}$$

    Show me the second goal of the game with 15 s before the goal and 30 s after the goal. The requested range may be truncated if the goal is just after the start of the game or just before the end of the game.

    $$U\, g\, U \left( U_{(0,\,0.25]}\; g\; U_{(0,\,0.5]} \right)\langle X \rangle\, (g \mid U)^{*}. \tag{17}$$

    Find the goal in the time-out extension. Equivalently in I-regular expression, we can say "find the goal after 45 min from the start of the game." This I-regular expression will match even if no such goal is present in the I-string, but in that case nothing will be assigned to X.

    $$\left( (U \mid g)^{*} \right)_{[45,\,45]} \left( U \mid g\langle X \rangle \right)^{*}. \tag{18}$$

    Find two goals in less than 10 min.

    $$(g \mid U)^{*} \left( g\, U\, g \right)_{(0,\,10)} (g \mid U)^{*}. \tag{19}$$

    The match/fail corresponds to Yes/No for this query.
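    To make the scoring convention concrete, the following sketch walks through I-string (14), records when each goal happens and which team controlled the ball just before it, and then answers query (19) directly. The function names are hypothetical, and the scan is a plain loop rather than an I-regular expression match.

```python
# I-string (14): (symbol, minutes) pairs; the D-symbol g has duration None.
match = [("a", 8), ("b", 4), ("g", None), ("c", 1), ("a", 3), ("b", 2),
         ("c", 1), ("a", 7), ("g", None), ("c", 1), ("a", 4), ("b", 3),
         ("a", 5), ("b", 5), ("c", 1)]

def goal_times(istring):
    """Return (time of goal, symbol in control just before it) per goal."""
    goals, clock, last = [], 0, None
    for sym, dur in istring:
        if dur is None:
            if sym == "g":
                goals.append((clock, last))
        else:
            clock += dur
            last = sym
    return goals

def two_goals_within(goals, minutes):
    """Query (19): two successive goals less than `minutes` apart."""
    times = [t for t, _ in goals]
    return any(0 < later - earlier < minutes
               for earlier, later in zip(times, times[1:]))

goals = goal_times(match)
print(goals)                        # [(12, 'b'), (26, 'a')]
print(two_goals_within(goals, 10))  # False: the goals are 14 min apart
```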

    Electrocardiographic Diagnosis

    The record of an electrocardiogram can be encoded into an I-string using D-symbols for the notable peaks and an I-symbol $v^{+}$ as the time filler (Fig. 5). For example, the code might be

    $$v^{200}\, p\, v^{89}\, q\, v^{50}\, r\, v^{23}\, s\, v^{270}\, t\, v^{180}\, p\, v^{90}\, q\, v^{57}\, r\, v^{19}\, s\, v^{260}\, t. \tag{20}$$



    Fig. 5. An I-string example for ECG

    Here, we show that various conditions can be expressed by I-regular expressions. Let $U \equiv (p \mid q \mid r \mid s \mid t \mid \cdots)$, $V \equiv (U \mid v^{+})^{*}$, and $R \equiv (p \mid q \mid s \mid t \mid \cdots)$ (all D-symbols except r).

    Find the rapid heart beats. Equivalently in I-regular expression, find the portion where the time interval between successive r's is at most 400 ms.

    $$V \left( r\, (R \mid v^{+})^{*}\, r \right)_{(0,\,400]} V. \tag{21}$$

    Find the portion of the heart failure. Equivalently in I-regular expression, find the portion where three R-to-R intervals which are longer than 600 ms are followed by an R-to-R interval which is at most 400 ms long.

    $$V \left( r\, (R \mid v^{+})^{*}\, r \right)_{(600,\,\infty)} \left( (R \mid v^{+})^{*}\, r \right)_{(600,\,\infty)} \left( (R \mid v^{+})^{*}\, r \right)_{(600,\,\infty)} \left( (R \mid v^{+})^{*}\, r \right)_{(0,\,400]} V. \tag{22}$$
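    The R-to-R intervals that patterns (21) and (22) constrain can be read off the I-string with a simple scan. The sketch below applies this to I-string (20); the function name rr_intervals is an assumption made here, and durations are taken to be in milliseconds as in the text.

```python
# I-string (20): v is the time filler (ms); peaks are D-symbols (None).
ecg = [("v", 200), ("p", None), ("v", 89), ("q", None), ("v", 50), ("r", None),
       ("v", 23), ("s", None), ("v", 270), ("t", None), ("v", 180), ("p", None),
       ("v", 90), ("q", None), ("v", 57), ("r", None), ("v", 19), ("s", None),
       ("v", 260), ("t", None)]

def rr_intervals(istring):
    """Elapsed time between each pair of successive r peaks."""
    intervals, clock, last_r = [], 0, None
    for sym, dur in istring:
        if dur is not None:
            clock += dur          # I-symbol: time passes
        elif sym == "r":
            if last_r is not None:
                intervals.append(clock - last_r)
            last_r = clock
    return intervals

rr = rr_intervals(ecg)
print(rr)                              # [620]
print(any(0 < d <= 400 for d in rr))   # False: no interval in (0, 400]
```

    The single interval is 23 + 270 + 180 + 90 + 57 = 620 ms, so this record does not match pattern (21).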

    4.4 Matching Preference Directive

    Selection from Multiple Solutions

    Now let us consider the situation in which the constraints for the accepted path (see Sect. 5.2 for the definition) have multiple solutions, so that the I-lengths cannot be decided uniquely. Since an I-symbol may be arbitrarily partitioned, and vice versa, the result of I-regular expression matching may contain ambiguity in I-length. One typical example is the pattern $v^{+} v^{+}$. When this pattern matches an I-string $v^{5}$, the first $v^{+}$ can take an arbitrary I-length between 0 and 5. Before extracting substrings, such I-lengths should be settled.

    For this purpose, the optional directives long and short declare the preference in the I-length. A sub-I-regular expression with the long directive is assigned the longest possible substring, and one with the short directive is assigned the shortest possible substring. For instance, when an I-regular expression

    $$v^{+}_{[2,\,\infty)}\; (v^{+})_{\text{long}} \tag{23}$$

    matches an I-string $v^{5}$, the latter part takes the priority of getting the longest possible substring, $v^{3}$, leaving the shortest substring, $v^{2}$, for the former part. Let the I-length of $v^{+}_{[2,\infty)}$ be x and the I-length of $(v^{+})_{\text{long}}$ be y. The search process generates the constraints $x + y = 5$ and $x \ge 2$ at its success. As we express our preference that y should be as long as possible, we have y = 3 and x = 2 as the solution. The shortest I-length without a lower bound is an implementation-dependent small value predefined by the user, say 0.0001. When an I-regular expression

    $$v^{+}\; (v^{+})_{\text{long}} \tag{24}$$

    matches an I-string $v^{5}$, the substrings for the former and latter parts would be $v^{0.0001}$ and $v^{4.9999}$, respectively. For the input I-string $w^{l}$ and the I-regular expression $w^{+}$, the sequence of transitions from the initial state to the final state may become arbitrarily long, because the time duration for each match may become arbitrarily small. In the next section we show a more tractable depth-first search model.
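    The two worked cases above amount to solving x + y = total with y as large as possible. A minimal sketch, assuming a two-part pattern whose second part carries the long directive and whose first part has only a lower bound, is given below; the function name and the epsilon parameter (standing in for the user-defined small value) are assumptions.

```python
def resolve_long_second(total, first_lower, lower_is_open, epsilon=0.0001):
    """Split `total` into (x, y) with x + y = total and y as long as possible.

    x is the I-length of the first part and must respect its lower bound.
    An open lower bound has no smallest admissible value, so the predefined
    small value epsilon is used instead.
    """
    x = first_lower + epsilon if lower_is_open else first_lower
    y = total - x
    if y <= 0:
        return None  # the second part would not get a positive I-length
    return x, y

print(resolve_long_second(5, 2, lower_is_open=False))  # (2, 3)
print(resolve_long_second(5, 0, lower_is_open=True))   # about (0.0001, 4.9999)
```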

    5 I-String Recognition by I-Automaton

    In this section, we give a declarative definition of string acceptance/rejection by an NFA in terms of a path. The language defined by a conventional regular expression can be recognized by a nondeterministic finite automaton (NFA). We adopt the same scheme for I-regular expressions. An I-regular expression is translated into an equivalent nondeterministic finite I-automaton (I-NFA), which is an extension of the conventional NFA. I-string recognition is done using this I-NFA. We first review the conventional regular expression and NFA, and then extend them to the I-NFA.

    5.1 Conventional Nondeterministic Finite Automaton (NFA)

    The conventional nondeterministic finite automaton [6] is defined as

    $$(Q, \Sigma, \delta, q_0, F), \tag{25}$$

    where Q is a set of states, $\Sigma$ is the alphabet, $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to 2^{Q}$ is the state transition function, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states. An automaton can change its state from q to q' if $q' \in \delta(q, s)$ by reading a symbol

    [Figure 6 diagrams omitted. Rules shown: for $(r_1 \mid r_2)$, let the two automata share the state i of $r_1$ and $r_2$, and the state f of $r_1$ and $r_2$; for $(r_1 r_2)$, let the state f of $r_1$ be the state i of $r_2$; for $(r)^{*}$, make new states i and f, and add $\varepsilon$-transitions.]

    Fig. 6. Translation rules from a regular expression to an NFA

    s or without reading any symbol (transition by $\varepsilon$). This is called a state transition, and the symbol s is its transition symbol.

    For a given regular expression r, an NFA that recognizes the language L(r) can be obtained by recursively applying the rules shown in Fig. 6. In a diagram of an NFA, each state is drawn as a small circle. We may draw its identifier, for example $q_3$, within the circle when necessary. The initial state and the final states are drawn as double circles, labeled with i and f, respectively. An arrow between two states represents a possible state transition defined by $\delta$. Its transition symbol is written along the arrow. An enclosed region labeled r, $r_1$, or $r_2$ represents a sub-NFA.

    The NFA produced by the translation from a conventional regular expression r recognizes L(r). This can be proved by induction on the construction of the regular expression.

    5.2 Conventional String Recognition by NFA

    For the conventional finite automaton, the acceptance/rejection of an input string is defined using a path, which is a sequence of states and transitions. A path is a track of state transitions on an NFA. A sequence is a path on an NFA iff each state in the sequence, except for the first one, is a result of the transition function applied to the previous state $q_i$ with a transition symbol $t_{i+1}$ ($q_{i+1} \in \delta(q_i, t_{i+1})$). For example, the following is a path from state $q_0$ to $q_5$:

    i | 0 | 1 | 2 | 3 | 4 | 5
    State | $q_0$ | $q_2$ | $q_1$ | $q_4$ | $q_2$ | $q_5$
    Transition symbol | $s_1$, $s_2$, $s_3$, $s_4$ (plus one $\varepsilon$-transition)

    where each transition is defined like $q_2 \in \delta(q_0, s_1)$, $q_1 \in \delta(q_2, s_2)$, .... An input string $s_1 s_2 \cdots s_m$ ($s_i \in \Sigma$) is accepted iff there is a path that satisfies all the following conditions:

    1. The first state $q_0$ of the path is the initial state of the NFA.
    2. The last state of the path is in the set of final states F ($q_n \in F$).
    3. The sequence of transition symbols is equivalent to the input string. In the above example, the input string $s_1 s_2 s_3 s_4$ is equivalent to the sequence of transition symbols of the path, since $\varepsilon$ means the empty string.
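    The three conditions can be checked without enumerating paths one by one, by tracking the set of states reachable after each input symbol. The sketch below takes this standard set-based approach, which accepts exactly when an accepting path exists; the dictionary encoding of $\delta$ (with None standing for $\varepsilon$) and the example automaton are assumptions made for illustration.

```python
def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon (None) moves."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, None), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def accepts(delta, q0, finals, word):
    """True iff some path from q0 spells `word` and ends in a final state."""
    current = epsilon_closure({q0}, delta)
    for symbol in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(moved, delta)
    return bool(current & finals)

# Hypothetical NFA for the conventional regular expression (a | b)* b.
delta = {("q0", "a"): {"q0"}, ("q0", "b"): {"q0", "q1"}}
print(accepts(delta, "q0", {"q1"}, "aab"))  # True
print(accepts(delta, "q0", {"q1"}, "aba"))  # False
```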

    5.3 Nondeterministic Finite I-Automaton (I-NFA)

    Now, we are ready to introduce the I-automaton. The I-automaton adds two constructs to the conventional finite automaton [2]. An I-automaton is defined as

    $$(Q, (\Sigma, \Gamma^{+}), \delta, q_0, F, \Omega), \tag{26}$$

    where we use the symbols in $\Gamma^{+}$ as transition symbols in addition to $\Sigma \cup \{\varepsilon\}$, and $\Omega = \{(q_i, q_f, Q_i, \Lambda), \ldots\}$ is a set of I-constraints.

    To graphically represent an I-constraint, we draw a dotted box around $Q_i$, and place $q_i$ and $q_f$ on the border. By definition, if the I-automaton is not trivial, $q_i$ has a transition from the outside of the dotted box, and $q_f$ has a transition to the outside of the dotted box. So, we can distinguish $q_i$ and $q_f$ in the diagram. $\Lambda$ is placed just below the bottom right corner of the dotted box. For example, in Fig. 8, the I-length from the state $q_2$ to $q_3$ should be greater than 0 and less than or equal to 30.

    For a given I-regular expression r, an I-automaton which recognizes the language L(r) can be obtained by recursively applying the rules shown in Fig. 7. A translation example is shown in Fig. 8. The I-automaton produced by the translation from an I-regular expression r recognizes L(r). This can be proved by induction on the construction of the I-regular expression.

    An I-constraint $(q_i, q_f, Q_i, \Lambda)$ should satisfy the following conditions:

    [Figure 7 diagram omitted: the transition added for an I-symbol $v^{+}$ between the states i and f.]

    Fig. 7. Extended translation rules from I-regular expression to I-automaton

    [Figure 8 diagram omitted: an I-regular expression built from $v^{+}$, g, and $(v^{+} \mid g)$, and the translated I-automaton with states $q_1$ to $q_9$ and the I-constraints $(0, 30]$, $(0, 10]$, and $[15, \infty)$.]

    Fig. 8. I-regular expression and translated equivalent I-automaton

    - $q_i, q_f \in Q_i$, $Q_i \subseteq Q$, and $q_i \ne q_f$. $q_i$ and $q_f$ are the entrance and exit states, respectively.
    - $\Lambda$ is a non-negative real interval, and each end of the interval may be open or closed, independently.
    - All transitions from the outside of $Q_i$ (that is, $Q \setminus Q_i$) into $Q_i$ should be to the entrance state $q_i$.
    - All transitions from $Q_i$ to the outside (that is, $Q \setminus Q_i$) should be from the exit state $q_f$.
    - For any $(q_i, q_f, Q_i, \Lambda)$ and $(q_i', q_f', Q_i', \Lambda') \in \Omega$, if $Q_i \cap Q_i' \ne \emptyset$ then either $Q_i \subseteq Q_i'$ or $Q_i' \subseteq Q_i$ holds.

    Accepting/Rejecting an I-String

    An I-path is a path with an I-length for each transition. For example, the following is an I-path:

    i | 0 | 1 | 2 | 3 | 4 | 5
    State | $q_0$ | $q_2$ | $q_1$ | $q_4$ | $q_2$ | $q_5$
    Transition symbol $t_i$ | $v^{+}$, $v^{+}$, m, $w^{+}$ (plus one $\varepsilon$-transition)
    I-length $x_i$ | 5.5 | 0.2 | 0 | 0 | 1.3

    For an I-automaton $(Q, (\Sigma, \Gamma^{+}), \delta, q_0, F, \Omega)$ where $\Omega = \{(q_i, q_f, Q_i, \Lambda), \ldots\}$, an input I-string is accepted by the I-automaton iff there exists an I-path p that satisfies the following conditions; otherwise, it is rejected.

    - The first state of p is the initial state $q_0$.
    - $q_{i+1} \in \delta(q_i, t_{i+1})$.
    - The input I-string is symbol-equivalent to the sequence of transition symbols $t_1 t_2 \cdots t_n$. An I-string and a sequence of transition symbols are symbol-equivalent iff they become the same sequence by the following normalization:
      - Replace every I-length in the I-string with +, resulting in a sequence of symbols of $\Sigma \cup \Gamma^{+}$. For example, the I-string $v^{3.8}\, v^{0.4}\, m\, w^{2.1}$ becomes the sequence $v^{+} v^{+} m w^{+}$.
      - Replace successive occurrences of identical I-symbols, for instance $v^{+} v^{+}$, with one symbol $v^{+}$ (and remove $\varepsilon$) in both sequences. This is similar to the I-normal form for an I-string (see Sect. 3.2).
      For example, the I-string $v^{3.8}\, v^{0.4}\, m\, w^{2.1}$ is symbol-equivalent to the sequence of transition symbols $v^{+} m w^{+} w^{+}$, since both of them are normalized to $v^{+} m w^{+}$. As another example, $a^{7}\, b^{3}\, b^{5}\, m\, a^{2}$ is symbol-equivalent to $a^{+} b^{+} m a^{+}$.
    - The last state of p is one of the final states F.
    - For any subsequence $q_j t_{j+1} q_{j+1} \cdots t_k q_k$ with $q_{j-1} \ne q_j$ and $q_{k+1} \ne q_k$, and an I-constraint $(q_j, q_k, Q_i, \Lambda) \in \Omega$, if $q_{j-1} \notin Q_i$ and $q_{k+1} \notin Q_i$ then $I(t_{j+1} \cdots t_k) \in \Lambda$.
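    The symbol-equivalence test can be sketched as follows. The encoding is an assumption made for this example: an I-string is a list of (symbol, duration) pairs with None for D-symbols, a transition sequence is a list of strings in which I-symbols end with "+", and None stands for $\varepsilon$.

```python
def normalize(symbols):
    """Drop epsilon (None) and collapse runs of one identical I-symbol."""
    out = []
    for s in symbols:
        if s is None:
            continue                      # remove epsilon
        if out and out[-1] == s and s.endswith("+"):
            continue                      # v+ v+ collapses to v+
        out.append(s)                     # D-symbols are never collapsed
    return out

def symbol_equivalent(istring, transitions):
    """Compare an I-string with a sequence of transition symbols."""
    as_symbols = [sym + "+" if dur is not None else sym
                  for sym, dur in istring]
    return normalize(as_symbols) == normalize(transitions)

sigma = [("v", 3.8), ("v", 0.4), ("m", None), ("w", 2.1)]
print(symbol_equivalent(sigma, ["v+", "m", None, "w+", "w+"]))  # True
print(symbol_equivalent(sigma, ["v+", "m", "m", "w+"]))         # False
```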

    6 I-String Recognition Algorithm

    6.1 Recognition by Depth-First Path Enumeration

    The state transition of a conventional NFA is discrete. On an I-NFA, on the other hand, the I-length of each transition should be taken into consideration in addition to the discrete transitions. This means that the number of possible I-paths might be infinite. In order to make the path search tractable, we introduce an algorithm in which each input symbol is handled by state transitions, and the I-length is handled by linear inequality constraints. Starting from the initial state $q_0$ of the I-NFA, the recognition algorithm repeats state transitions in a depth-first manner to enumerate symbol-equivalent paths, reading the input symbols $s_1, s_2, \ldots, s_m$ one by one. Each time the algorithm makes a transition, the satisfiability of the constraints on I-lengths is checked. If they are not satisfiable, that branch of the symbol-equivalent path enumeration fails and the algorithm backtracks. When all the symbols have been read, if the state is one of the final states F, then the input I-string is accepted. If the input I-string is not accepted under any nondeterministic choice of transitions, the input I-string is rejected.

    In the algorithm, we use an extended I-path: some I-lengths can be left as variables, and constraints on those variables may be written. We call this an I-path with constraints. In the following, we assume that the input I-string is in I-normal form. Thus, two adjacent input symbols are never the same I-symbol.

    I-String Recognition Algorithm

    We assume that the input I-string is $s_1 s_2 \cdots s_m$ in the following.

    1. Initialization
       (a) Candidate path set: let the candidate path set be the $\varepsilon$-closure from the initial state $q_0$, then choose one of them as the current path p.
       (b) The symbol in focus: let the symbol in focus be $s_i$ (i = 1).
       (c) Constrained variables: let the current set of constraints be empty. Then, for each I-symbol $s_i$, prepare a variable $p_i$ representing the length of that symbol, and add the constraint $p_i \le I(s_i)$. During its execution, the algorithm adds another constraint $p_i = I(s_i)$ after completing the state transitions for $s_i$, to make $p_i$ exactly the same as the I-length $I(s_i)$ of the symbol.
    2. Symbol-equivalent path enumeration: for the symbol $s_i$, pick one possible transition from the end of the current path p which has not yet been tried.
       (a) If no such transition remains, then backtrack.
       (b) If the new state has already been visited and the solution of the current constraints is subsumed by the previous constraints, then backtrack (see Sect. 6.2).
    3. I-length assignment for each transition: if the transition symbol of the transition is an I-symbol, create a variable $x_j$ associated with the transition to represent the possible I-length. Add a constraint that the variable is positive and is less than or equal to the I-length of the input symbol: $0 < x_j \le I(s_i)$. Also, update the constraint for $p_i$ to accumulate the I-length for symbol $s_i$, so that $p_i = (\text{former } p_i) + x_j$.
    4. I-constraint handling:
       (a) For each I-constraint $\omega_k$, the I-length expression associated with the I-constraint is increased by the variable $x_j$ newly introduced for the transition. If there is no solution to the constraints, then backtrack.
       (b) If the new state goes into an I-constraint $\omega_k$, then a variable $z_k^{(1)}$ for the I-length inside $\omega_k$ is generated. If the same I-constraint is revisited, a new variable, say $z_k^{(2)}$, is generated. Then, the constraint $z_k^{(1)} < \mathrm{upperBound}(\omega_k)$ or $z_k^{(1)} \le \mathrm{upperBound}(\omega_k)$ is added, depending on the specification of the upper bound of $\omega_k$. Check whether the whole constraint system can have any solution. If there is no solution to the constraints, then backtrack.
       (c) If the new state goes out of an I-constraint $\omega_k$, then the constraint $\mathrm{lowerBound}(\omega_k) < z_k^{(1)}$ or $\mathrm{lowerBound}(\omega_k) \le z_k^{(1)}$ is added, depending on the specification of the lower bound of $\omega_k$. Check whether the whole constraint system can have any solution. If there is no solution to the constraints, then backtrack.
    5. Advance the focus to the next input symbol $s_{i+1}$.
       (a) Acceptance/rejection check: if the new state is a final state, all the input symbols have been read, and the constraints have a solution, then terminate the execution and accept the input I-string.
       (b) If the previous symbol was an I-symbol, add the constraint that the lower bound of the variable $p_i$ associated with the previous I-symbol is the I-length of the symbol, which gives $p_i = I(s_i)$. If there is no solution to the constraints, then backtrack.
    6. Repeat the algorithm.

    If backtracking from the initial state occurs, there is no more choice, so the input I-string is rejected.

    6.2 Redundant Path Enumeration Cut-Off

    The above algorithm enumerates paths, and checks whether each of them satisfies the constraints. At any point of the path enumeration, the remaining behavior from that point is determined by the state tuple of the algorithm:

    $$(s_i, q_j, C(p_i, z_{h_1}, z_{h_2}, \ldots)), \tag{27}$$

    where $s_i$ is the symbol in focus, $q_j$ is the current state of the I-NFA, and $C(p_i, z_{h_1}, z_{h_2}, \ldots)$ is a set of constraints on $p_i$ (the I-length assigned so far for $s_i$) and on $z_{h_1}, z_{h_2}, \ldots$ (the I-lengths assigned so far for the currently open I-constraints). Since these I-constraints are open, $q_j$ should be inside them. The search in the algorithm will be redundant if the same or a subsumed state tuple (27) appears more than once.

    The simplest example is Fig. 9a. Suppose that the algorithm is processing the second symbol $s_2 = v^{7.2}$ of the I-normal form of an I-string $w^{4} v^{7.2}$, and the transition for $s_2$ started from state $q_1$. When the algorithm comes to $q_1 \to q_2 \to q_3$, the state tuple of the algorithm will be:

    $$(s_2, q_3, \{0 < p_2 \le I(v^{7.2}) = 7.2\}). \tag{28}$$

    $0 < p_2$ is implied by the definition of the I-length of an I-symbol, while $p_2 \le I(v^{7.2}) = 7.2$ comes from the I-length of $s_2$. No $z_{h_k}$ appears in the constraints because no I-constraint appears in this I-NFA. When the algorithm advances to the point of $q_1 \to q_2 \to q_3 \to q_2 \to q_3$, the state tuple of the algorithm would be the same as (28). So, the rest of the search from the latter path will be cut off, since it is redundant.

    For the I-NFA shown in Fig. 9b, the constraint on $p_2$ seems to change forever, like:

    $$1 \le p_2 \le 4, \quad 2 \le p_2 \le 8, \quad 3 \le p_2 \le 12, \quad \ldots. \tag{29}$$

    But, by taking the upper bound $I(v^{7.2})$ into consideration, the algorithm would proceed as follows:

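    [Figure 9 diagrams omitted: two I-NFAs, (a) and (b), each with states $q_1$ to $q_4$ and a $v^{+}$ transition; the diagram residue also shows an I-constraint $[1, 4]$.]

    Fig. 9. Examples of redundant path enumeration cut-off (1)

    [Figure 10 diagram omitted: an I-NFA with $v^{+}$ transitions and the I-constraints $[1, 4]$, $[2, 3]$, $(2, 5]$, $(3, 4]$, and $[5, 17)$.]

    Fig. 10. Example of redundant path enumeration cut-off (2)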

    $$\begin{aligned} q_1 \to q_2 \to q_3 &: (s_2, q_3, \{1 \le p_2 \le 4\}), \\ q_1 \to q_2 \to q_3 \to q_2 \to q_3 &: (s_2, q_3, \{2 \le p_2 \le I(v^{7.2}) = 7.2\}), \\ q_1 \to q_2 \to q_3 \to q_2 \to q_3 \to q_2 \to q_3 &: (s_2, q_3, \{3 \le p_2 \le I(v^{7.2}) = 7.2\}). \end{aligned} \tag{30}$$

    Since the range $3 \le p_2 \le I(v^{7.2}) = 7.2$ is subsumed by the previous one, $2 \le p_2 \le I(v^{7.2}) = 7.2$, the rest of the search will be cut off.

    If state $q_j$ is within a still-open I-constraint, the constraints on the I-length for each I-constraint should be taken into account, in addition to the constraints on $p_i$. This situation is illustrated by comparing two paths of the I-NFA in Fig. 10. Both the upper path

    $$q_1 \to q_2 \to q_3 \to q_6 \to q_7 \to q_8 \to q_{11} \tag{31}$$

    and the lower path

    $$q_1 \to q_4 \to q_5 \to q_6 \to q_9 \to q_{10} \to q_{11} \tag{32}$$

    have the same end state $q_{11}$, which is in the still-open, outer I-constraint. Let $x_2, x_4, x_7, x_9$ be the I-lengths assigned for $q_2 \to q_3$, $q_4 \to q_5$, $q_7 \to q_8$, $q_9 \to q_{10}$, respectively, and z be the I-length assigned so far for the outer I-constraint. The set of constraints for the upper path would be

    $$1 \le x_2 \le 4, \quad 3 < x_7 \le 4, \quad 0 < p_2 \le I(s_2), \quad p_2 = x_2 + x_7, \quad z < 17, \quad z = x_7, \tag{33}$$

    while that for the lower path would be

    $$2 \le x_4 \le 3, \quad 2 < x_9 \le 5, \quad 0 < p_2 \le I(s_2), \quad p_2 = x_4 + x_9, \quad z < 17, \quad z = x_9. \tag{34}$$

    These two paths are not regarded as redundant, since the solutions of (33) and (34) with respect to $(p_2, z)$ are not the same. Notice that if these constraints are solved just with respect to $p_2$, both solutions will be $4 < p_2 \le I(s_2) = 7.2$.

    The lower bound of the outer I-constraint, $5 \le z$, is not included because the I-constraint is still open. This constraint will be added when the path goes out of it. Similarly, the constraint $p_2 = I(s_2)$ will be added when the algorithm tries to proceed to the next symbol $s_3$, that is, when it terminates the process for $s_2$ at state $q_{11}$.
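    The claim that both paths give the same range for $p_2$ but different ranges for z can be checked with a little interval arithmetic. The sketch below is a simplification: it computes per-variable ranges with a hypothetical Iv class, rather than solving the joint linear constraint system the algorithm maintains, and it takes $I(s_2) = 7.2$ from the running example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Iv:
    lo: float
    hi: float
    lo_closed: bool
    hi_closed: bool

def add(a, b):
    """Range of x + y for x in a and y in b (x, y chosen independently)."""
    return Iv(a.lo + b.lo, a.hi + b.hi,
              a.lo_closed and b.lo_closed, a.hi_closed and b.hi_closed)

def meet(a, b):
    """Intersection of two overlapping intervals."""
    if a.lo == b.lo:
        lo, lo_closed = a.lo, a.lo_closed and b.lo_closed
    else:
        lo, lo_closed = max((a.lo, a.lo_closed), (b.lo, b.lo_closed))
    if a.hi == b.hi:
        hi, hi_closed = a.hi, a.hi_closed and b.hi_closed
    else:
        hi, hi_closed = min((a.hi, a.hi_closed), (b.hi, b.hi_closed))
    return Iv(lo, hi, lo_closed, hi_closed)

p_bound = Iv(0, 7.2, False, True)                       # 0 < p2 <= I(s2)
x2, x7 = Iv(1, 4, True, True), Iv(3, 4, False, True)    # upper path (33)
x4, x9 = Iv(2, 3, True, True), Iv(2, 5, False, True)    # lower path (34)

p_upper = meet(add(x2, x7), p_bound)
p_lower = meet(add(x4, x9), p_bound)
print(p_upper == p_lower)  # True: both are 4 < p2 <= 7.2
print(x7 == x9)            # False: z = x7 and z = x9 have different ranges
```

    Because the z ranges differ, neither state tuple subsumes the other on $(p_2, z)$, so neither branch can be cut off.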


    Fig. 11. I-automaton for I-regular expression r

    6.3 I-String Recognition Example

    Now let us see the matching process for an I-regular expression against an I-string. Let the I-regular expression be

    $$v^{+} \left( v^{+}_{(0,\,30]}\; g\; v^{+}_{(0,\,10]} \right) (v^{+} \mid g)^{*},$$

    and the I-string be $v^{50}\, g\, v^{100}\, g\, v^{200}$. First, we translate the I-regular expression into the I-automaton shown in Fig. 11 (illustrated using a simplified diagram). In this I-automaton, some transitions are omitted for simplicity, and the transition by the symbol $v^{+}$ with the interval constraint $(0, 30]$ is shown as if it were a transition by the symbol $v^{(0,30]}$.

    The I-string recognition algorithm works for the input I-string as shown in Table 4. In the table, the execution advances from top to bottom. The symbol in focus is shown in the leftmost column. The chosen transition is shown in the center column. Some transitions are combined into a single row in order to save space. The variable over a transition arrow represents the I-length associated with the transition. "fail" shows that there is no solution for the constraints or no candidate for the transition. Constraints are shown in the right-hand column. A new constraint is shown in the center of that column. The number on the left of the column is used to refer to the constraint. All the constraints effective at that point of the execution are shown by their numbers on the right of the column.

    7 Conclusion

    In this chapter, we modeled continuous media as I-strings. As a pattern specification language on I-strings we introduced the I-regular expression, and as the machine which recognizes the language, we introduced the I-automaton.

    The I-regular expression provides a relatively simple but expressive means to specify patterns on a mixture of continuous and discrete notions. The continuous notion (time duration) is handled by the constraint system, and the discrete notion (D-symbol and symbol character) is handled by the state machine. The preference on I-length enables the user to control the matching preference.

    The limitation of the model is that for each positive interval, multiple attributes cannot be associated with the media explicitly. For example, if the


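    Table 4. The search process for an I-automaton

    Input I-string: $s_1 s_2 s_3 s_4 s_5 = v^{50}\, g\, v^{100}\, g\, v^{200}$. Every path starts from the initial state i. In the constraints, $p_j$ denotes $I(p_j)$, and $x_k > 0$ for k = 1, 2, ....

    Input | Path | No. | New constraint | Effective constraints
    $v^{50}$ | $p_1 = i \xrightarrow{x_1} 1$ | (1) | $p_1 = x_1 = 50$ | {1}
    g | $p_2$: no candidate, fail | (2) | |
    $v^{50}$ | $p_1 = i \xrightarrow{x_2} 1 \xrightarrow{x_3} 2$ | (3) | $p_1 = x_2 + x_3 = 50$, $0 < x_3 \le 30$ | {3}
    g | $p_2 = 2 \to 3$ | (4) | | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_4} 4$ | (5) | $p_3 = x_4 = 100$, $0 < x_4 \le 10$; no solution, fail | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_5} 4 \to f$ | (6) | $p_3 = x_5 = 100$, $0 < x_5 \le 10$; no solution, fail | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_6} 4 \to 5$ | (7) | $p_3 = x_6 = 100$, $0 < x_6 \le 10$; no solution, fail | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_7} 4 \to 5 \xrightarrow{x_8} 6$ | (8) | $p_3 = x_7 + x_8 = 100$, $0 < x_7 \le 10$ | {3, 8}
    g | $p_4$: no candidate, fail | (9) | | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_9} 4 \to 5 \xrightarrow{x_{10}} 6 \to f$ | (10) | $p_3 = x_9 + x_{10} = 100$, $0 < x_9 \le 10$ | {3, 10}
    g | $p_4$: no candidate, fail | (11) | | {3}
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_{11}} 4 \to 5 \xrightarrow{x_{12}} 6 \to 5$ | (12) | $p_3 = x_{11} + x_{12} = 100$, $0 < x_{11} \le 10$ | {3, 12}
    g | $p_4 = 5 \to 6$ | (13) | | {3, 12}
    $v^{200}$ | $p_5$: no candidate, fail | (14) | | {3, 12}
    g | $p_4 = 5 \to 6 \to f$ | (15) | | {3, 12}
    $v^{200}$ | $p_5$: no candidate, fail | (16) | | {3, 12}
    g | $p_4 = 5 \to 6 \to 5$ | (17) | | {3, 12}
    $v^{200}$ | $p_5 = 5 \xrightarrow{x_{13}} 6$ | (18) | $p_5 = x_{13} = 200$ | {3, 12, 18}
    no more symbol | no final state, fail | (19) | | {3, 12}
    $v^{200}$ | $p_5 = 5 \xrightarrow{x_{14}} 6 \to f$ | (20) | $p_5 = x_{14} = 200$ | {3, 12, 20}
    no more symbol | final state, success | (21) | | {3, 12, 20}

    Resulting path at success:

    Input | Path | No. | Constraint
    $v^{50}$ | $p_1 = i \xrightarrow{x_2} 1 \xrightarrow{x_3} 2$ | (3) | $p_1 = x_2 + x_3 = 50$, $0 < x_3 \le 30$
    g | $p_2 = 2 \to 3$ | (4) |
    $v^{100}$ | $p_3 = 3 \xrightarrow{x_{11}} 4 \to 5 \xrightarrow{x_{12}} 6 \to 5$ | (12) | $p_3 = x_{11} + x_{12} = 100$, $0 < x_{11} \le 10$
    g | $p_4 = 5 \to 6 \to 5$ | (17) |
    $v^{200}$ | $p_5 = 5 \xrightarrow{x_{14}} 6 \to f$ | (20) | $p_5 = x_{14} = 200$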

    first 15 min is both drama and classic, then we associate it with a single attribute "classic drama". The classic drama consists of classic and drama; such an algebraic structure of attributes has already been discussed in [3, 5, 7], and is omitted from this chapter.


    References

    1. James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.

    2. John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.

    3. H. Khalfallah and A. Karmouch. An architecture and a data model for integrated multimedia documents and presentational approach. ACM Multimedia Systems, 3(5/6):238–250, 1995.

    4. Ken Nakayama, Kazunori Yamaguchi, and Satoru Kawai. I-regular expression: regular expression with continuous interval constraints. In CIKM '97: Proceedings of the sixth international conference on Information and knowledge management, pages 40–50, New York, NY, USA, 1997. ACM.

    5. E. Oomoto and K. Tanaka. OVID: Design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering, 5(4):629–643, 1993.

    6. D. Perrin. Finite automata. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 1–57. The MIT Press/Elsevier, Cambridge, MA/Amsterdam, 1990.

    7. R. Weiss, A. Duda, and D. K. Gifford. Composition and search with a video algebra. IEEE Multimedia, 2(1):12–25, 1995.

    8. Gerhard A. Schloss and Michael J. Wynblatt. Providing definition and temporal structure for multimedia data. ACM Multimedia Systems, 3(5/6):264–277, 1995.

    9. Setrag Khoshafian and A. Brad Baker. MultiMedia and imaging databases, chapter 7.2, pages 333–338. Morgan Kaufmann, San Francisco, 1996.

  • Part III

    Computational Intelligence in Image/Audio Processing

  • Noise Filtering of New Motion Capture Markers Using Modified K-Means

    J.C. Barca, G. Rumantir, and R. Li

    Department of Information Technology, Monash University, Melbourne, Australia
    jan.barca@infotech.monash.edu.au, Grace.Rumantir@infotech.monash.edu.au, Ray.Li@bigpond.net.au

    Summary. This report presents a detailed description of a new set of multicolor Illuminated Contour-Based Markers, to be used for optical motion capture, and of a modified K-means algorithm that can be used for filtering out noise in motion capture data. The new markers provide solutions to central problems with current standard spherical flashing LED-based markers. The modified K-means algorithm, which can be used for removing noise in optical motion capture data, is guided by constraints on the compactness and number of data points per cluster. Experiments on the presented algorithm and findings in the literature indicate that this noise-removing algorithm outperforms standard filtering algorithms such as Mean and Median because it is capable of completely removing noise with both Spike and Gaussian characteristics. The cleaned motion data can be used for accurate reconstruction of captured movements, which in turn can be compared to ideal models such that ways of improving physical performance can be identified.

    1 Introduction

    This report is part of a body of research that aims to develop an automated intelligent personal assistant, which can facilitate classification of complex movements and assist in goal-related movement enhancement endeavors. The overall research is divided into two major phases. The first of these phases aims to develop a personal assistant that will support athletes in improving their physical performance. To construct this personal assistant, a new cost-effective motion capture system, which overcomes the limitations in existing systems and techniques that support intelligent motion capture recognition, must be developed. Phase two of the overall research focuses on developing a physical prototype of the Multipresence system suggested by the author in [1, 2]. This Multipresence system will be constructed in a way that allows the personal assistant to control it using intelligent motion capture recognition techniques.

    J.C. Barca et al.: Noise Filtering of New Motion Capture Markers Using Modified K-Means, Studies in Computational Intelligence (SCI) 96, 167–189 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008


    The report focuses on the first phase of the overall research. To complete this phase, a number of areas must be investigated. These areas include:

    1. Camera and volume calibration
    2. Construction of a new marker system, which does not have the limitations associated with Classical spherical flashing LED-based markers
    3. Motion data capturing and pre-processing
    4. Noise filtering
    5. Marker centre point estimation
    6. 2D to 3D conversion of marker coordinates
    7. Construction, fitting and temporal updating of skeleton
    8. Development of an intelligent motion recognition system

    A brief overview of general motion capture techniques is provided first, with the focus being on marker-based optical motion capture. Proposed solutions to points two, three and four from the above list will then be explained in detail. As a response to point two, a new set of multicolor Illuminated Contour-Based Markers [3] is presented. A dimensionality reduction procedure, which simplifies the captured motion data so that further processing becomes less complex, is then proposed as a solution to point three. Finally, a modified K-means algorithm [4], which can be used for inter-frame noise reduction in images with optical motion capture data, is presented as a solution to point four.

    1.1 Motion Capture

    Motion capture systems are tools for accurately capturing complex real-world movements. Typically these captured movements are used in the movie, animation and games industries, where high-quality representations of movements are required in order to support suspension of disbelief. More recently, motion capture has also been used as a tool to aid in human motion analysis. Results from this kind of analysis can be used to identify ambiguities in the physical performance of athletes, or to assist in diagnosing people with illnesses that affect their movement [5]. Some research also indicates that motion capture can be used for controlling humanoid robots [6].

    There is a range of different motion capture technologies available. These technologies span from optical, magnetic, mechanical, structured light, radio frequency and acoustic systems to wearable resistive strips and inertial sensing systems, or combinations of the above [7]. All these technologies have drawbacks of varying degree. Optical, acoustic and structured light systems suffer from occlusion problems; magnetic and radio frequency trackers suffer from noise and echo problems; mechanical systems have a non-user-friendly interface that undermines immersion; inertial sensors suffer from bias and drift errors; and resistive strips must be built into a body suit, which makes them difficult to calibrate for different users [7–10]. Another drawback of many of the abovementioned systems is that they are high-end and therefore quite expensive, which makes it hard for many individuals and small companies to acquire the necessary technology [5, 11].

    The optical approach to motion capture has been selected for this research. The reason for this is that the intention is to capture movements in controlled environments, so the occlusion problems usually associated with the optical approach will be limited. Other reasons for choosing this approach are that this class of systems has proved to support accurate capturing, has only limited noise problems, does not suffer from echo problems, and can be constructed in cost-effective ways. Optical systems can also easily be designed in a way that does not limit the user's freedom of movement. Another important factor for selecting the optical approach is that capturing can be performed in real time.

    1.2 Optical Motion Capture

    What systems that use the optical approach to motion capture have in common is that they use cameras as sensors. In general, this class of systems can be divided into two subcategories, referred to as marker-less and marker-based approaches to optical motion capture. This research will focus on marker-based approaches, because currently only these can track complex and detailed motions effectively enough to support real-time processing [11].

    1.3 Marker-Based Tracking

    In early motion capture systems, most contour points of the tracked subject were suppressed in order to achieve real-time processing. The points that were not suppressed were referred to as markers [12]. Today, in order to qualify as a marker, an object must contain two pieces of information: what the object is in relation to the current process and where this object is located [13]. Currently there are two main types of markers: Passive and Active. Both marker types are described briefly below.

    Passive Markers

    The characteristic of passive marker systems is that the markers must be manually identified. A Classical passive system is constructed of spheres that are 2.5 cm in diameter and are covered with a highly reflective material that often is over two thousand times brighter than a normal white surface [14]. The material covering the marker reflects light (in many cases infrared) projected from light sources positioned around the lens of each camera. These reflections give the markers a distinctive color compared to the rest of the image and therefore support marker extraction. A Classical passive marker is shown in Fig. 1.

    Fig. 1. A classical spherical marker [15]

    The main drawback with passive systems is that either a trained human operator or a specific start-up pose of the performer is required for identifying the markers. A second drawback is that even if all markers have been correctly identified initially, their ID will be lost after an occlusion. As a result, it seems as if a new unknown marker emerges when an occluded marker reappears [16]. In addition to contributing to the generation of false markers, these occlusions can create holes in incoming data streams [17–20].

    Active Markers

    What active marker systems have in common is that they express sufficient information to support automatic marker identification. There are several variations of active marker systems, such as the square markers presented by [21] and the retro-reflective mesh-based markers presented by [6], but the most commonly used active marker is constructed of sets of spherical flashing light emitting diodes (LEDs). Each of the LEDs in these commonly used markers is wired to an external computer, which provides them with distinctive flash sequences that allow each marker to communicate its ID automatically. The computer also ensures that the markers flash in synchronization with the digital shutters of the capturing cameras [16, 22, 23].

    A drawback with Classical spherical LED-based active markers is that more than one image must be analyzed in order to identify each marker, which makes the processing time longer than if methods that support more direct identification were used. One such direct method is to use static colors to express IDs rather than flash sequences. The problem here is that colors tend to change when they are exposed to different lighting [24]. Knowledge about the motion of tracked markers has therefore been used to support the color cue, but there are difficulties associated with this approach as well. This is because of severe discontinuities in human motion and delay in frame processing [25, 26].


    A second problem with the flashing LED type active markers is that the wires that run from the markers to the computer restrict the user's freedom of movement [22, 23]. The result is that captured movement in some situations can appear unnatural and that the tracking process may be too cumbersome for use in some applications, especially in medical applications where users may have some kind of movement disability. Both the former and the latter are highly undesirable. The former because a tracking system that in any way makes the movement appear unnatural undermines one of the central aims of motion capture, which is to capture realistic movement (this is also the drawback with the constraints posed on the users of the markers presented by [24]). The latter because a system design that makes the tracking process cumbersome prevents a range of people from benefiting from the technology. A third drawback of flashing spherical LED type markers is that, as with spherical passive markers, they easily create holes in incoming data streams as a result of occlusions.

    1.4 Proposed Solution to Drawbacks with Classical Markers

    To solve and/or reduce the abovementioned drawbacks with current marker systems, the researcher proposes the set of active multicolor Illuminated Segment-Based Markers described by the author in [3]. These markers express their identity using different pairs of long intersecting line segments with internally produced static colors. These colors are illuminated into the environment and are therefore more robust towards changes in external lighting than colors produced by reflected light. This way of solving the identification problem gives the markers an edge over Classical spherical LED-based active markers, because static color cues allow markers to be identified within one single image rather than through a sequence of images, which reduces processing time. The use of static colors also eliminates the need for wiring markers to a complex external computer, removing the restrictions usually posed on user movement by Classical flashing LED-based marker systems. Another central strength of the Illuminated Segment-Based Markers is that they support more robust estimation of missing data than traditional markers. This is because the proposed markers allow for both intra-frame interpolation of missing data and inter-frame estimation of occluded intermediate sections of line segments. This strength is highlighted by the fact that the Illuminated Segment-Based Markers are designed to be larger than traditional markers and therefore have a greater chance of retaining enough data to estimate missing marker positions inter-frame than Classical markers. This in turn results in a reduced chance of having to assume intra-frame linearity in the case of occlusions.

    Design specifics and results from experiments on the Illuminated Segment-Based Markers are described at greater length in Sects. 2 and 3.


    1.5 Characteristics of Optical Motion Capture Data

    High dimensionality and noise are naturally embedded in time series data and make it a challenging task to process sequences of motion data [27]. To solve this problem in an effective way, initial processing should involve a dimensionality reduction procedure, which simplifies the data. Such reductions are typically performed by flattening regions where data varies only gradually, or not at all [28–30]. Noise can in general be referred to as any entity that is uninteresting for achieving the main goal of the computation [31]. These uninteresting entities can be introduced to optical motion data as a result of the constant fluctuation of light, interference of background objects, external artifacts that corrupt the analogue-to-digital conversion process, accuracy limitations of sensors or transmission errors [7, 8, 31]. It is important to notice that some types of noise may be invisible initially but can accumulate over time, resulting in increased data complexity and/or data being incorrectly classified [7, 28, 32, 33]. To avoid this, one should aim to exclude as much noise from the data as possible before main processing is initiated. To remove noise most effectively, one should investigate where it originates and analyze its characteristics, so that the knowledge obtained can be used for designing a suitable filtering algorithm for the noise at hand.

    2 Experimental Design

    In this section, we describe the strengths of the Illuminated Contour-Based Marker System and explain how the markers are assembled. Then a description of the nature of the captured data and an outline of how data is captured and pre-processed is provided. At the end of the section, we present a detailed design overview of the proposed Modified K-means algorithm, which is used for removing inter-frame noise in optical motion capture data.

    2.1 The Illuminated Contour-Based Marker System

    The Illuminated Contour-Based Markers are constructed of intersecting pairs of 3 mm thick, battery-powered, flexible glow wires of different colors. These glow wires are made of copper wires with phosphorus powder coatings and are protected by insulating plastic in different colors. The wires operate on alternating current supplied by a small battery-driven inverter. When a current is transmitted through a wire, the phosphorus produces an electroluminescent glow [34]. The appearance of this glow depends on the color of the insulating plastic layer covering the wires. Ten different types of glow wire are available on the market today. A glow wire can be observed in Fig. 2. The glow wires are cut into appropriate lengths, and pairs of wires with different colors are assembled into markers in such a way that the two wires intersect and each marker is identifiable by its distinctive color combination. The intersection between wires is regarded as being the marker midpoint. Sets of Illuminated Contour-Based Markers are shown in Fig. 3.

    Fig. 2. Glow wire

    Fig. 3. The Illuminated Contour-Based Markers. Each pair of line segments illuminates a set of distinctive colours

    2.2 The Body Suit

    The assembled markers are attached to a body suit to be worn by the subject to be tracked during the motion capture procedure. In order for this body suit to support realistic and accurate tracking, it requires some essential characteristics. First, it must not restrict the user's freedom of movement. Secondly, it is important that the material the body suit is constructed of is able to closely follow the movement of the tracked body and stay in place as the skin moves underneath [35]. After experimenting with different types of materials and suit designs, the researcher found that tight-fitting, lightweight thermal underwear and socks have the abovementioned qualities.

    Fig. 4. A prototype of the body suit with Illuminated Contour-Based Markers attached

    As the body suit needs to be washed after being used, the markers are designed to be temporarily attached to the suit using Velcro instead of being permanently attached. Strips of Velcro patches were therefore glued to the suit at key positions so that the markers can be attached to them (how these key positions are selected is described below).

    In order to allow the suit to be adjusted to accommodate small variations in body size and shape, these patches of Velcro were made long enough to allow for fine tuning of marker positions. The complete body suit can be observed in Fig. 4.

    A small battery-driven inverter that supplies the markers with electricity is placed on the lower back region of the body suit. This location has been selected because it causes minimal interference with the user's body movement.

    2.3 Marker Placement

    To support a motion capturing process with minimum interference of noise, it is important to identify positions on the tracked body that are suitable for marker attachment. These key positions should allow the markers to remain in stable relationships with the underlying skeleton as the body moves. One thing that can affect this relationship is secondary motion in soft body tissue [14, 36]. In order to avoid capturing these secondary movements, the researcher has chosen to place the markers on areas of the body where the skin is close to the bone (e.g. elbows, knees and wrists). Figure 5 shows a virtual skeleton rigged with a set of Illuminated Contour-Based Markers.

    Fig. 5. Virtual skeleton rigged with Illuminated Contour-Based Markers

    2.4 The Motion Capture Images

    Series of images have been captured of an articulated human body rigged with the Illuminated Contour-Based Markers. These images have an identical size of 720 × 576 pixels and the color space used is RGB. All images were captured using four different calibrated cameras, placed in a circle around the capturing volume. More colors appear in the captured images than those used in the original Illuminated Contour-Based Markers System, as a result of small differences in sensing devices within and across these cameras. This results in excessive image complexity, which contributes to increasing processing time. To solve this problem, each image is pre-processed (as explained in Sect. 2.5).

    Because image features change over time and across capturing devices, the images used in the experiments have been selected randomly across both cameras and time steps, to ensure that the proposed system is able to process all features correctly.

    2.5 Data Pre-Processing

    To reduce the complexity of captured images, uninteresting image components are filtered out as background in pre-processing using a thresholding technique. Data that is valid for the main processing is compressed into a number of flat colour regions, corresponding to the number of colours used in the marker system. Tolerance values for each of these regions have been determined through multiple trial and error experiments.
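
    The thresholding step can be illustrated with a minimal Python sketch. It assumes that each marker colour has a reference RGB value and a tolerance, and that a pixel belongs to a flat colour region when every channel lies within that tolerance of the reference; all other pixels are treated as background. The reference values, tolerances and function names are hypothetical and are not the ones used in the experiments.

```python
# Hypothetical reference colours and tolerances (illustrative values only).
REFERENCE_COLOURS = {
    "red": ((220, 40, 40), 50),
    "green": ((40, 200, 60), 50),
    "blue": ((40, 60, 220), 50),
}


def flatten_pixel(rgb, references=REFERENCE_COLOURS):
    """Return the flat colour region of one RGB pixel, or None for background."""
    for name, (ref, tol) in references.items():
        if all(abs(channel - target) <= tol for channel, target in zip(rgb, ref)):
            return name
    return None


def flatten_image(image, references=REFERENCE_COLOURS):
    """Map a 2D list of RGB tuples to a 2D list of region labels (None = background)."""
    return [[flatten_pixel(px, references) for px in row] for row in image]


image = [
    [(230, 30, 45), (10, 10, 10)],
    [(35, 210, 70), (50, 50, 230)],
]
labels = flatten_image(image)
```

    A wider tolerance makes classification more robust across cameras but leaves less room in the colour space for additional marker colours.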


    2.6 Modified K-Means Algorithm for Noise Filtering

    When data has been pre-processed, the Modified K-means algorithm is used to clean up noise embedded in each image by creating clusters of pixels based on their relative spatial positions in the image. Following the classical K-means algorithm [27, 28, 37–43], the Euclidean Distance measure shown in (1) is used to determine which cluster a pixel belongs to. Each pixel is put into the cluster that yields the minimum Euclidean Distance between the pixel and the respective centroid. The centroid of each cluster is changed iteratively by calculating its new coordinate as the average of the coordinates of the pixels in the cluster, until it converges to a stable coordinate with a stable set of member pixels in the cluster. In each iteration, the membership of each cluster keeps changing depending on the result of the Euclidean Distance calculation of each pixel against the new centroid coordinates.

    $$d_{ic} = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}, \tag{1}$$

    where:

    $d_{ic}$: the Euclidean distance between pixel $i$ and a centroid $c$
    $(x_i, y_i)$: the 2D coordinate of pixel $i$
    $(x_c, y_c)$: the 2D coordinate of centroid $c$
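
    The assignment step based on (1) can be sketched in a few lines of Python. The function names are illustrative only, and pixels and centroids are assumed to be plain (x, y) tuples.

```python
import math


def euclidean_distance(pixel, centroid):
    """Distance d_ic between pixel (x_i, y_i) and centroid (x_c, y_c), as in (1)."""
    (xi, yi), (xc, yc) = pixel, centroid
    return math.sqrt((xi - xc) ** 2 + (yi - yc) ** 2)


def nearest_centroid(pixel, centroids):
    """Index of the centroid with the minimum Euclidean distance to the pixel."""
    return min(range(len(centroids)), key=lambda c: euclidean_distance(pixel, centroids[c]))


centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster_of_pixel = nearest_centroid((2, 1), centroids)
```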

    The modifications to the classical K-means algorithm lie in the definition of a data vis-a-vis noise cluster and the automation of the determination of the optimum number of clusters an image should have. A cluster is considered noise if it only has a few pixels in it. The minimum number of pixels in a cluster, or the cluster size, should be set such that it minimizes the degree of false positives (i.e. data clusters incorrectly classified as noise) and false negatives (i.e. noise clusters incorrectly classified as data). The minimum cluster size is domain specific and is determined by observing the number of data points usually found in a noise cluster for the type of data at hand. In this experiment, the minimum number of pixels in a cluster is set to 4 after a few trial and error processes.
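
    As a small illustration of the size constraint, the following Python sketch separates clusters into data and noise. It assumes that a cluster with fewer pixels than the minimum is treated as noise; the function name is hypothetical.

```python
MIN_CLUSTER_SIZE = 4  # minimum number of pixels per cluster reported for this experiment


def split_data_and_noise(clusters, min_size=MIN_CLUSTER_SIZE):
    """Separate clusters into data (at least min_size pixels) and noise (fewer pixels)."""
    data = [c for c in clusters if len(c) >= min_size]
    noise = [c for c in clusters if len(c) < min_size]
    return data, noise


clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)],  # five pixels: kept as data
    [(40, 7)],                                 # isolated pixel: noise
    [(12, 30), (13, 30)],                      # two pixels: noise
]
data, noise = split_data_and_noise(clusters)
```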

    The compactness of a cluster is used to determine the optimum number of clusters for a given image. In this paper, the degree of compactness of a cluster is defined as the proportion of pixels occupying the rectangle formed by the pixels located at the outermost positions of the cluster (i.e. the pixels that have the maximum and minimum X and Y coordinates, respectively). A cluster that has a lower degree of compactness than the specified value will be split further. In this experiment, the degree of compactness used is 20%, which is a value just below the minimum compactness of valid data clusters for the observed domain.
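
    The compactness test can be sketched as follows in Python. The sketch assumes that the size of the compactness window is the inclusive pixel area of the bounding rectangle, which is one plausible reading of the definition above; the function names are illustrative.

```python
MIN_COMPACTNESS = 0.20  # 20%, the value used in this experiment


def compactness(cluster):
    """Number of pixels divided by the area of the cluster's bounding rectangle."""
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    window = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return len(cluster) / window


def needs_split(cluster, min_compactness=MIN_COMPACTNESS):
    """A cluster below the compactness threshold is split further."""
    return compactness(cluster) < min_compactness


square = [(0, 0), (0, 1), (1, 0), (1, 1)]  # fills its window completely
diagonal = [(0, 0), (4, 4)]                # 2 pixels in a 5 x 5 window
```

    Here the square has a compactness of 1.0, while the two distant pixels give 2/25 = 0.08 and would therefore be split.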

    The modified K-means algorithm performs local search using randomly generated initial centroid positions. It is a known problem that the determination of the initial centroid positions plays a big part in the resulting clusters and their compositions [29, 38, 44–47]. In order to reduce this problem and to make the search mechanism more exhaustive, ten clustering exercises using ten different initial centroid positions are performed for each image. The result of the exercise that produces clusters with the maximum total degree of compactness is selected. If a set of data cannot be separated linearly, we discard the run and initiate the algorithm again with different initial cluster positions. The processed data is finally plotted, in order to allow for easy inspection of results.

    A detailed overview of the Modified K-means algorithm is presented in Table 1.

    Table 1. Modified K-means algorithm for noise reduction in optical motion capture data

    Procedure: modified K-means algorithm for noise reduction in optical motion capture data

      Set minimum number of data points per cluster   // cluster size constraint
      Set minimum cluster compactness                 // cluster compactness constraint

      For a set number of experiments do
        Set initial cluster centroids
        Set iterationFlag to yes

        While iterationFlag = yes do
          Set iterationFlag to no

          // Basic K-means
          Repeat
            Calculate the distance between data points and each cluster centroid
            Assign each data point to cluster
            Calculate the new cluster centroids
          Until all clusters have converged

          // Filter clusters based on minimum cluster size constraint
          For each cluster
            If cluster has too few data points then
              Delete cluster
            End if
          End For

          // Filter clusters based on cluster compactness constraint
          For each cluster
            // Find corners of compactness window
            Find data points with minimum and maximum X values
            Find data points with minimum and maximum Y values
            Define cluster compactness window size

            Calculate the number of data points in cluster
            Calculate cluster compactness = number of data points / compactness window size

            If cluster compactness < minimum compactness then
              Split cluster into two
              Set iterationFlag to yes
            Else
              Record cluster compactness
              Remove cluster and content from analysis
            End if
          End For

          If iterationFlag = no then
            Calculate the average compactness of all clusters in the experiment
          End if
        End while
      End For

      Select set of clusters from experiment with the highest average compactness
    End Procedure
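
    The procedure in Table 1 can be approximated by the following Python sketch. It is one possible reading of the table rather than the implementation used in the experiments: "split cluster into two" is interpreted as re-clustering the offending cluster with two centroids, the initial number of centroids `k` and the random seed are assumptions, and all function names are hypothetical.

```python
import math
import random


def kmeans(points, centroids, max_iter=100):
    """Basic K-means on 2D points; returns the non-empty clusters."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        clusters = [c for c in clusters if c]
        updated = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                   for c in clusters]
        if updated == centroids:  # centroids are stable: converged
            break
        centroids = updated
    return clusters


def compactness(cluster):
    """Data points divided by the inclusive area of their bounding window."""
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    return len(cluster) / ((max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1))


def modified_kmeans(points, k=2, min_size=4, min_compactness=0.2, experiments=10, seed=0):
    """Return the data clusters of the experiment with the highest average compactness."""
    rng = random.Random(seed)
    points = sorted(set(points))  # pixel coordinates are assumed to be unique
    best_score, best_clusters = -1.0, []
    for _ in range(experiments):
        accepted, pending = [], [(points, k)]
        while pending:
            pts, n = pending.pop()
            clusters = kmeans(pts, rng.sample(pts, min(n, len(pts))))
            for cluster in clusters:
                if len(cluster) < min_size:
                    continue  # size constraint: treat as noise and delete
                if compactness(cluster) >= min_compactness:
                    accepted.append(cluster)  # record and remove from analysis
                elif len(cluster) < len(pts):
                    pending.append((cluster, 2))  # split into two and iterate again
                # otherwise no progress was made; the cluster is dropped as a guard
        if accepted:
            score = sum(compactness(c) for c in accepted) / len(accepted)
            if score > best_score:
                best_score, best_clusters = score, accepted
    return best_clusters


blob = [(x, y) for x in range(3) for y in range(3)]
far_blob = [(x, y) for x in range(20, 23) for y in range(20, 23)]
spikes = [(10, 0), (0, 15), (22, 5)]
cleaned = modified_kmeans(blob + far_blob + spikes)
```

    Every returned cluster satisfies both constraints by construction. Which pixels end up in which cluster still depends on the random initial centroids, which is why several experiments are run and the most compact result is kept.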

    3 Experiment Results

    In this section we present results of experiments on pre-processing and intra-frame noise filtering in images captured from an articulated human body rigged with sets of Illuminated Contour-Based Markers.

    3.1 Recognizing Coloured Line Segments

    At present we have separated five of the ten different types of glow wires available on the market into distinct flat color regions in pre-processing, allowing ten different markers to be constructed. These recognized wires are classified as: Red, Orange, Green, Purple and Blue. Each of the remaining five wires appears to have color attributes that are so similar to a number of the remaining nine that they are hard to separate from the others. The separation problem is a result of sensing devices across cameras being slightly different, because this makes it necessary to employ an unnaturally wide color threshold for each color in order to support successful classification across cameras. This in turn makes the color space prematurely crowded, leaving no room for the remaining five unclassified line segments.
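
    The figure of ten markers follows from counting pairs: assuming each marker is an unordered pair of two different recognized colours, five colours give 5 · 4 / 2 = 10 combinations. A short Python check, with illustrative variable names:

```python
from itertools import combinations

recognised = ["Red", "Orange", "Green", "Purple", "Blue"]

# Each marker is assumed to be an unordered pair of two different wire colours.
markers = list(combinations(recognised, 2))
```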

    3.2 Noise Filtering

    Five types of experiments have been performed on the Modified K-means algorithm. The first experiment tests the algorithm's ability to remove synthetic spike noise from raw motion capture images. The second aims to find the spike noise level tolerated by the algorithm. This is done by introducing images with different levels of real spike noise to the algorithm and analyzing the output. The third tests how well the algorithm deals with real noise that has different Gaussian blur radii. This experiment is conducted in order to estimate the algorithm's ability to remove noise with different Gaussian characteristics. The fourth type of experiment is a set of comparisons between a commercially available Median filter [48], which is used for reducing noise in images, and the proposed modified K-means algorithm [4]. Finally, it is shown that the proposed modified K-means algorithm can also be used to remove noise in images with Classical spherical markers.

    Removing Synthetic and Real Spike Noise

    In the first experiment an image with spurious artificial spike noise has been cleaned. The result of this experiment can be observed in Fig. 6, where the noisy image is shown at the top (noisy pixels are encircled) and the cleaned version at the bottom. Here the white pixels represent the background, while the black pixels represent the components of the Illuminated Contour-Based Markers System and noise.

    Three of the images used in the second experiment, which involves finding the Modified K-means algorithm's spike noise level tolerance, are shown in Fig. 7. Here the leftmost image has 0%, the middle 8% and the rightmost 16% real spike noise (image contrast is increased in order to allow for easy inspection).

    Figure 8 shows the results of the experiment on real spike noise. The number of cleaned data points is displayed vertically, while the noise level is displayed horizontally as a percentage. One can observe that more than fifty percent of the original data points are still classified correctly at a noise level of 8%, while the algorithm still proved to remove noise effectively in images with noise levels up to 12%.
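
    For readers who wish to reproduce a comparable test image, synthetic spike noise of a given level can be generated as in the Python sketch below. It assumes that the noise level denotes the fraction of all image pixels that are flipped in a binary image (1 = marker or noise, 0 = background); how the real noise levels in the experiments were measured is not specified here, and the function name is hypothetical.

```python
import random


def add_spike_noise(image, level, seed=0):
    """Return a copy of a binary image with a fraction `level` of its pixels flipped."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    noisy = [row[:] for row in image]
    positions = [(r, c) for r in range(height) for c in range(width)]
    for r, c in rng.sample(positions, round(level * height * width)):
        noisy[r][c] = 1 - noisy[r][c]
    return noisy


background = [[0] * 10 for _ in range(10)]
noisy_8_percent = add_spike_noise(background, 0.08)
```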

    Removing Gaussian Noise

    In this experiment, Gaussian blur with varying radii is introduced to several copies of the noisy image at the top of Fig. 6, before the Modified K-means algorithm is used to clean the images. In Fig. 9, three of the processed images are presented (the leftmost image has a Gaussian blur pixel radius of 0, the middle a radius of 2, and the rightmost 4).
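
    A Gaussian blur of a given pixel radius can be sketched in plain Python as shown below. The mapping from "radius" to the Gaussian standard deviation differs between imaging tools; this sketch assumes that sigma equals the radius and truncates the kernel at three sigma, so it will not necessarily match the tool used in the experiments. The function names are illustrative.

```python
import math


def gaussian_kernel(radius):
    """Normalised 1D Gaussian weights; sigma is assumed equal to the blur radius."""
    if radius <= 0:
        return [1.0]
    half = math.ceil(3 * radius)
    weights = [math.exp(-(i * i) / (2 * radius * radius)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the image border."""
    half = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + i - half, 0), width - 1)] for i, k in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]


def gaussian_blur(image, radius):
    """Separable blur: blur the rows, transpose, blur again, transpose back."""
    kernel = gaussian_kernel(radius)
    blurred = blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*blurred)]
    return [list(col) for col in zip(*blur_rows(transposed, kernel))]


image = [[0] * 9 for _ in range(9)]
image[4][4] = 1  # a single spike in the centre
blurred = gaussian_blur(image, 1)
```

    The blur spreads the intensity of each pixel over its neighbourhood, so a single spike becomes a low, wide bump whose peak lies well below the original value.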


    Fig. 6. Top: A pre-processed motion capture image and noise in the form of irregular lighting can be observed. Bottom: The resulting cleaned image with noise removed

    Fig. 7. Images with Illuminated Contour-Based Markers and Spike noise of 0%, 8% and 16%

    Figure 10 shows how much data can be recaptured after noise with Gaussian characteristics has been removed. One can observe that the number of data points recaptured naturally decreases as the radius of the Gaussian blur increases. However, it is also shown that the degradation of performance occurs gradually, as opposed to abruptly, when the radius is increased up to 2.5 pixels. For this reason, it can be concluded that the modified K-means is capable of removing noise with Gaussian characteristics while keeping false positives to a minimum. This result is better than the performance of the Mean and Median filters, which are well known to only suppress (i.e. reduce) Gaussian noise rather than remove it [31].

    Fig. 8. Results from experiment on images with Illuminated Contour-Based Markers and Spike noise of 0, 4, 8 and 12%

    Fig. 9. Flattened images with Gaussian blur of 0, 2 and 4 pixels in radius before noise is removed

    Fig. 10. Cleaned data points recaptured after the removal of Gaussian blur noise with varying radii using the Modified K-means


    3.3 Comparisons: Modified K-Means vs. Median Filter

    Two types of comparisons have been conducted, both between a commercially available Median filter [48] that is used for reducing noise in images and the proposed modified K-means algorithm [4].
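
    For orientation, a generic 3 × 3 median filter can be sketched in Python as follows. This is a textbook formulation with clamped borders, not the commercial filter of [48], and the function name is illustrative.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood (clamped borders)."""
    half = size // 2
    height, width = len(image), len(image[0])

    def window(r, c):
        return [image[min(max(r + dr, 0), height - 1)][min(max(c + dc, 0), width - 1)]
                for dr in range(-half, half + 1) for dc in range(-half, half + 1)]

    return [[median(window(r, c)) for c in range(width)] for r in range(height)]


spike = [[0] * 5 for _ in range(5)]
spike[2][2] = 1                    # an isolated noise pixel
thin_line = [[1 if r == 2 else 0 for _ in range(5)] for r in range(5)]

cleaned_spike = median_filter(spike)
cleaned_line = median_filter(thin_line)
```

    In this sketch the isolated pixel is removed, but so is the one-pixel-wide line, which illustrates how a median filter can discard valid thin structures together with the noise.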

    Spike Noise Removal Comparisons

    In Fig. 11 one can observe results of an experiment where the two algorithms' ability to remove spike noise is analyzed. The level of Spike noise is increased in increments of 4% across four runs, starting at 0. The ideal number of data points after noise filtering is 747. All data is initially pre-processed. One can observe that the number of recaptured data points is lower for the Median filter in all test runs. This indicates that the modified K-means algorithm removes spike noise with a lower number of false positives than the Median filter. This indication is verified in Fig. 12, where the number of false positives across the same four runs is presented. One can observe that there are strong correlations between the increasing number of false positives and the level of Spike noise. The number of false negatives was zero across all runs.

    Fig. 11. Recaptured data after Spike noise filtering

    Fig. 12. Number of false positives in Spike noise experiments
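
    The quantities reported in these comparisons can be computed as in the Python sketch below, using the definitions given in Sect. 2.6: a false positive is data incorrectly classified as noise, and a false negative is noise incorrectly classified as data. The sketch assumes counting at the level of individual data points and that the clean reference image is known; the function name is hypothetical.

```python
def count_errors(clean_points, noisy_points, filtered_points):
    """Return (recaptured, false positives, false negatives) as numbers of data points."""
    clean, noisy, kept = set(clean_points), set(noisy_points), set(filtered_points)
    noise = noisy - clean
    recaptured = clean & kept        # valid data that survived filtering
    false_positives = clean - kept   # data removed as if it were noise
    false_negatives = noise & kept   # noise kept as if it were data
    return len(recaptured), len(false_positives), len(false_negatives)


clean = [(0, 0), (1, 0), (2, 0)]
noisy = clean + [(9, 9), (5, 5)]
filtered = [(0, 0), (1, 0), (9, 9)]
recaptured, false_pos, false_neg = count_errors(clean, noisy, filtered)
```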

    Gaussian Noise Removal Comparisons

    In Fig. 13 results from an experiment on a series of motion capture images with noise and increasing levels of Gaussian blur are presented. The Gaussian blur pixel radius is increased in increments of 0.5 pixels across 8 runs, starting at a radius of 0 pixels. One can observe that there are close correlations between the performance of the modified K-means algorithm and the Median filter as the blur levels increase. One can also observe that the number of correctly recaptured clean data points decreases gradually as the Gaussian blur radius increases.

    Figure 14 shows how the number of false positives increases as the Gaussian blur pixel radius becomes greater. One can observe that there are strong correlations between results from the modified K-means algorithm and the Median filter here as well. The number of false positives is still below fifty percent of the total number of data points when the Gaussian blur pixel radius is 2 pixels.

    In Fig. 15 one can observe the number of false negatives in the same experiments as above. The number of false negatives peaks at a Gaussian blur pixel radius of 0.5 for both the Median filter and the modified K-means algorithm. This peak is at the same point where the number of false positives is at its lowest.

    Fig. 13. Number of recaptured data points after images with noise and varying levels of Gaussian blur have been cleaned


    Fig. 14. Number of false positives as the Gaussian blur radius increases

    Fig. 15. Number of false negatives as the level of Gaussian blur increases

    3.4 Removing Noise in Images with Spherical Markers

    The modified K-means algorithm has also been tested on images with synthetic classical ball-style markers. These experiments show that the proposed algorithm is also capable of cleaning this type of data. An illustration of one of the results is given in Fig. 16, where the original image is presented to the left and the processed image to the right.

    3.5 Processing Time

    It is important to notice that processing time increases with each additional cluster centroid needed to analyze a dataset. Experiments show that if the level of noise is at 16% or above (this number depends on the color composition of the noise at hand and the threshold values set for each marker component in pre-segmentation), the calculation time becomes so great (when using one Pentium 4 processor) that the noise cleaning becomes impractical.


    Fig. 16. Left: A raw image generated from a synthetic ball marker. Right: The image with noise removed

    This problem can be dealt with in three ways. The first is to ensure that the capturing sensors and the tools used for data transfer introduce as little noise as possible. The second method, which only partially solves the problem, is to increase the value of the minimum number of data points per cluster constraint, so that more noisy data points can be removed from the dataset using a smaller number of cluster centroids. Here, it is important to notice that when the constraint value becomes greater than the number of data points usually clustered together in valid data, the number of false positives will increase. The third method is to increase processing power.

    4 Conclusion

    A set of Illuminated Contour-Based Markers for optical motion capture has been presented, along with a modified K-means algorithm that can be used for removing inter-frame noise. The new markers appear to have features that solve and/or reduce several of the drawbacks associated with other marker systems currently available for optical motion capture. Some of these features are:

    - Missing data can be estimated both inter-frame and intra-frame, which reduces the chances of complete marker occlusions without increasing the number of cameras used.

    - The system is robust toward changes in external lighting compared to markers that do not produce their own internal light.

    - Markers can be automatically identified in one single image.

    - The need for synchronizing camera shutters with flashing from markers is eliminated, which allows for tracking without wiring the markers to a complex computer.

    - The system has the potential to generate more markers than systems that use only one single color for marker identification.

    In the modified K-means algorithm, the modifications to the classical K-means algorithm take the form of constraints on the compactness and the number of data points per cluster. Here, clusters with a small number of data points are regarded as noise, while sparse clusters are split further. The value for the minimum number of data points per cluster constraint is domain specific and is determined by observing the number of data points usually found in a noise cluster for the type of data at hand. The value for the minimum compactness constraint should be set just below the minimum compactness of valid data clusters for the domain. Several experiments have been conducted on the noise filtering algorithm, and these show that flattening the images into six color regions in the data pre-processing stage assists further processing by reducing the number of dimensions the algorithm must cluster. Experiments also indicate that the modified K-means algorithm:

    - Manages to clean artificial and real spike noise in motion capture images with Illuminated Contour-Based Markers or classical spherical markers when the signal-to-noise ratio is up to 12%.

    - Is capable of completely removing Gaussian noise, with a gradual increase in false positives as the radius increases. This is a better result than that produced by traditional Median and Mean filters.

    - Reduces spike noise in images with Illuminated Contour-Based Markers in a way that results in fewer false positives than the Median filter is capable of.

    - Reduces Gaussian blur in images with Illuminated Contour-Based Markers with a similar number of false positives as the Median filter.
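    The two cluster constraints summarized in this section can be sketched as a per-cluster decision rule. This is a minimal illustration and not the authors' implementation: the compactness measure used here (one divided by one plus the mean distance to the centroid) is an assumption, since this section does not define compactness, and the names `classify_cluster`, `min_points` and `min_compactness`, as well as the threshold values, are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def compactness(points):
    """Assumed measure: 1 / (1 + mean distance to the centroid), so that
    tighter clusters score closer to 1."""
    c = centroid(points)
    mean_dist = sum(math.dist(p, c) for p in points) / len(points)
    return 1.0 / (1.0 + mean_dist)


def classify_cluster(points, min_points, min_compactness):
    """Apply the two constraints to one cluster."""
    if len(points) < min_points:
        return "noise"   # too few data points: discard as noise
    if compactness(points) < min_compactness:
        return "split"   # too sparse: cluster again with more centroids
    return "keep"


# Hypothetical clusters of 2D pixel coordinates.
tight = [(0, 0), (0, 1), (1, 0), (1, 1)]
sparse = [(0, 0), (0, 10), (10, 0), (10, 10)]
tiny = [(5, 5)]
labels = [classify_cluster(c, min_points=3, min_compactness=0.5)
          for c in (tight, sparse, tiny)]
print(labels)
```

    Each "split" decision adds cluster centroids, which is consistent with the observation in Sect. 3.5 that processing time grows with the number of centroids; raising `min_points` removes more points early but risks discarding small valid clusters.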

    5 Future Work

    A suitable algorithm for automatic marker midpoint estimation is currently being constructed. When a complete set of experiments has been conducted, future research will involve investigating a color calibration method that aims to synchronize the input from the capturing cameras, in order to allow more markers with distinctive color combinations to be generated. This calibration procedure will involve comparing the color values registered for the same object across cameras. Through the knowledge obtained from these comparisons, a correction matrix can be generated that guides the synchronization of input from different cameras. This synchronization process may in turn allow smaller regions of the color space to be assigned for classification of each marker component, resulting in a less crowded color space. This optimized use of color space may make room for new distinctive regions within the color space, which can be used for classifying more of the ten glow wires currently available on the market. It may also prove fruitful to research the use of a color space that has a separate channel for luminosity (such as Absolute RGB or HSV), so that luminosity information can be removed from further analysis. The benefit would be that the color values registered for each glow wire would be more stable as the distance between wires and cameras changes. This may in turn allow smaller regions of the color space to be associated with each wire, allowing further optimization of the color space separation.
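    The idea of removing luminosity from the analysis can be illustrated with Python's standard `colorsys` module. This is a minimal sketch under an idealized assumption, namely that a greater distance only scales the brightness of the registered color; the RGB readings and the helper name `chromaticity` are hypothetical.

```python
import colorsys


def chromaticity(rgb):
    """Convert an (r, g, b) triple in [0, 1] to HSV and keep only hue and
    saturation, discarding the value (brightness) channel."""
    h, s, _v = colorsys.rgb_to_hsv(*rgb)
    return h, s


# Hypothetical readings of the same red glow wire, near and far from the camera.
near = (0.8, 0.2, 0.2)
far = (0.4, 0.1, 0.1)  # same color at half the brightness
print(chromaticity(near), chromaticity(far))
```

    In this idealized case the raw RGB triples differ, while hue and saturation agree up to floating-point rounding, so a narrower region of the color space would suffice to classify the wire.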

    When the above is completed, the research focus will be on investigating methods that allow for automatic 2D to 3D conversion of marker coordinates, before focus is shifted to researching and implementing techniques that allow a virtual skeleton to be fitted to incoming motion data and tracked over time. Finally, ideal motion models will be captured and the intelligent motion recognition system designed, before the second major research phase (which involves constructing the Multipresence system described by the author in [1, 2]) is initiated.


    References

    1. Barca J C, Li R (2006) Augmenting the Human Entity through Man/Machine Collaboration. In: IEEE International Conference on Computational Cybernetics. Tallinn, pp 69–74

    2. Barca J C, Rumantir G, Li R (2008) A Concept for Optimizing Behavioural Effectiveness & Efficiency. In: Machado T, Patkai B, Rudas J (eds) Intelligent Engineering Systems and Computational Cybernetics. Berlin Heidelberg New York, Springer, pp 477–486

    3. Barca J C, Rumantir G, Li R (2006) A New Illuminated Contour-Based Marker System for Optical Motion Capture. In: IEEE Innovations in Information Technology. Dubai, pp 1–5

    4. Barca J C, Rumantir G (2007) A Modified K-means Algorithm for Noise Reduction in Optical Motion Capture Data. In: 6th IEEE International Conference on Computer and Information Science. Melbourne, pp 118–122

    5. Jobbagy A, Komjathi L, Furnee E, Harcos P (2000) Movement Analysis of Parkinsonians. In: 22nd Annual EMBS International Conference. Chicago, pp 821–824

    6. Tanie H, Yamane K, Nakamura Y (2005) High Marker Density Motion Capture by Retroreflective Mesh Suit. In: IEEE International Conference on Robotics and Automation. Barcelona, pp 2884–2889

    7. Bachmann E (2000) Inertial and Magnetic Tracking of Limb Segment Orientation for Inserting Humans into Synthetic Environments. PhD thesis, Naval Postgraduate School

    8. Clarke A, Wang X (1998) Extracting High precision information from CCD images. In: Optical Methods and Data Processing for Heat and Fluid Flow. City University, pp 1–11

    9. Owen S (1999) A practical Approach to Motion Capture: Acclaim's optical motion capture system. Retrieved Oct 2, 2005. Available at www.siggraph.org/education/materials/HyperGraph/animation/character animation/motioncapture/motion optical

    10. Sabel J (1996) Optical 3d Motion Measurement. In: IEEE Instrumentation and Measurement Technology. Brussels, pp 367–370

    11. Oshita M (2006) Motion-Capture-Based Avatar Control Framework in Third-Person View Virtual Environments. In: ACM SIGCHI International Conference on Advantages in Computer Entertainment Technology ACE06. New York


    12. Furnee E (1988) Motion Analysis by TV-Based Coordinate Computing in Real Time. In: IEEE Engineering in Medicine and Biology Society's 10th Annual International Conference. p 656

    13. Bogart J (2000) Motion Analysis Technologies. In: Pediatric Gait. A new Millennium in Clinical Care and Motion Analysis Technology. pp 166–172

    14. Shaid S, Tumer T, Guler C (2001) Marker Detection and Trajectory Generation Algorithms for a Multicamera based Gait Analysis System. In: Mechatronics 11: 409–437

    15. LeTournau University (2005) LeTournau University. Retrieved Nov 30, 2005. Available at www.letu.edu

    16. Kirk G, O'Brien F, Forsyth A (2005) Skeletal Parameter Estimation from Optical Motion Capture Data. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR05). pp 782–788

    17. Brill F, Worthy M, Olson T (1995) Markers Elucidated and Applied in Local 3-Space. In: International Symposium on Computer Vision. p 49

    18. Wren C, Azarbayejani A, Darrel T, Pentland A (1997) Pfinder: Real-time Tracking of the Human Body. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 19: 780–785

    19. Arizona State University (2006) Arizona State University. Retrieved Apr 27, 2006. Available at www.asu.edu

    20. Ringer M, Durmond T, Lasenby J (2001) Using Occlusions to aid Position Estimation for Visual Motion Capture. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR01). pp 464–469

    21. Kawano T, Ban Y, Uehara K (2003) A Coded Visual Marker for Video Tracking System Based on Structured Image Analysis. In: 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality. Washington, p 262

    22. Fioretti S, Leo T, Pisani E, Corradini L (1990) A Computer Aided Movement Analysis System. In: IEEE Transaction on Biomedical Engineering 37: 812–891

    23. Tekla P (1990) Biomechanically Engineered Athletes. In: IEEE Spectrum 27: 43–44

    24. Zhuang Y, Zhu Q, Pan Y (2000) Hierarchical Model Based Human Motion Tracking. In: International Conference on Image Processing. Vancouver, pp 86–89

    25. Kang J, Cohen I, Medoni G (2003) Continuous Tracking Within and Across Camera Streams. In: IEEE Conference on Computer Vision and Pattern. Wisconsin, pp 267–272

    26. Sherrah J, Gong S (2000) Tracking Body Parts using Probabilistic Reasoning. In: 6th European Conference on Computer Vision. Dublin

    27. Tatsunokuchi, Ishikawa, Minyang, Sichuan (2004) An evolutionary K-means algorithm for clustering time series data. In: 3rd International Conference on Machine Learning and Cybernetics. Shanghai, pp 1282–1287

    28. Chen H, Kasilingam D (1999) K-Means Classification Filter for Speckle Removal in Radar Images. In: Geoscience and Remote Sensing Symposium. Hamburg, pp 1244–1246

    29. Lee H, Younan H (2000) An Investigation into Unsupervised Clustering Techniques. In: IEEE SoutheastCon. Nashville, pp 124–130

    30. Pham L (2002) Edge-adaptive Clustering for Unsupervised Image Segmentation. In: International Conference on Image Processing. Vancouver, pp 816–819

    31. Trucco E, Verri A (1998) Introductory Techniques for Computer Vision. New Jersey, Prentice Hall


    32. Zheng K, Zhu Q, Zhuang Y, Pan Y (2001) Motion Processing in Tight-Clothing Based Motion Capture. In: Robot Vision. Auckland, pp 1–5

    33. ZuWhan K (2001) Multi-View 3-D Object Description With Uncertain Reasoning and Machine Learning. PhD Thesis, Faculty of the graduate school

    34. Elec2go (2006) Elec2go. Retrieved July 30-update, 2006. Available at www.elec2go.com.au/index.htm

    35. Vanier L, Kaczmarski H, Chong L, Blackburn B, Williams M, Velder A (2003) Connecting the Dots: The Dissection of a Live Optical Motion Capture Animation Dance Performance. Available at www.isl.uiuc.edu/Publications/nal20dance1.pdf

    36. Furnee E (1988) Speed, Precision and Resolution of a TV-Based Motion Analysis Computer. In: 10th IEEE Engineering in Medicine and Biology Society. p 656

    37. Knaungo T, Netanyahu N, Wu A (2002) An Efficient K-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. pp 881–892

    38. Whitten I, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques, second edition. San Francisco, Morgan Kaufmann Publishers

    39. Jain K, Dubes E (1988) Algorithms for Clustering Data. Prentice Hall, New Jersey

    40. Jain A, Murty M, Flynn P J (1999) Data Clustering: A Review. In: ACM Computing Surveys 31: 264–323

    41. Kaufman L, Rosseeuw P (1990) Finding Groups in Data, an Introduction to Cluster Analysis. New York, Wiley

    42. Hasegawa S, Imai H, Inaba M, Katoh N, Nakano J (1993) Efficient Algorithms for Variance Based Clustering. In: 1st Pacific Conference on Computer Graphics Applications. Seoul, pp 75–89

    43. Abche A, Tzanakos G, Tzanakou E (1992) A new Method for Multimodal 3D Image Registration with External Markers. In: Medicine and Biology Society 14: 1881–1882

    44. Bacao F, Lobo V, Painho M (2005) Self-organizing Maps as Substitutes for K-Means Clustering. Berlin Heidelberg New York, Springer, pp 476–483

    45. Chimphlee W, Abdullah A, Sap M, Chimphlee S, Srinoy S (2005) Unsupervised Clustering Methods for Identifying Rare Events in Anomaly Detection. In: Transactions on Engineering, Computing and Technology 8: 253–258

    46. Milligan G W (1980) An Examination of the Effects of Six Types of Error Perturbation of Fifteen Clustering Algorithms. In: Psychometrika 45: 325–342

    47. Su T, Dy J (2004) A Deterministic Method for Initialising K-means Clustering. In: 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004). pp 784–786

    48. DirectXtras.Inc (2003) DirectExtras. Retrieved Apr 27, 2006. Available at www.asu.edu

  • Toward Effective Processing of Information Graphics in Multimodal Documents: A Bayesian Network Approach

    Sandra Carberry¹ and Stephanie Elzer²

    ¹ Department of Computer Science, University of Delaware, Newark, DE, carberry@cis.udel.edu

    ² Department of Computer Science, Millersville University, Pennsylvania, PA, elzer@cs.millersville.edu

    Summary. Information graphics (non-pictorial graphics such as bar charts and line graphs) are an important component of multimodal documents. When information graphics appear in popular media, such as newspapers and magazines, they generally have a message that they are intended to convey. This chapter addresses the problem of understanding such information graphics. The chapter presents a corpus study that shows the importance of taking information graphics into account when processing a multimodal document. It then presents a Bayesian network approach to identifying the message conveyed by one kind of information graphic, simple bar charts, along with an evaluation of the graph understanding system. This work is the first (1) to demonstrate the necessity of understanding information graphics and taking their communicative goal into account when processing a multimodal document and (2) to develop a computational strategy for recognizing the communicative goal or intended message of an information graphic.

    1 Introduction

    Most documents are multimodal; that is, they consist of both text and graphics. However, document processing research, including work on the summarization, storage, and retrieval of documents, as well as automated question-answering, has focused almost entirely on an article's text; information graphics, such as bar charts and line graphs, have been ignored. We contend that information graphics play an important communicative role in multimodal documents, and that they must be taken into account in summarizing and indexing the document, in answering questions from stored documents, and in providing alternative access to multimodal documents for individuals with sight impairments.

    This chapter has two objectives: (1) to demonstrate, via corpus studies, the necessity of understanding information graphics and taking their communicative goal into account when processing a multimodal document, and (2) to present a computational strategy for recognizing the communicative goal or intended message of one class of information graphics: simple bar charts. Our work is the first to produce a system for understanding information graphics that have an intended message (as opposed to graphics that are only intended to present data). Since the message identified by our system can serve as a brief summary of an information graphic, our research provides the basis for taking information graphics into account when processing multimodal documents.

    S. Carberry and S. Elzer: Toward Effective Processing of Information Graphics in Multimodal Documents: A Bayesian Network Approach, Studies in Computational Intelligence (SCI) 96, 191–212 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    The chapter is organized as follows. Section 2 relates our work to other research efforts. Section 3 discusses the importance of information graphics and presents the aforementioned corpus studies, along with two important applications that require analyzing and understanding information graphics. Section 4 presents an overview of our Bayesian network approach for recognizing the message conveyed by a simple bar chart, along with evaluation experiments that demonstrate the system's effectiveness, and Sect. 5 discusses problems that must be addressed to handle the full range of multimodal documents. Although our computational strategy for recognizing the intended message of an information graphic is currently limited to simple bar charts, we believe that the general approach is extendible to other kinds of information graphics.

    2 Related Work

    Researchers have investigated the generation of information graphics and their captions in multimodal documents [10, 19, 22]. In graphics generation, the system is given a communicative goal and must construct a graphic that achieves that goal. For example, the AutoBrief system [19] identifies the perceptual and cognitive tasks that a graphic must support and uses a constraint satisfaction algorithm to design a graphic that facilitates these tasks as much as possible, subject to the constraints of competing tasks. In this context, perceptual tasks are ones that can be accomplished by viewing a graphic, such as comparing two bars in a bar chart to determine which is taller; cognitive tasks are ones that require a mental computation, such as interpolating between the values assigned to two tick marks on the dependent axis in order to compute the exact value for the top of a bar in a bar chart. Our problem is the reverse of graphics generation: we are given a graphic and must extract the communicative signals present in the graphic and use them to reason backwards about the graphic's intended message.

    Yu, Hunter, Reiter, and Sripada [40, 41] used pattern recognition techniques to summarize interesting features of time series data from a gas turbine engine. However, the graphs were automatically generated displays of the data points and did not have any intended message. Futrelle and Nikolakis [17] developed a constraint grammar for parsing vector-based visual displays, and Futrelle is extending this work to construct a graphic that is a simpler form of one or more graphics in a document [16]. However, the end result is itself a graphic, not a representation of the graphic's intended message. Our work is the first to address the understanding of an information graphic, with the goal of processing multimodal documents.

    Much effort has been devoted to the processing of images. Bradshaw [3] notes that work on image retrieval has progressed from systems that retrieved images based on low-level features such as color, texture, and shape to systems which attempt to classify and reason about the semantics of the images being processed. This includes systems that attempt to classify images according to attributes such as indoor/outdoor, city/landscape, and man-made/artificial. Srihari, Zhang, and Rao [35] examined text-based indexing techniques for the caption and any collateral (accompanying) text, combined with image-based techniques. Their work demonstrated the ineffectiveness of text-based methods alone: they provide the example of a search for pictures of Clinton and Gore, which produced a final set of 547 images, of which manual inspection showed that only 76 actually contained pictures of Clinton or Gore. Their work also demonstrates, however, that when combined with image-based retrieval techniques, the collateral text can provide a rich source of evidence for improving the information retrieval process. Nevertheless, image retrieval work is quite different from our research, in that image retrieval is concerned with the semantics of images, such as "President Bush at the White House" or "an elephant on the plains of Africa", whereas we are concerned with recognizing the communicative goal or intended message of an information graphic.

    3 The Importance of Understanding Information Graphics

    Information graphics are non-pictorial graphics, such as bar charts and line graphs, that display attributes of entities and relations among entities. Although some information graphics are only intended to display data [40, 41], the overwhelming majority of information graphics that appear in popular media, such as newspapers, magazines, and reports, have a message that they are intended to convey. For example, the information graphic in Fig. 1 ostensibly is intended to convey the changing trend in optimism by small businesses. Clark [8] has argued that language consists of any deliberate signal that is intended to convey a message. Under this definition, language includes not only text and utterances, but also hand signals, facial expressions, and even information graphics. Thus, we view information graphics as a form of language with a communicative goal.

    Fig. 1. A simple bar chart from Business Week (horizontal axis labels: 02, 03, 04, 05, 06)

    3.1 Can Information Graphics be Ignored?

    The question arises as to whether information graphics repeat portions of the textual content of a multimodal document and thus can be ignored. Consider the information graphic in Fig. 1. It appeared in a short (1/2 page) Business Week article entitled "Upstarts Plan to Keep On Spending." Although the graphic's message is that there is a changing trend (from falling to rising) in the number of small business companies optimistic about the US economy, this message is not captured by the article's text. The only part of the accompanying article that comes close to the graphic's message is the following paragraph:

    A PricewaterhouseCoopers first-quarter survey, which ran from late February to May, showed 76% of the fast-growing small businesses, averaging an annual growth rate of about 25%, said they were optimistic about the US economy for the coming year.

    But nowhere in the article is the current optimism contrasted with the situation a few years earlier. Moreover, the article contrasts the PricewaterhouseCoopers survey with a survey by the National Federation of Independent Business (NFIB); the changing trend in the graphic, although not mentioned in the article's text, is relevant to reconciling the differences in the two surveys. We observed that the same phenomenon occurred even with more complex graphics and longer articles. For example, consider the two graphics in Fig. 2¹ that appeared in a 1999 Business Week article that was six pages in length, of which approximately four pages were text; the article was entitled "A Small Town Reveals America's Digital Divide." Both graphics are grouped bar charts. The message of the leftmost graphic is twofold: at all income levels, rural areas lag behind urban areas in terms of US households with Internet access, and the percent of US households with Internet access increases with income level. The message of the rightmost graphic is similar: at all education levels, rural areas lag behind urban areas in terms of US households with Internet access, and the percent of US households with Internet access increases with education level.

    ¹ The composite graphic contained three grouped bar charts. For reasons of space, only two are displayed here. The omitted graphic was a simple (not grouped) bar chart addressing the relationship between race and Internet access.

    Fig. 2. Two grouped bar charts from Business Week (left chart: income brackets from $10,000–14,999 to $75,000 plus; right chart: education levels up to B.A. or more)

    Although this article explicitly refers to these graphics with the reference "(chart, page 191)", the text still fails to capture the graphics' messages. The segments of the accompanying article that come closest to the graphics' messages are the following:

    Blacksburg² reinforces fears that society faces a digital divide of enormous breadth (chart, page 191). Blacksburg is the most wired town in the nation. Over the span of only five years, more than 85% of its 86,000 residents, including 24,000 students at Virginia Tech, have gone online, far above the 32.7% national average. By contrast, in the region surrounding Blacksburg, only some 14% are connected to the Net.

    In Christiansburg³, nearly one-third of adults have no high school diploma and only 17% have college degrees vs. 61% in Blacksburg. Price⁴ frequently gets frustrated at the second-class connectivity he has as a result of where he lives, the family's income, and his lack of computer skills.

    But none of these text segments conveys the relationship between income/education and Internet access or that Internet access in rural areas lags behind that in urban areas even for households with the same income and education level, both of which are captured by the graphics. Furthermore, the reader is expected to connect the much lower income and education levels in rural areas (conveyed by the text) with the correlations between income/education and Internet access (conveyed by the information graphics), and make the inference that a much lower percentage of rural residents have Internet access than urban residents. This conclusion is central to the article's overall purpose of conveying the digital divide between rural and urban America.

    ² Blacksburg is an urban area in Virginia and is the location of Virginia Tech, a large university.

    ³ Christiansburg is a rural town in Virginia.

    ⁴ Price is the last name of a rural Virginia resident interviewed for the article.

    Fig. 3. A line graph from USA Today (graphic title: "Worth a million: Number of resale condos sold for more than $1 million in Florida"; horizontal axis: quarters, with 2003 and 2004 labeled)

    Newspapers, as well as magazines, often rely on the reader to integrate the messages of an article's information graphics into what is conveyed by the text. For example, Fig. 3 displays an information graphic taken from a 2005 USA Today article entitled "Miami condo market sizzling." The graphic's message is ostensibly that the rising trend in the number of Florida resale condos sold for more than a million dollars has risen even more sharply between 2004 and 2005. But nowhere does the article talk about the price of resale condos. The text segment closest to the graphic's message only addresses the price of new condos:

    In Miami Beach and other communities, one-bedroom units in new oceanfront projects start at close to $500,000 and run into the millions.

    Yet once again, the reader must recognize the message of the information graphic and integrate it with the communicative goals of the article's text in order to fully understand the article.

    These observations lead to the hypothesis that information graphics cannot be ignored in processing multimodal documents. We conducted a corpus study to determine the extent to which the intended message of an information graphic in popular media is repeated in the article's text. We examined 100 randomly selected graphics from a variety of magazines, such as Newsweek, Time, Fortune, and Business Week, and from both local and national newspapers; the corpus of graphics included simple bar charts, grouped bar charts, line graphs, multiple line graphs, and a few pie charts, and the accompanying articles ranged from very short (less than half a page) to long (more than 2 magazine length pages). We identified the text segments most closely related to each graphic's message and placed the graphic in one of four categories, depending on the extent to which the graphic's message was captured by the article's text, as shown in Fig. 4. In 39% of the instances in our corpus (Categories A and B), the text was judged to fully or mostly convey the message of the information graphic. In the remaining 61% of the graphics (Categories C and D), the text was judged to convey little or none of the graphic's message. Thus, since information graphics in popular media do not just repeat portions of an article's text, they cannot be ignored in processing multimodal documents.

    Fig. 4. How often is a graphic's message repeated in the accompanying article? (Legend: Category A: article's text fully conveys the graphic's message; Category B: article's text mostly conveys the graphic's message; Category C: article's text conveys a little of the graphic's message; Category D: article's text conveys none of the graphic's message)

    It is interesting to contrast the use of information graphics in popular media with their use in scientific articles. The text of a scientific article generally explicitly refers to each information graphic and summarizes its message. For example, the above paragraph explicitly referred to Fig. 4 and summarized its contribution, namely that the message of an information graphic appearing in popular media is often not repeated in the article's text. However, in popular media, explicit references to information graphics are not the norm; neither of the graphics in Fig. 1 or 3 was explicitly referenced in its accompanying article. And as illustrated by the graphics in Fig. 2, even when the article refers to the graphic, it might not summarize the graphic's message.

    3.2 How Useful are Naturally Occurring Captions?

    Given that information graphics in a multimodal document cannot be ignored, perhaps the graphic's caption can be relied on to capture the graphic's intended message. Unfortunately, captions are of limited utility in automating the understanding of information graphics. In conjunction with their work on generating information graphics, Corio and Lapalme [10] analyzed the captions on information graphics in order to devise rules for generating them.

  • 198 S. Carberry and S. Elzer

    [Pie chart with segments for Categories A to D; visible segment labels: 7%, 44%]

    Category A: Caption captures intention (mostly)

    Category B: Caption captures intention (somewhat)

    Category C: Caption hints at intention

    Category D: Caption makes no contribution to intention

    Fig. 5. Does a graphic's caption capture its intended message?

    However, they found that captions are often very general. We conducted our own corpus study with two objectives:

    1. To identify the extent to which a graphic's caption captures the graphic's intended message

    2. To determine whether a general-purpose natural language system would encounter any problems in parsing and understanding captions

    We compared the intended message⁵ of 100 bar charts with the graphic's caption. Each graphic was placed into one of four categories, as shown in Fig. 5. In slightly more than half the instances (Categories C and D), the graphic's caption either made no contribution to understanding the graphic's message or only hinted at it. For example, a caption might be very general and uninformative about a graphic's message, such as the caption "Delaware bankruptcies" that appeared on an information graphic in a local Delaware newspaper conveying that there was a sharp rise in Delaware bankruptcies in 2001 in contrast with the decreasing trend from 1998 to 2000, or a caption might only hint at a graphic's message, as is the case for the caption on the graphic in Fig. 1.

    Next we examined the 56 captions in Categories A, B, and C (those that at least made some contribution to understanding the graphic's message) to identify how easily they could be parsed and understood by a general-purpose natural language system. Unfortunately, we found that captions are often sentence fragments or contain some other kind of ill-formedness. For example, the caption "Small Businesses: Still Upbeat" on the graphic in Fig. 1 is a sentence fragment, as is the overall caption "Wired America: White, Urban, and College-Educated" on the graphic in Fig. 2. Furthermore, many captions were designed to be cute or humorous, such as the Category C caption "Bad Moon Rising" on a graphic that conveyed an increasing trend in delinquent debts. Interpretation of such captions would require extensive analogical reasoning that is beyond the capability of current natural language systems.

    5 The intended message had previously been annotated by two coders.

  • Processing Information Graphics 199

    3.3 Applications of Graphic Understanding

    Although many research efforts have investigated the summarization of textual documents ([20,23,24,27–29,34,38,39] are a few examples), little attention has been given to graphics in multimodal documents. Yet with the advent of digital libraries, the need for intelligent summarization, indexing, and retrieval of multimodal documents has become apparent [25,35].

    To our knowledge, our work is the only research effort that has begun to address the issue of taking the messages conveyed by information graphics into account when processing and summarizing a multimodal document. Yet as our corpus analysis has shown, information graphics cannot be ignored. We contend that the core message of an information graphic can serve as a basis for incorporating the graphic into an overall summary of a multimodal document, thereby producing a richer summary that captures more of the document's content.

    Individuals who are blind face great difficulty when presented with multimodal documents. Although screen readers such as JAWS can read the text to the user via speech, graphics pose serious problems. W3C accessibility guidelines recommend that web designers provide textual equivalents for all graphics and images [36]; however, such alt text is generally omitted or poorly constructed. The WebInSight project [2] seeks to address this issue for the broad class of images on the web by utilizing a combination of optical character recognition, web context labeling, and human labeling to produce alt text. However, given that a large proportion of information graphics lack helpful captions, this approach will not suffice. Researchers have devised systems that convey information graphics in alternative mediums such as sound, tactile, or haptic representations [1,9,26,32,33,37]. However, these approaches have significant limitations, such as requiring expensive equipment or requiring that the user construct a mental map of the graphic, something that Kennel observed is very difficult for users who are congenitally blind [21].

    We are taking a very different approach. Instead of attempting to translate the graphic into another modality, we hypothesize that the user should be provided with the knowledge that would be gleaned from viewing the graphic. Thus we have designed a natural language system [15] that provides access to multimodal documents by (1) identifying the message conveyed by the document's information graphics (currently limited to simple bar charts), and (2) using a screen reader to read the text to the user and to convey the messages of the document's information graphics. This system will eventually include an interactive dialogue capability in which the system responds to follow-up questions from users for further detail about the graphic. Our approach has a number of advantages, including not requiring expensive equipment and placing relatively little cognitive load on the user.


    4 A Graph Understanding System

    As a first step toward processing multimodal documents, we have developed a Bayesian system for identifying the intended message of a simple (not grouped or stacked) bar chart such as the graphic in Fig. 1. Simple bar charts provide a rich domain for graph understanding. They can convey a variety of different kinds of messages, such as trends, a contrast between a point in the graphic and a trend, a comparison between entities in a graphic, and the rank of an entity in the graphic. In addition, a variety of mechanisms are used by graphic designers to aid the user in recognizing the intended message of a bar chart; such mechanisms include coloring a bar differently from other bars in the graphic, mentioning a bar's label in the caption, and graphic design choices that make some perceptual tasks easier than others. Figure 6 gives an overview of our algorithm for processing a simple bar chart, and the steps of the algorithm are described in more detail in the following sections. Although our work has thus far been limited to simple bar charts, we believe that our methodology is extendible to other kinds of information graphics.

    4.1 Input to the Graph Understanding System

    Input to the graph understanding system is an XML representation of the graphic that is produced by a Visual Extraction Module (VEM) [7]. It specifies the graphic's components, including its axes, the location and heights of bars, the bar labels, their colors, the caption, etc. Although the VEM must process a raw image, the task is much more constrained, and thus much easier, than most image recognition problems. Currently, the VEM can handle electronic images of simple bar charts that are clearly drawn in a fixed set of fonts and with standard placement of labels and captions. For example, the VEM could not produce XML for the graphic in Fig. 1 since the text "Companies optimistic about US Economy" appears within the bars rather than above the graphic or on the dependent axis. Current work is removing these limitations. If the independent axis of the bar chart represents an ordinal attribute such as years or ages, a preprocessing phase uses a set of heuristics to divide the bars into consecutive segments that might represent possible trends and adds the best division, along with any salient divisions, to the XML representation of the graphic. Further detail on this preprocessing can be found in [12].
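    As a rough illustration of what consuming such an XML representation might look like, the sketch below parses a small bar chart description with Python's standard library. The element and attribute names (barchart, caption, bar, label, height, color, annotated) and the sample values are assumptions made up for this example; the chapter does not give the schema actually produced by the VEM.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element names, attribute names, and values are
# illustrative assumptions, not the VEM's actual output format.
SAMPLE_XML = """
<barchart>
  <caption>Sample caption</caption>
  <bar label="A" height="3.0" color="gray" annotated="false"/>
  <bar label="B" height="5.5" color="red" annotated="true"/>
  <bar label="C" height="4.0" color="gray" annotated="false"/>
</barchart>
"""


def load_bars(xml_text):
    """Return (caption, list of bar dicts) from the assumed XML layout."""
    root = ET.fromstring(xml_text.strip())
    caption = root.findtext("caption", default="")
    bars = []
    for bar in root.iter("bar"):
        bars.append({
            "label": bar.get("label"),
            "height": float(bar.get("height")),
            "color": bar.get("color"),
            "annotated": bar.get("annotated") == "true",
        })
    return caption, bars


caption, bars = load_bars(SAMPLE_XML)
print(caption, [b["label"] for b in bars])
```

    A flat list of bar records like this is enough for the later evidence checks (highlighting, annotation, height), which only need per-bar attributes and the caption text.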

    4.2 A Bayesian Network for Graph Understanding

    To generate multimodal documents, the AutoBrief project [22] first identified which communicative goals would be achieved via text and which via graphics. During the first phase of graphics generation, media-independent communicative goals were mapped to perceptual and cognitive tasks that the graphics should support. For example, if the goal is for the viewer to believe that Company A had the highest profits of a set of companies, then it would


    Input: electronic image of simple bar chart
    Output: logical representation of the bar chart's message

    1. Construct XML representation of the bar chart's components (done by Visual Extraction Module: Sect. 4.1)

    2. If independent axis represents an ordinal attribute, augment XML representation with division of bars into sequential subsegments representing possible trends (Sect. 4.1)

    3. Augment XML representation to indicate the presence of a verb in one of the identified verb classes (done by Caption Processing Module: Sect. 4.3)

    4. Augment XML representation to indicate the presence of a noun in the caption that matches a bar label (done by Caption Processing Module: Sect. 4.3)

    5. Construct the non-leaf nodes of the Bayesian network by chaining between goals and their constituent subgoals (Sect. 4.2)

    6. Add conditional probability tables for each child node in the Bayesian network, as pre-computed from a corpus of bar charts (Sect. 4.4)

    7. Add evidence nodes to each perceptual task node in the Bayesian network, reflecting evidence about whether that perceptual task is part of the plan that the viewer is intended to pursue in identifying the graphic's message (Sect. 4.3)
       A. Add evidence capturing highlighting of the bars that are parameters of the perceptual task
       B. Add evidence capturing annotation of the bars that are parameters of the perceptual task
       C. Add evidence capturing the presence in the caption of nouns matching the labels of bars that are parameters of the perceptual task
       D. Add evidence capturing whether a bar that is a parameter of the perceptual task stands out by being unusually tall with respect to other bars in the bar chart
       E. Add evidence capturing whether a bar that is a parameter of the perceptual task is associated with the most recent date on a time line
       F. Add evidence about the relative effort required for the perceptual task

    8. Add evidence nodes to the top-level node in the Bayesian network capturing whether one of the identified verb or adjective classes is present in the caption (Sect. 4.3)

    9. Add conditional probability tables for each evidence node, as pre-computed from a corpus of bar charts (Sect. 4.4)

    10. Propagate the evidence through the Bayesian network

    11. Select the message hypothesis with the highest associated probability

    Fig. 6. Graph understanding algorithm

    be desirable to design a graphic that facilitates the tasks of comparing the profits of all the companies, locating the maximum profit, and identifying the company associated with the maximum. In the second phase of graphics generation, a constraint satisfaction algorithm was used to design a graphic that facilitated these tasks to the best extent possible.


    We view information graphics as a form of language, and take a plan recognition approach to recognizing the intended message of an information graphic. Plan recognition has been used extensively in understanding utterances and recognizing their intended meaning [4, 5, 31]. To understand information graphics, we reason in the opposite direction from AutoBrief: given an information graphic, we extract the communicative signals present in the graphic as a result of choices made by the graphic designer, and we use these to recognize the plan that the graphic designer intends for the viewer to perform in deciphering the graphic's intended message. The top-level goal of this plan captures the graphic designer's primary communicative goal, namely the message that the graphic is intended to convey.

    Following the approach introduced by Charniak and Goldman [6] for language understanding, we capture plan recognition in a probabilistic framework. The top level of our Bayesian network represents the twelve categories of messages that we have observed for simple bar charts, such as conveying a trend (rising, falling, or stable), contrasting a point with a trend, conveying the rank of an entity, comparing two entities, etc. The next level of the Bayesian network captures the possible instantiations of each of these message categories for the graphic being analyzed. For example, if a bar chart has six bars, the parameter of the Get-Rank message category could be instantiated with the labels of any of the six bars. Lower levels in the Bayesian network represent decompositions of the communicative goal represented by the parent node into more specific subgoals and eventually into primitive perceptual and cognitive tasks that the viewer would be expected to perform. For example, getting the rank of a bar can be accomplished either by getting a bar's rank given its label (perhaps the bar's label was mentioned in the caption, thereby making it salient to the viewer) or by getting a bar's rank starting with the bar (perhaps the bar has been highlighted to draw attention to it in the graphic). Getting a bar's rank given its label l_x can be further decomposed into three perceptual tasks:

    1. Perceive-bar: perceive the bar b_x whose label is l_x
    2. Perceive-If-Sorted: perceive whether the bars appear in sorted order in the bar chart
    3. Perceive-Rank: perceive the rank of bar b_x in the bar chart. (This task is much easier if the bars are in sorted order, as will be discussed in Sect. 4.3.)

    Given an information graphic, our system constructs the Bayesian network for it using the Netica [30] software for building and reasoning with Bayesian networks.
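    The instantiation and decomposition described above can be pictured with a small sketch. It only enumerates node names as plain tuples; it does not build a Bayesian network and does not use Netica. The function names and the tuple representation are assumptions introduced for illustration.

```python
def get_rank_hypotheses(bar_labels):
    """One Get-Rank instantiation per bar label (second network level)."""
    return [("Get-Rank", label) for label in bar_labels]


def decompose_rank_given_label(label):
    """Perceptual tasks for getting a bar's rank given its label,
    following the three-task decomposition listed in the text."""
    return [
        ("Perceive-bar", label),
        ("Perceive-If-Sorted",),
        ("Perceive-Rank", label),
    ]


# A six-bar chart yields six candidate Get-Rank messages.
labels = ["A", "B", "C", "D", "E", "F"]
for hypothesis in get_rank_hypotheses(labels):
    print(hypothesis, "->", decompose_rank_given_label(hypothesis[1]))
```

    In the real system each of these names would correspond to a node with a conditional probability table; the sketch stops at the structure.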

    4.3 Entering Evidence into the Bayesian Network

    In order to reason about the graphic's most likely high-level communicative goal and thereby recognize the graphic's intended message, evidence from the graphic must be entered into the Bayesian network. The evidence takes


    the form of communicative signals present in the graphic, both as a result of design choices made by the graphic designer and mutual beliefs of the designer and viewer about what the viewer will be interested in. These communicative signals are multimodal in the sense that some are visual signals in the graphic itself and some take the form of words in the caption assigned to the graphic.

    Our first set of communicative signals results from explicit actions on the part of the graphic designer that draw attention to an entity in the graphic. These include highlighting a bar by coloring it differently from other bars in the bar chart, annotating a bar with its value or a special symbol, and mentioning the bar's label in the caption. The XML representation of the graphic contains each bar's color and any annotations, so identifying bars that are salient due to highlighting or annotation is easy. Our Caption Processing Module [13] uses a part-of-speech tagger to extract nouns from the caption and match them against the bar labels, thereby identifying any bars that are salient by virtue of being mentioned in the caption.
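    A simplified version of the caption-to-label matching can be sketched as follows. Note the substitution: the Caption Processing Module uses a part-of-speech tagger to extract nouns, whereas this sketch just compares lowercase word tokens, which is an assumption made to keep the example self-contained. The function name is hypothetical.

```python
import re


def bars_mentioned_in_caption(caption, bar_labels):
    """Return the bar labels whose words all occur in the caption.

    Plain token matching stands in for the noun extraction that a
    part-of-speech tagger would provide.
    """
    caption_words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    mentioned = []
    for label in bar_labels:
        label_words = re.findall(r"[a-z0-9]+", label.lower())
        if label_words and all(w in caption_words for w in label_words):
            mentioned.append(label)
    return mentioned


print(bars_mentioned_in_caption(
    "United States Productivity",
    ["Norway", "United States", "Britain"],
))
```

    Token matching would also fire on non-noun words that happen to equal a label, which is one reason a tagger-based approach is preferable in practice.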

    Our second set of communicative signals takes into account presumed mutual beliefs by the graphic designer and the viewer about entities that will draw the viewer's attention. Thus any bars that are much taller than other bars in the bar chart or a bar associated with the most recent date on a timeline are noted as salient entities, since viewers will presumably notice a bar that differs significantly in height from the other bars and will be most interested in recent events.
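    The text does not state how "much taller" is decided, so the sketch below uses an invented criterion purely for illustration: a bar counts as salient if its height is at least 1.5 times the mean height of the other bars. Both the ratio and the function name are assumptions, not the system's rule.

```python
def unusually_tall_bars(heights, ratio=1.5):
    """Indices of bars at least `ratio` times the mean of the other bars.

    The threshold is a made-up stand-in for the system's actual test.
    """
    salient = []
    for i, height in enumerate(heights):
        others = heights[:i] + heights[i + 1:]
        if others and height >= ratio * (sum(others) / len(others)):
            salient.append(i)
    return salient


print(unusually_tall_bars([2.0, 2.2, 6.0, 1.9]))
```

    Comparing each bar against the mean of the others, rather than the overall mean, keeps a single very tall bar from raising its own threshold.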

    Our third set of communicative signals is the relative difficulty of different perceptual tasks in the graphic. The design of a graphic can make some perceptual tasks easier than others. For example, it is much easier to identify the taller of two bars in a bar chart if the two bars are located adjacent to one another and are significantly different in height than if they are interspersed with other bars and their heights are similar. We have adopted the AutoBrief hypothesis [22] that graphic designers construct a graphic that facilitates as much as possible the most important perceptual tasks for achieving the graphic's communicative goal. Thus the relative difficulty of different perceptual tasks serves as a communicative signal about which tasks the viewer was intended to perform in deciphering the graphic's message.

    To extract this communicative signal from a bar chart, we constructed a set of effort estimation rules that compute the effort required for a variety of perceptual tasks that might be performed on a given graphic. Each rule represents a perceptual task and consists of a set of condition-computation pairs. Each condition part of a rule captures characteristics of the graphic that must apply in order for its associated computation to be applicable. For example, consider the bar chart displayed in Fig. 7. It illustrates three conditions that might hold in a bar chart: (1) a bar might be explicitly annotated with its value, as is the case for the bar labelled Norway; (2) a bar might not be annotated with its value, but the top of the bar might be aligned with a labelled tick mark on the dependent axis, as is the case for the bar labelled Denmark; or (3) determining the bar's value might require interpolation between the


    Fig. 7. A bar chart illustrating different amounts of perceptual effort

    values of two labelled tick marks on the dependent axis, as is the case for the bar labelled Britain. Our rule for estimating the effort required to determine the value associated with the top of a bar captures each of these different conditions, listed in order of increasing effort required to perform the task, and specifies the computation to apply when the condition is satisfied; the associated effort computations are based on research by cognitive psychologists. Our effort estimation rules were validated by eye-tracking experiments and are presented in [14].
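    A rule made of ordered condition-computation pairs can be sketched as a list that is scanned until the first condition holds. The effort values 1.0, 2.0, and 3.0 below are placeholders chosen only to preserve the ordering described in the text; the actual computations are given in [14] and are not reproduced here. The dictionary keys and the function name are likewise assumptions.

```python
# Each pair is (condition, computation); pairs are ordered by increasing effort.
# The numeric effort values are placeholders, not the published estimates.
VALUE_OF_BAR_TOP_RULE = [
    (lambda bar: bar["annotated"], lambda bar: 1.0),
    (lambda bar: bar["aligned_with_tick"], lambda bar: 2.0),
    (lambda bar: True, lambda bar: 3.0),  # fallback: interpolate between ticks
]


def estimate_effort(rule, bar):
    """Apply the computation of the first condition that the bar satisfies."""
    for condition, computation in rule:
        if condition(bar):
            return computation(bar)
    raise ValueError("no condition applied")


norway = {"annotated": True, "aligned_with_tick": False}
denmark = {"annotated": False, "aligned_with_tick": True}
britain = {"annotated": False, "aligned_with_tick": False}
for name, bar in [("Norway", norway), ("Denmark", denmark), ("Britain", britain)]:
    print(name, estimate_effort(VALUE_OF_BAR_TOP_RULE, bar))
```

    Keeping conditions and computations as separate callables mirrors the pairing in the text and lets a computation depend on properties of the specific bar.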

    The above communicative signals provide evidence regarding perceptual tasks that the viewer might be intended to perform. Each instance of a perceptual task has instantiated parameters; for example, the perceptual task

    Perceive-Rank(viewer, bar, rank)

    has bar as one of its parameters. If the particular bar instantiating the bar parameter is salient by virtue of a communicative signal in the graphic, then that serves as evidence that the viewer might be intended to perform this particular perceptual task. Similarly, the amount of effort required to perform the Perceive-Rank task also serves as evidence about whether the viewer was really intended to perform the task.⁶ Thus evidence nodes capturing these communicative signals are attached to each primitive perceptual task node in the Bayesian network.

    Our last set of communicative signals is the presence of a verb or adjective in the caption that suggests a particular category of message. For example, although it would be very difficult to extract the graphic's message from the humorous caption "Bad Moon Rising"⁷, the presence of the verb "rising"

    6 Note that if the bars appear in order of height in the bar chart, then the effort required for the Perceive-Rank task will be much lower than if they are ordered differently, such as in alphabetical order of their labels.

    7 This caption appeared on a graphic conveying an increasing trend in delinquent debts.


    Table 1. A sample conditional probability table

    Perceive-Rank(viewer, bar, rank)            InPlan (%)   NotInPlan (%)
    Only bar is annotated                       24.99        2.3
    bar and others are annotated                0.01         0.9
    Only bars other than bar are annotated      0.01         19.5
    No bars are annotated                       74.99        77.3

    suggests the increasing-trend category of message. We identified a set of verbs that might suggest one of our 12 categories of messages and organized them into classes containing similar verbs. For example, one verb class contains verbs such as rise, increase, grow, improve, surge, etc. Our Caption Processing Module identifies the presence in the caption of a verb from one of our verb classes or an adjective (such as "growing" in the caption "A Growing Biotech Market") that is derived from such a verb. Since this kind of communicative signal suggests a particular category of high-level message, verb and adjective evidence nodes are attached to the top-level node in the Bayesian network.
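    As a rough sketch of verb-class detection, the code below checks whether any caption word reduces to a member of the one verb class the text lists (rise, increase, grow, improve, surge). The suffix-stripping step is a crude assumption standing in for real morphological analysis, and the function names are hypothetical; the other verb classes are not enumerated in the text and are therefore omitted.

```python
import re

# The only verb class spelled out in the text.
RISING_VERBS = {"rise", "increase", "grow", "improve", "surge"}


def candidate_stems(word):
    """Crude stemming: strip common suffixes and also try restoring a final 'e'."""
    word = word.lower()
    stems = {word}
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            base = word[: -len(suffix)]
            stems.update({base, base + "e"})
    return stems


def suggests_rising_trend(caption):
    """True if some caption word maps onto the rising-trend verb class."""
    words = re.findall(r"[A-Za-z]+", caption)
    return any(candidate_stems(w) & RISING_VERBS for w in words)


for caption in ["Bad Moon Rising", "A Growing Biotech Market", "Delaware bankruptcies"]:
    print(caption, "->", suggests_rising_trend(caption))
```

    The result would feed a single evidence node on the top-level message node rather than any per-bar perceptual task.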

    4.4 Computing the Probability Tables

    Associated with each child node in a Bayesian network is a conditional probability table that specifies the probability of the child node given the value of a parent node. For our application, the value of the parent node is either that it is, or is not, part of the plan that the viewer is intended to pursue in recognizing the graphic's message. Table 1 displays the conditional probability table for the annotation evidence node attached to the Perceive-Rank node in the Bayesian network. It indicates that if the viewer is intended to perceive the rank of the particular bar that instantiates bar, then the probability is 24.99% that this particular bar is the only bar annotated, and the probability is 74.99% that no bars are annotated in the graphic. Negligible nonzero probabilities (0.01% each) are assigned to situations in which this bar and others are annotated or in which only other bars are annotated. Similarly, the table captures the probability of the bar being annotated given that Perceive-Rank is not part of the plan that the viewer is intended to pursue.
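    To see how a table like Table 1 shifts belief, the sketch below applies Bayes' rule to a single Perceive-Rank node with a single annotation evidence node. The likelihoods are the Table 1 values; the prior of 0.1 is an arbitrary assumption for illustration, and the full system instead propagates evidence through the whole network.

```python
# Table 1 as probabilities: evidence value -> (P(e | InPlan), P(e | NotInPlan)).
CPT_ANNOTATION = {
    "only bar annotated": (0.2499, 0.023),
    "bar and others annotated": (0.0001, 0.009),
    "only other bars annotated": (0.0001, 0.195),
    "no bars annotated": (0.7499, 0.773),
}


def posterior_in_plan(evidence, prior_in_plan):
    """P(InPlan | evidence) for one node via Bayes' rule."""
    p_e_in, p_e_out = CPT_ANNOTATION[evidence]
    numerator = p_e_in * prior_in_plan
    return numerator / (numerator + p_e_out * (1.0 - prior_in_plan))


# With an assumed prior of 0.1, seeing that only this bar is annotated
# raises the belief that Perceive-Rank is in the plan to roughly 0.55.
print(round(posterior_in_plan("only bar annotated", 0.1), 3))
print(posterior_in_plan("only other bars annotated", 0.1))
```

    The "no bars annotated" row has nearly equal likelihoods in both columns, so that observation barely moves the posterior away from the prior.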

    4.5 Examples of Message Recognition

    Consider the graphic displayed in Fig. 8. The graphic's caption is uninformative about the graphic's intended message; it could be attached to a graphic conveying a variety of messages, including the relative rank of different record companies in terms of album sales or a comparison of the sales of two particular record companies. However, our system hypothesizes that the graphic is conveying a changing trend in record album sales, with sales rising from 1998 to 2000 and then falling from 2000 to 2002.







    [Bar chart. Caption: "The sound of sales"; subcaption: "Total albums sold in first quarter"; axis note: "In millions"]

    Fig. 8. A slight variation of a graphic from USA Today


    Fig. 9. A variation of a graphic from US News and World Report (In the original graphic, the bar for the United States was annotated. Here we have highlighted it. We have also placed the dependent axis label alongside the dependent axis, instead of at the top of the graph.)

    Now consider the graphic in Fig. 9. If the bar for the United States is not highlighted, then our system hypothesizes that the graphic is conveying the relative rank of the different countries in terms of GDP per capita. However, when the bar for the United States is colored differently from the other bars, as in Fig. 9, it becomes salient. In this case, our system hypothesizes a different message, namely that the graphic is intended to convey that the United States ranks third in GDP per capita among the countries listed. Similar results would be obtained if the bar for the United States were not highlighted, but a caption such as "United States Productivity" were attached to the graphic, thereby again making the bar for the United States salient by mentioning its label in the caption.


    4.6 Evaluation

    We evaluated the effectiveness of our graph understanding system on a corpus of 110 bar charts whose intended message had been previously annotated by two coders. Since the corpus is small, we used leave-one-out cross validation in which each bar chart was used once as the test graphic and the other 109 bar charts were used to compute the probabilities for the nodes in the Bayesian network. The system was credited with success if its top-rated hypothesis matched the message assigned to the bar chart by the human coders and the probability that the system assigned to its hypothesis exceeded 50%. Overall success was computed as the average of all 110 experiments. Our system's success rate was 79.1%, which far exceeds any baselines such as the frequency of the most prevalent type of message (rising trend at 23.6%). But it should be noted that the system must identify both the category and parameters of the message. For example, the system must not only recognize when a bar chart is conveying the rank of an entity in the graphic but must also identify the specific entity in question.
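    The evaluation protocol above can be outlined generically. In the sketch below, train and predict are hypothetical callables standing in for computing the probability tables and running the network; the success criterion (top hypothesis matches the annotation and its probability exceeds 50%) follows the text.

```python
def leave_one_out_success_rate(charts, train, predict, threshold=0.5):
    """Fraction of charts whose held-out prediction counts as a success.

    `train(charts)` returns a model; `predict(model, chart)` returns
    (message, probability). Both are placeholders for the real system.
    """
    successes = 0
    for i, held_out in enumerate(charts):
        model = train(charts[:i] + charts[i + 1:])
        message, probability = predict(model, held_out)
        if message == held_out["message"] and probability > threshold:
            successes += 1
    return successes / len(charts)


# Toy stand-ins: always predict the most frequent training message.
def toy_train(charts):
    counts = {}
    for chart in charts:
        counts[chart["message"]] = counts.get(chart["message"], 0) + 1
    best = max(counts, key=counts.get)
    return best, counts[best] / len(charts)


def toy_predict(model, chart):
    return model


toy_corpus = [{"message": "rising"}] * 3 + [{"message": "falling"}]
print(leave_one_out_success_rate(toy_corpus, toy_train, toy_predict))
```

    Retraining inside the loop is what keeps the held-out chart from influencing the probabilities used to classify it.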

    Since we are interested in the impact of different communicative signals and their particular modality, we undertook an additional experiment in which we evaluated how each kind of evidence impacted our system's ability to recognize the graphic's message. As a baseline, we used the system's success rate when all evidence nodes are included in the network, which is 79.1%. For each type of evidence, we then computed the system's success rate when that evidence node was disabled in the Bayesian network, and we analyzed the resulting degradation in performance (if any) from the baseline. It should be noted that disabling an evidence source means that we remove the ability of that kind of evidence to contribute to the probabilities in the Bayesian network. This differs from merely failing to record the presence of that kind of evidence in the graphic, since both the presence and absence of a particular communicative signal are evidence.

    We used a one-tailed McNemar test [11,18] for the significance of change in related samples. Our samples are related since we are comparing performance by a baseline system with performance by a system that has been perturbed by omitting an evidence source. Table 2 displays the results for the evidence sources where the performance degradation is significant at the .05 level or better.
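    For readers unfamiliar with the McNemar test, the sketch below computes the chi-square statistic from the two discordant counts: b, the charts the baseline got right and the perturbed system got wrong, and c, the reverse. The continuity-corrected form is an assumption on our part, and the counts b = 10, c = 0 are hypothetical. They happen to give 8.1, the value reported for the first row of Table 2, but the chapter does not report the underlying counts.

```python
import math


def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square statistic (1 degree of freedom)."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)


def one_tailed_p(statistic):
    """Half the upper-tail chi-square probability with 1 degree of freedom.

    For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    return 0.5 * math.erfc(math.sqrt(statistic / 2.0))


stat = mcnemar_statistic(10, 0)  # hypothetical discordant counts
print(stat, one_tailed_p(stat))
```

    Because only discordant pairs enter the statistic, two perturbed systems with the same success rate can yield different statistics, and a lower success rate need not give a larger one.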

    It is interesting to note that the evidence sources that affect performance include signals from both the visual modality (such as highlighting in the graphic and the relative effort of different perceptual tasks) and the textual modality (such as a noun in the caption matching a bar label in the graphic).

    Disabling evidence regarding the mention of a bar label in the caption (referred to as Noun-matching-bar-label in Table 2) caused the greatest degradation in performance. We examined those bar charts where a bar label was referenced in the caption and the intended message was correctly identified by the baseline system with all evidence sources enabled. We found that in ten


    Table 2. Degradation in performance with omission of evidence source

    Baseline: system with all evidence, 79% success rate

    Type of evidence omitted           Success rate (%)   McNemar statistic   p value
    Noun-matching-bar-label evidence   70                 8.100               .005
    Effort evidence                    71                 5.818               .01
    Current-date evidence              72                 6.125 (a)           .01
    Highlighting evidence              74                 3.125               .05
    Salient-height evidence            74                 3.125               .05

    (a) The McNemar test is based on (1) the number correct by System-1 and wrong by System-2, and (2) the number wrong by System-1 and correct by System-2. Thus although a greater difference in success rates usually correlates with greater statistical significance, this is not always the case.

    instances where other evidence made the referenced bar salient (such as highlighting the bar or the bar being significantly taller than other bars in the bar chart), the system with Noun-matching-bar-label evidence disabled was still able to recognize the graphic's intended message. Thus we see that although the absence of one evidence source may cause performance to degrade, this degradation can be mitigated by other compensating evidence sources.

    5 Conclusion and Discussion

    In this chapter, we have demonstrated the importance of information graphics in a multimodal document. We have also shown that a graphic's caption is often very general and uninformative, and therefore cannot be used as a substitute for the graphic. Thus it is essential that information graphics be understood and their intended messages taken into account when processing multimodal documents. Our graph understanding system is a first step toward this goal. It extracts communicative signals from an information graphic and enters them into a Bayesian network that can hypothesize the message conveyed by the graphic. To our knowledge, no other research effort has addressed the problem of inferring the intended message of an information graphic.

    Our implemented system is limited to simple bar charts; we are currently extending our methodology to other kinds of information graphics, such as line graphs and grouped bar charts. The latter are particularly interesting since they often convey two messages, as was seen for the graphics in Fig. 2. We are also investigating the synergy between recognition of a graphic's message and identifying the topic of an article. Our graph understanding system exploits communicative signals in the graphic and its caption. However, if an entity in the graphic is mentioned in the article's text, it becomes salient in the graphic. On the other hand, the graphic can suggest the focus or topic of


    the article. For example, one graphic in our corpus highlights the bar for American Express, and the intended message hypothesized by our system is that the graphic conveys the rank of American Express among the credit card companies listed. Although the article mentions a number of different credit card companies, the focus of the graphic is on American Express and this suggests that the article is about American Express.

    Our system for providing blind individuals with effective access to multimodal documents is being field-tested, and the initial reaction from users is very positive. Currently, only the graphic's intended message is included in the initial summary of the graphic that is presented to the user. Our next step is to identify what additional information (if any) should be included, along with the intended message, in the initial summary. For example, if a bar chart conveys an overall rising trend but one bar deviates from this trend, should this exceptional bar be mentioned in the initial summary of the graphic? Furthermore, should the graphic's initial summary repeat information in the article's text? For example, if it is deemed important to mention the values at the end points of the trend, should this information be repeated in the graphic's initial summary if it is already part of the article's text that is being read to the user?⁸ And finally, we must develop the interactive natural language dialogue capability that will enable the user to ask follow-up questions regarding the graphic.

    The next step in our digital libraries project is to develop a summarization strategy that takes into account both a document's text and the messages conveyed by its information graphics. This will entail determining when the graphic's message is redundant and has already been captured by the text. We must also develop a method for coherently integrating the graphic's message with a summary of the article's text. Given the importance of information graphics in a multimodal document, we believe that our approach will result in a richer and more complete summary, which can then be used to more effectively index and retrieve documents in a digital library.


    This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534948.


    1. James Alty and Dimitrios Rigas. Exploring the use of structured music stimuli to communicate simple diagrams: The role of context. International Journal of Human-Computer Studies, 62(1):21–40, 2005.

    8 This question was raised by Seniz Demir and Kathy McCoy, colleagues on the project.


    2. Jeffrey Bigham, Ryan Kaminsky, Richard Ladner, Oscar Danielsson, and Gordon Hempton. WebInSight: Making web images accessible. In Proceedings of the Eighth International ACM SIGACCESS Conference on Computers and Accessibility, pages 181–188, 2006.

    3. Ben Bradshaw. Semantic based image retrieval: A probabilistic approach. In Proceedings of the 8th ACM International Conference on Multimedia, pages 167–176, 2000.

    4. Sandra Carberry. Plan Recognition in Natural Language Dialogue. ACL-MIT Press Series on Natural Language Processing. MIT, Cambridge, Massachusetts, 1990.

    5. Sandra Carberry. Techniques for plan recognition. User Modeling and User-Adapted Interaction, 11(1–2):31–48, 2001.

    6. Eugene Charniak and Robert Goldman. A Bayesian model of plan recognition. Artificial Intelligence Journal, 64:53–79, 1993.

    7. Daniel Chester and Stephanie Elzer. Getting computers to see information graphics so users do not have to. In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems, pages 660–668, 2005.

    8. Herbert Clark. Using Language. Cambridge University Press, Cambridge, 1996.

    9. Robert F. Cohen, Arthur Meacham, and Joelle Skaff. Teaching graphs to visually impaired students using an active auditory interface. In SIGCSE '06: Proceedings of the 37th SIGCSE Technical Symposium on Computer Science Education, pages 279–282, 2006.

    10. Marc Corio and Guy Lapalme. Generation of texts for information graphics. In Proceedings of the 7th European Workshop on Natural Language Generation EWNLG'99, pages 49–58, 1999.

    11. Wayne Daniel. Applied Nonparametric Statistics. Houghton Mifflin, Boston, 1978.

    12. Stephanie Elzer. A Probabilistic Framework for the Recognition of Intention in Information Graphics. PhD thesis, University of Delaware, Newark, DE 19716, 2006.

    13. Stephanie Elzer, Sandra Carberry, Daniel Chester, Seniz Demir, Nancy Green, Ingrid Zukerman, and Keith Trnka. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 223–230, 2005.

    14. Stephanie Elzer, Nancy Green, Sandra Carberry, and James Hoffman. A model of perceptual task effort for bar charts and its role in recognizing intention. User Modeling and User-Adapted Interaction, 16(1):1–30, 2006.

    15. Stephanie Elzer, Edward Schwartz, Sandra Carberry, Daniel Chester, SenizDemir, and Peng Wu. A browser extension for providing visually impaired usersaccess to the content of bar charts on the web. In International Conference onWeb Information Systems, pages 5966, 2007.

    16. Robert Futrelle. Summarization of diagrams in documents. In I. Maniand M. Maybury, editors, Advances in Automated Text Summarization, pages403421. MIT, Cambridge, 1999.

    17. Robert Futrelle and Nikos Nikolakis. Ecient analysis of complex diagramsusing constraint-based parsing. In Proceedings of the Third International Con-ference on Document Analysis and Recognition, pages 782790, 1995.

    18. Graphpad software. quickcalcs: Online calculators for scientists (2002).http://www.graphpad.com/quickcalcs/McNemarEx.cfm.

  • Processing Information Graphics 211

    19. Nancy Green, Giuseppe Carenini, Stephan Kerpedjiev, Joe Mattis, JohannaMoore, and Steven Roth. Autobrief: An experimental system for the automaticgeneration of briengs in integrated text and graphics. International Journal ofHuman-Computer Studies, 61(1):3270, 2004.

    20. E. Hovy and C.-Y. Lin. Automated text summarization in summarist. In I. Maniand M. Maybury, editors, Advanced in Automatic Text Summarization, pages8194. MIT, Cambridge, 1999.

    21. A. Kennel. Audiograf: A diagram-reader for the blind. In Second Annual ACMConference on Assistive Technologies, pages 5156, 1996.

    22. Stephan Kerpedjiev and Steven Roth. Mapping communicative goals into con-ceptual tasks to generate graphics in discourse. In Proceedings of the Interna-tional Conference on Intelligent User Interfaces, pages 6067, 2000.

    23. Inderjeet Mani and Mark Maybury, editors. Advances in Automatic Text Sum-marization. MIT, Cambridge, 1999.

    24. Daniel Marcu. The rhetorical parsing of unrestricted texts: A surface-basedapproach. Computational Linguistics, 26(3):395448, 2000.

    25. Mark Maybury, editor. Intelligent Multimedia Information Retrieval. MIT,Cambridge, 1997.

    26. David K. McGookin and Stephen A. Brewster. Soundbar: exploiting multipleviews in multimodal graph browsing. In NordiCHI 06: Proceedings of the 4thNordic conference on Human-computer interaction, pages 145154, 2006.

    27. Marie-Francine Moens, Roxana Angheluta, and Jos Dumortier. Generic tech-nologies for single and multi-document summarization. Information Processingand Management, 41(3):569586, 2005.

    28. Jane Morris and Graeme Hirst. Non-classical lexical semantic relations. InProceedings of the HLT Workshop on Computational Lexical Semantics, pages4651, 2004.

    29. Ani Nenkova. Automatic text summarization of newswire: Lessons learned fromthe document understanding conference. In Proceedings of National Conferenceon Articial Intelligence (AAAI), pages 14361441, 2005.

    30. Norsys Software Corp.: Netica, 2005.31. Raymond Perrault and James Allen. A Plan-Based Analysis of Indirect Speech

    Acts. American Journal of Computational Linguistics, 6(34):167182, 1980.32. Rameshsharma Ramloll, Wai Yu, Stephen Brewster, Beate Riedel, Mike Burton,

    and Gisela Dimigen. Constructing sonied haptic line graphs for the blind stu-dent: First steps. In Proceedings of the 4th ACM Conference on Assistive Tech-nologies, pages 1725, 2000.

    33. Martin Rotard, Sven Knodler, and Thomas Ertl. A tactile web browser forthe visually disabled. In HYPERTEXT 05: Proceedings of the sixteenth ACMconference on Hypertext and hypermedia, pages 1522, 2005.

    34. Barry Schiman, Inderjeet Mani, and Kristian Concepcion. Producing bio-graphical summaries: Combining linguistic knowledge with corpus statistics. InProceedings of the 39th Annual Meeting of the Association for ComputationalLinguistics, pages 450457, 2001.

    35. Rohini K. Srihari, Zhongfei Zhang, and Aibing Rao. Intelligent indexing andsemantic retrieval of multimodal documents. Information Retrieval, 2(2):137,2000.

    36. W3c: Web accessibility initiative. http://www.w3c.org/wai/.

  • 212 S. Carberry and S. Elzer

    37. Steven Wall and Stephen Brewster. Tac-tiles: Multimodal pie charts for visuallyimpaired users. In Proceedings of the 4th Nordic Conference on Human-computerInteraction, pages 918, 2006.

    38. Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. Towards an iterativereinforcement approach for simultaneous document summarization and keywordextraction. In Proceedings of the 45th Annual Meeting of the Association forComputational Linguistics, pages 552559, 2007.

    39. Jen-Yuan Yeh, Hao-Ren Ke, Wei-Pang Yang, and I-Heng Meng. Text summa-rization using a trainable summarizer and latent semantic analysis. InformationProcessing and Management, 41(1):7595, 2005.

    40. Jin Yu, Ehud Reiter, Jim Hunter, and Chris Mellish. Choosing the content oftextual summaries of large time-series data sets. Natural Language Engineering,13:2549, 2007.

    41. Jin Yu, Jim Hunter, Ehud Reiter, and Somayajulu Sripada. Recognisingvisual patterns to communicate gas turbine time-series data. In Proceedings of22nd SCAI International Conference on Knowledge-Based Systems and AppliedArticial Intelligence (ES2002), pages 105118, 2002.

  • Fuzzy Audio Similarity Measures Based on Spectrum Histograms and Fluctuation Patterns

    Klaas Bosteels and Etienne E. Kerre

    Fuzziness and Uncertainty Modelling Research Group, Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium
    Klaas.Bosteels@UGent.be, Etienne.Kerre@UGent.be

    Summary. Spectrum histograms and fluctuation patterns are representations of audio fragments. By comparing these representations, we can determine the similarity between the corresponding fragments. Traditionally, this is done using the Euclidean distance. In this chapter, however, we study an alternative approach, namely, comparing the representations by means of fuzzy similarity measures. Once the preliminary notions have been addressed, we present a recently introduced triparametric family of fuzzy similarity measures, together with several constraints on its parameters that warrant certain potentially desirable or useful properties. In particular, we present constraints for several forms of restrictability, which make it possible to reduce the computation time in practical applications. Next, we use some members of this family to construct various audio similarity measures based on spectrum histograms and fluctuation patterns. To conclude, we analyse the performance of the constructed audio similarity measures experimentally.

    1 Introduction

    Portable audio players can store several thousands of songs these days, and online music stores currently offer millions of tracks. This abundance of music drastically increases the need for applications that automatically analyse, retrieve or organize audio files. Measures that are able to express the similarity between two given audio fragments are a fundamental component in many of these applications (e.g. [1–6]). In particular, many computational intelligence methods for organizing and exploring music collections rely on such an audio similarity measure. The SOM-enhanced JukeBox presented in [6], which uses unsupervised neural networks to build geographical maps of music archives, is a noteworthy example.

    Usually, audio similarity measures are constructed using a feature-based approach. The audio fragments are represented by real-valued feature vectors, and the similarity is calculated by comparing these vectors. We consider two types of feature vectors in this chapter: spectrum histograms and fluctuation patterns. So far, the Euclidean distance has always been used for comparing feature vectors of these types. By identifying the feature vectors with fuzzy sets, however, the possibility arises to use fuzzy similarity measures for this task. In this chapter, we investigate this alternative approach.

    K. Bosteels and E.E. Kerre: Fuzzy Audio Similarity Measures Based on Spectrum Histograms and Fluctuation Patterns, Studies in Computational Intelligence (SCI) 96, 213–231 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    2 Related Work and Motivation

    The audio similarity measure introduced by Aucouturier and Pachet in [1], which can be regarded as an improvement of a technique by Logan and Salomon [7], is well known in its field. This measure calculates the similarity between two given audio fragments by comparing mixtures of Gaussian distributions that model the spectral information in the fragments. Mandel and Ellis proposed a simplified version of this approach [8]. They use a single Gaussian to model the spectral information, and compute the distance between two of these Gaussians by means of the symmetric Kullback–Leibler divergence. Calculating the Euclidean distance between the spectrum histograms [4] derived from the audio fragments is an alternative spectral approach that is even easier to implement and compute. Nevertheless, the experimental evaluation in [9] indicates that this approach based on spectrum histograms can outperform the above-mentioned more complex techniques in some cases.

    Fluctuation patterns, which were originally called rhythm patterns [5], contain information that is complementary to spectral characteristics. Therefore, Pampalk combined a spectral audio similarity measure with the Euclidean distance between fluctuation patterns, and further optimized this combination by taking into account some additional information derived from the fluctuation patterns [3]. This led to the audio similarity measure that won the MIREX'06 (Music Information Retrieval Evaluation eXchange 2006) audio-based music similarity and retrieval task.¹

    Hence, both spectrum histograms and fluctuation patterns can be considered to be audio representations that play an important role in the current state of the art. Since the Euclidean distance has always been used to compare these representations so far, employing other approaches for the comparison is an interesting research direction that still needs to be explored. As mentioned in the introduction, we propose fuzzy similarity measures as alternatives for the Euclidean distance in this chapter. This does not add any unwanted complexity because many fuzzy similarity measures are very easy to implement and compute, and fuzzy similarity measures offer the additional advantage of being studied extensively and having very solid theoretical foundations. The main goal of this chapter is to demonstrate that by using fuzzy similarity measures instead of the Euclidean distance for comparing the spectrum histograms or fluctuation patterns, we can obtain a framework for generating theoretically well-founded audio similarity measures that satisfy the specific properties required for a particular application, and that perform at least as well as the corresponding audio similarity measures based on the Euclidean distance.

    1 http://www.music-ir.org/mirex2006.

    3 Preliminaries

    3.1 The Considered Representations of Audio Fragments

    Audio fragments contain a lot of information. Therefore, they are typically reduced to relatively compact real-valued vectors before they are compared. Such a vector is usually called a feature vector, and its individual components are called features. Feature extraction is the process of converting a given audio fragment to the corresponding feature vector. Many types of feature vectors have been suggested in the literature. In this chapter, we restrict ourselves to spectrum histograms and fluctuation patterns. Both are derived from a spectrogram.


    For a given audio segment, the Fourier transform can be used to calculate the amplitude that corresponds with each frequency. By dividing an audio fragment into short consecutive segments and applying the Fourier transform to each of these segments, we get the amplitude for each time-frequency pair. Such a representation of an audio fragment is called a spectrogram. The individual frequencies of a spectrogram are usually consolidated into frequency bands to reduce the computation time. Furthermore, the amplitudes are normally converted to loudness values, i.e., values that are proportional to the perceived intensity of the frequency in question.

    We consider two types of spectrograms in this chapter. For the first type, we use the Mel scale for the frequencies, and the decibel scale for the loudness values. We call spectrograms of this type Mel spectrograms. The scales used for the second type of spectrograms are Bark and sone, instead of Mel and decibel, respectively. We use the term sonogram for a spectrogram of this type. In theory, the sonogram should perform best because its scales correspond better with human perception. An incontestable disadvantage of the sonogram, however, is the fact that it requires significantly more computation time.

    Spectrum Histograms

    Starting from a spectrogram, we can calculate a spectrum histogram (SH) [3, 4] by counting how many times certain loudness levels are reached or exceeded in each frequency band. In this way, we get a simple summarization of the spectral shape of the audio fragment. This summarization is, to some extent, related to the perceived timbre of the audio fragment.

    [Figure 1: two panels, (a) with loudness level on the horizontal axis and (b) with modulation frequency on the horizontal axis; both axes have ticks from 10 to 60.]

    Fig. 1. The sonogram-based SH (a) and FP (b) for a fragment of "Temps pour nous" by Axelle Red. White depicts zero, and black represents the maximum value

    We used two implementations of SHs for this chapter. The first implementation is based on the Mel spectrogram, and the second one on the sonogram. Both implementations are written in Matlab using the MA toolbox [10], and in both cases an SH is a matrix with 30 rows (frequency bands) and 60 columns (loudness levels). Figure 1a shows an example of an SH.
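    The counting step behind an SH can be sketched in a few lines. The following is only a minimal illustration in Python, not the Matlab/MA toolbox implementation used for the chapter; the function name `spectrum_histogram`, the tiny two-band spectrogram and the three loudness levels are all hypothetical.

```python
def spectrum_histogram(spectrogram, levels):
    """Count, per frequency band, how often each loudness level is reached or exceeded.

    spectrogram: list of bands, each a list of loudness values over time.
    levels: loudness thresholds, one per histogram column.
    Returns a matrix with one row per band and one column per level.
    """
    return [
        [sum(1 for value in band if value >= level) for level in levels]
        for band in spectrogram
    ]


# Hypothetical toy input: 2 frequency bands, 4 time frames.
spec = [
    [0.0, 1.5, 3.2, 2.1],  # band 0
    [0.5, 0.4, 0.0, 0.9],  # band 1
]
levels = [1.0, 2.0, 3.0]

sh = spectrum_histogram(spec, levels)
print(sh)  # [[3, 2, 1], [0, 0, 0]]
```

    With 30 bands and 60 levels, the same function would yield a 30 by 60 matrix of the kind described above.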

    Fluctuation Patterns

    By applying the Fourier transform to the subsequent loudness values in each frequency band of a segment of a spectrogram, we obtain the amplitudes that correspond with the loudness modulation frequencies for each frequency band. We get the fluctuation pattern (FP) [3, 5] for an audio fragment by calculating weighted versions of these coefficients for subsequent segments of the spectrogram, and then taking the mean of the values obtained for each segment. Since FPs describe the loudness fluctuations for each frequency band, they are, to some extent, related to the perceived rhythm.

    For implementing FPs, we again used the MA toolbox. Our first implementation derives FPs from the Mel spectrogram, while the second one uses the sonogram. Both implementations generate FPs that are, like the SHs, 30 by 60 matrices in which the rows correspond with frequency bands. In this case, however, the columns represent modulation frequencies (ranging from 0 to 10 Hz). Figure 1b shows an example.
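    The structure of the FP computation (a Fourier transform per band and per segment, followed by a mean over segments) can be sketched as follows. This is a simplified illustration that deliberately omits the weighting step mentioned above, uses a naive DFT from the Python standard library, and assumes non-overlapping segments; the names `dft_magnitudes` and `fluctuation_pattern` and the toy input are hypothetical and do not mirror the MA toolbox.

```python
import cmath


def dft_magnitudes(values, n_bins):
    """Magnitudes of the first n_bins DFT coefficients of a real sequence."""
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values)))
        for k in range(n_bins)
    ]


def fluctuation_pattern(spectrogram, segment_len, n_bins):
    """Mean modulation-frequency magnitudes per band over non-overlapping segments.

    Assumes the spectrogram holds at least segment_len frames per band.
    The perceptual weighting of the coefficients is omitted in this sketch.
    """
    n_frames = len(spectrogram[0])
    starts = range(0, n_frames - segment_len + 1, segment_len)
    per_segment = [
        [dft_magnitudes(band[s:s + segment_len], n_bins) for band in spectrogram]
        for s in starts
    ]
    n_segments = len(per_segment)
    return [
        [sum(seg[b][k] for seg in per_segment) / n_segments for k in range(n_bins)]
        for b in range(len(spectrogram))
    ]


# Hypothetical toy input: band 0 alternates in loudness, band 1 is constant.
spec = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
fp = fluctuation_pattern(spec, segment_len=4, n_bins=3)
print([[round(v, 6) for v in row] for row in fp])
```

    In this toy case, the alternating band shows energy at the highest modulation bin, whereas the constant band only has a non-zero value in bin 0.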

    3.2 Mathematical Foundations

    In this subsection, we introduce some basic notions from fuzzy set theory, namely, fuzzy sets, fuzzy aggregation operators, and fuzzy similarity measures.

    Fuzzy Sets

    Let $X$ be the universe of discourse, i.e., the set of all considered objects. In the case of an ordinary or crisp set $A$ in $X$, we have either $x \in A$ or $x \notin A$ for each element $x$ from $X$. Hence, a crisp set can be represented by a characteristic $X \to \{0, 1\}$ mapping. To avoid notational clutter, we reuse the name of a crisp set for its characteristic mapping. For instance, $\emptyset$ denotes the empty set as well as the mapping from the universe $X$ to $[0, 1]$ given by: $\emptyset(x) = 0$, for all $x \in X$. We use the notation $\mathcal{P}(X)$ for the class of crisp sets in $X$, and we write $\mathcal{P}_F(X)$ for the set of all finite crisp sets in $X$.

    Now, the concept of a crisp set can be generalized as follows:

    Definition 1. A fuzzy set $A$ in a universe $X$ is an $X \to [0, 1]$ mapping that associates with each element $x$ from $X$ a degree of membership $A(x)$.

    We use the notation $\mathcal{F}(X)$ for the class of fuzzy sets in $X$. For two fuzzy sets $A$ and $B$ in $X$, we write $A \subseteq B$ if $A(x) \le B(x)$ for all $x \in X$, and $A = B$ iff $A \subseteq B \wedge B \subseteq A$.

    The classical set-theoretic operations intersection and union can be generalized to fuzzy sets by means of a conjunctor and a disjunctor.

    Definition 2. A conjunctor $C$ is an increasing $[0, 1]^2 \to [0, 1]$ mapping that satisfies $C(0, 0) = C(0, 1) = C(1, 0) = 0$ and $C(1, 1) = 1$.

    Definition 3. A disjunctor $D$ is an increasing $[0, 1]^2 \to [0, 1]$ mapping that satisfies $D(1, 1) = D(0, 1) = D(1, 0) = 1$ and $D(0, 0) = 0$.

    Definition 4. Let $C$ be a conjunctor. The $C$-intersection $A \cap_C B$ of two fuzzy sets $A$ and $B$ in $X$ is the fuzzy set in $X$ given by, for all $x \in X$:

    $$(A \cap_C B)(x) = C(A(x), B(x)) \tag{1}$$

    Definition 5. Let $D$ be a disjunctor. The $D$-union $A \cup_D B$ of two fuzzy sets $A$ and $B$ in $X$ is the fuzzy set in $X$ given by, for all $x \in X$:

    $$(A \cup_D B)(x) = D(A(x), B(x)) \tag{2}$$

    To conclude this section, we define the concepts support and sigma count:

    Definition 6. The support $\mathrm{supp}\,A$ of a fuzzy set $A$ in $X$ is given by:

    $$\mathrm{supp}\,A = \{x \in X \mid A(x) > 0\} \tag{3}$$

    Definition 7. The sigma count $|A|$ of a fuzzy set $A$ in $X$ with finite support is given by:

    $$|A| = \sum_{x \in X} A(x) \tag{4}$$

    It is not hard to see that the sigma count is a generalization of the crisp concept cardinality to fuzzy sets. As stated in its definition, this generalization is only defined for fuzzy sets with finite support. We call such fuzzy sets finite, and we use the notation $\mathcal{F}_F(X)$ for the class of finite fuzzy sets in $X$. Obviously, all fuzzy sets in a finite universe $X$ are finite. In the remainder of this chapter, $X$ always denotes a finite universe.
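    The support and the sigma count are easy to illustrate in code. The sketch below assumes, purely for illustration, that a fuzzy set in a finite universe is stored as a Python dictionary that maps each element to its membership degree; the helper names `support` and `sigma_count` are hypothetical.

```python
def support(fuzzy_set):
    """Elements with a strictly positive membership degree."""
    return {x for x, degree in fuzzy_set.items() if degree > 0}


def sigma_count(fuzzy_set):
    """Sum of all membership degrees (generalized cardinality)."""
    return sum(fuzzy_set.values())


A = {"a": 0.0, "b": 0.25, "c": 1.0}
print(support(A))      # the set containing 'b' and 'c'
print(sigma_count(A))  # 1.25

# For a crisp set (degrees 0 or 1), the sigma count is the ordinary cardinality.
crisp = {"a": 1, "b": 0, "c": 1}
print(sigma_count(crisp))  # 2
```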


    Fuzzy Aggregation Operators

    Definition 8. A fuzzy aggregation operator $H$ of arity $n \in \mathbb{N} \setminus \{0\}$ is an increasing $[0, 1]^n \to [0, 1]$ mapping that satisfies $H(0, 0, \ldots, 0) = 0$ and $H(1, 1, \ldots, 1) = 1$.

    Fuzzy aggregation operators of arity $n$ are often said to be $n$-ary. Binary fuzzy aggregation operators are operators of arity 2. Also, note that we can naturally extend the usual order on $\mathbb{R}$ to a partial order on fuzzy aggregation operators. Namely, for two $n$-ary fuzzy aggregation operators $H_1$ and $H_2$, we write $H_1 \le H_2$ if $H_1(x_1, x_2, \ldots, x_n) \le H_2(x_1, x_2, \ldots, x_n)$ holds for all $x_1, x_2, \ldots, x_n \in [0, 1]$.

    Triangular norms and conorms (t-norms and t-conorms for short) are well-known binary fuzzy aggregation operators.

    Definition 9. An associative and commutative binary fuzzy aggregation operator $T$ is called a t-norm if it satisfies $T(x, 1) = x$ for all $x \in [0, 1]$.

    Definition 10. An associative and commutative binary fuzzy aggregation operator $S$ is called a t-conorm if it satisfies $S(x, 0) = x$ for all $x \in [0, 1]$.

    Each t-norm $T$ is a conjunctor, and each t-conorm $S$ is a disjunctor. Hence, they can be used to model the fuzzy intersection and union. More precisely, their pointwise extensions can be used for this:

    Definition 11. The pointwise extension $\mathcal{H}$ of a binary fuzzy aggregation operator $H$ is defined as, for all $A, B \in \mathcal{F}(X)$ and $x \in X$:

    $$\mathcal{H}(A, B)(x) = H(A(x), B(x)) \tag{5}$$

    i.e., $A \cap_T B = \mathcal{T}(A, B)$ and $A \cup_S B = \mathcal{S}(A, B)$ for all $A, B \in \mathcal{F}(X)$. Furthermore, note that t-norms and t-conorms, as a consequence of their associativity, can easily be generalized to arity $n > 2$ by recursive application. For arity $n = 1$, we let each t-norm and t-conorm correspond to the identity mapping.

    The minimum $T_M$ is the largest t-norm and the drastic product $T_D$, which is given by

    $$T_D(x, y) = \begin{cases} \min(x, y) & \text{if } \max(x, y) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

    for all $x, y \in [0, 1]$, is the smallest t-norm, i.e., $T_D \le T \le T_M$ for every t-norm $T$. Other common t-norms are the algebraic product $T_P$ and the Łukasiewicz t-norm $T_L$: $T_P(x, y) = x\,y$ and $T_L(x, y) = \max(0, x + y - 1)$, for all $x, y \in [0, 1]$. It can be proven that $T_L \le T_P$. Hence, $T_D \le T_L \le T_P \le T_M$.

    Definition 12. The dual $H^*$ of an $n$-ary fuzzy aggregation operator $H$ is defined as, for all $x_1, x_2, \ldots, x_n \in [0, 1]$:

    $$H^*(x_1, x_2, \ldots, x_n) = 1 - H(1 - x_1, 1 - x_2, \ldots, 1 - x_n) \tag{7}$$


    The dual of a t-norm $T$ is a t-conorm $T^*$, and vice versa. One can easily verify that $T_M^*(x, y) = \max(x, y)$, $T_P^*(x, y) = x + y - x\,y$, $T_L^*(x, y) = \min(1, x + y)$, and

    $$T_D^*(x, y) = \begin{cases} \max(x, y) & \text{if } \min(x, y) = 0 \\ 1 & \text{otherwise} \end{cases} \tag{8}$$

    for all $x, y \in [0, 1]$. The ordering is as follows: $T_M^* \le T_P^* \le T_L^* \le T_D^*$.
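    The four t-norms, their duals and the two orderings can be checked numerically. The sketch below is only a spot check on a small grid, not a proof; the function names and the grid of multiples of 1/8 (chosen so that all floating-point operations are exact) are illustrative assumptions.

```python
def t_min(x, y):
    return min(x, y)


def t_prod(x, y):
    return x * y


def t_luk(x, y):
    # Lukasiewicz t-norm
    return max(0.0, x + y - 1.0)


def t_drastic(x, y):
    return min(x, y) if max(x, y) == 1.0 else 0.0


def dual(h):
    """Dual of a binary operator: (x, y) -> 1 - h(1 - x, 1 - y)."""
    return lambda x, y: 1.0 - h(1.0 - x, 1.0 - y)


GRID = [k / 8 for k in range(9)]  # 0, 1/8, ..., 1 (exactly representable)


def pointwise_leq(h1, h2):
    """True if h1(x, y) <= h2(x, y) on every grid point."""
    return all(h1(x, y) <= h2(x, y) for x in GRID for y in GRID)


T_NORMS = [t_drastic, t_luk, t_prod, t_min]          # ascending order
T_CONORMS = [dual(t) for t in reversed(T_NORMS)]     # max, prob. sum, bounded sum, drastic sum

print(all(pointwise_leq(a, b) for a, b in zip(T_NORMS, T_NORMS[1:])))      # True
print(all(pointwise_leq(a, b) for a, b in zip(T_CONORMS, T_CONORMS[1:])))  # True
print(dual(t_prod)(0.5, 0.5))  # 0.75, i.e. x + y - x*y
```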

    Fuzzy Similarity Measures

    Definition 13. A fuzzy comparison measure is a binary fuzzy relation on $\mathcal{F}(X)$, i.e., a fuzzy set in $\mathcal{F}(X) \times \mathcal{F}(X)$.

    We consider the following properties of a fuzzy comparison measure $M$ [11]:

    $M(A, B) = 1 \Leftarrow A = B$ (reflexive)
    $M(A, B) = 1 \Rightarrow A = B$ (coreflexive)
    $M(A, B) = 1 \Leftarrow A \subseteq B \vee B \subseteq A$ (strong reflexive)
    $M(A, B) = 1 \Rightarrow A \subseteq B \vee B \subseteq A$ (weak coreflexive)
    $M(A, B) = 1 \Leftarrow A \subseteq B$ (inclusive)
    $M(A, B) = 1 \Rightarrow A \subseteq B$ (coinclusive)
    $M(A, B) = 0 \Leftarrow A \cap_T B = \emptyset$ ($T$-exclusive)
    $M(A, B) = 0 \Rightarrow A \cap_T B = \emptyset$ ($T$-coexclusive)
    $M(A, B) = M(B, A)$ (symmetric)
    $M(A, B) = M(A/\mathrm{supp}\,A, B/\mathrm{supp}\,A)$ (left-restrictable)
    $M(A, B) = M(A/\mathrm{supp}\,B, B/\mathrm{supp}\,B)$ (right-restrictable)
    $M(A, B) \le M(A/\mathrm{supp}\,A, B/\mathrm{supp}\,A)$ (weak left-restrictable)
    $M(A, B) \le M(A/\mathrm{supp}\,B, B/\mathrm{supp}\,B)$ (weak right-restrictable)

    for all $A, B \in \mathcal{F}(X)$, with $T$ a t-norm and $C/Y$, for $C \in \mathcal{F}(X)$, the restriction of $C$ to $Y \subseteq X$, i.e., $C/Y$ is the $Y \to [0, 1]$ mapping that associates $C(x)$ with each $x \in Y$.

    Definition 14. We call a fuzzy comparison measure a fuzzy similarity measure if it is reflexive.

    Definition 15. We call a fuzzy similarity measure a fuzzy inclusion measure if it is both inclusive and coinclusive.

    Definition 16. We call a fuzzy similarity measure a fuzzy resemblance measure if it is symmetric.


    4 A Triparametric Family of Fuzzy Similarity Measures

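    In [11], we introduced a triparametric family of cardinality-based fuzzy similarity measures. All measures in this family are instances of a general form that depends on three parameters:

    Definition 17. Let $\otimes$ be a binary fuzzy aggregation operator, and let $\varphi_1$ and $\varphi_2$ be $[0, 1]^3 \to \mathbb{R}$ mappings that are increasing in their first and second argument. The general form $M^{\otimes}_{\varphi_1,\varphi_2}$ is given by:

    $$M^{\otimes}_{\varphi_1,\varphi_2}(A, B) = \frac{\varphi_1(\|\otimes(A, A)\|, \|\otimes(B, B)\|, \|\otimes(A, B)\|)}{\varphi_2(\|\otimes(A, A)\|, \|\otimes(B, B)\|, \|\otimes(A, B)\|)} \tag{9}$$

    for all $A, B \in \mathcal{F}(X)$, with $\otimes$ applied to fuzzy sets by means of its pointwise extension (Definition 11) and $\|\cdot\|$ the relative sigma count, i.e., $\|A\| = |A|/|X|$ for each $A \in \mathcal{F}(X)$.

    We proved in the same paper that the following theorems hold: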
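    The general form lends itself to a direct implementation as a higher-order function. The sketch below assumes that fuzzy sets in a finite universe are given as equal-length lists of membership degrees and that the denominator is non-zero for the sets being compared; the names `general_form`, `pointwise` and `relative_sigma_count` are hypothetical, and the example instance uses the minimum as operator with numerator $z$ and denominator $x + y - z$.

```python
def relative_sigma_count(fuzzy_set):
    """Sigma count divided by the size of the (finite) universe."""
    return sum(fuzzy_set) / len(fuzzy_set)


def pointwise(op, A, B):
    """Pointwise extension of a binary operator to fuzzy sets."""
    return [op(a, b) for a, b in zip(A, B)]


def general_form(op, phi1, phi2):
    """Build a comparison measure from an operator and two 3-argument mappings."""
    def measure(A, B):
        x = relative_sigma_count(pointwise(op, A, A))
        y = relative_sigma_count(pointwise(op, B, B))
        z = relative_sigma_count(pointwise(op, A, B))
        return phi1(x, y, z) / phi2(x, y, z)  # assumes phi2(x, y, z) != 0
    return measure


# Example instance: operator = minimum, phi1 = z, phi2 = x + y - z.
m = general_form(min, lambda x, y, z: z, lambda x, y, z: x + y - z)

A = [0.5, 1.0, 0.0, 0.25]
B = [0.5, 0.5, 0.25, 0.0]
print(m(A, B))  # 0.5
print(m(A, A))  # 1.0
```

    Passing a different operator (for instance multiplication for the algebraic product) or different mappings yields other members of the family without changing the surrounding code.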

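    Theorem 1. Let $\otimes$ be an arbitrary fuzzy aggregation operator, and let $\varphi_1$ and $\varphi_2$ be $[0, 1]^3 \to \mathbb{R}$ mappings that are increasing in their first and second argument. The following implications hold:²

    $$(\forall x, y, z \in [0, 1])(0 \le \varphi_1(x, y, z) \le \varphi_2(x, y, z)) \Rightarrow M^{\otimes}_{\varphi_1,\varphi_2} \text{ is } [0, 1]\text{-valued} \tag{10}$$

    $$(\forall x \in [0, 1])(\varphi_1(x, x, x) = \varphi_2(x, x, x)) \Rightarrow M^{\otimes}_{\varphi_1,\varphi_2} \text{ is reflexive} \tag{11}$$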

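    Theorem 2. Let $T$ be an arbitrary t-norm, and let $\varphi_1$ and $\varphi_2$ be $[0, 1]^3 \to \mathbb{R}$ mappings that are increasing in their first and second argument. The following implications hold:

    $$(\forall x, y, z \in [0, 1])(\min(x, y) \le z \le \max(x, y) \Rightarrow \varphi_1(x, y, z) = \varphi_2(x, y, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is strong reflexive} \tag{12}$$

    $$(\forall x, y, z \in [0, 1])(x \le z \le y \Rightarrow \varphi_1(x, y, z) = \varphi_2(x, y, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is inclusive} \tag{13}$$

    $$(\forall x, y \in [0, 1])(\varphi_1(x, y, 0) = 0) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is } T\text{-exclusive} \tag{14}$$

    $$(\forall x, y, z \in [0, 1])(\varphi_1(x, y, z) = \varphi_1(y, x, z) \wedge \varphi_2(x, y, z) = \varphi_2(y, x, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is symmetric} \tag{15}$$

    $$(\forall x, z \in [0, 1])(\forall u, v \in [0, 1])(\varphi_1(x, u, z) = \varphi_1(x, v, z) \wedge \varphi_2(x, u, z) = \varphi_2(x, v, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is left-restrictable} \tag{16}$$

    $$(\forall y, z \in [0, 1])(\forall u, v \in [0, 1])(\varphi_1(u, y, z) = \varphi_1(v, y, z) \wedge \varphi_2(u, y, z) = \varphi_2(v, y, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is right-restrictable} \tag{17}$$

    $$(\forall x, z \in [0, 1])(\forall u, v \in [0, 1])(\varphi_1(x, u, z) = \varphi_1(x, v, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is weak left-restrictable} \tag{18}$$

    $$(\forall y, z \in [0, 1])(\forall u, v \in [0, 1])(\varphi_1(u, y, z) = \varphi_1(v, y, z)) \Rightarrow M^{T}_{\varphi_1,\varphi_2} \text{ is weak right-restrictable} \tag{19}$$

    2 A mapping $f$ from a set $D$ to $\mathbb{R}$ is $[0, 1]$-valued if $0 \le f(d) \le 1$ for all $d \in D$.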


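    Theorem 3. Let $\varphi_1$ and $\varphi_2$ be $[0, 1]^3 \to \mathbb{R}$ mappings that are increasing in their first and second argument. The following implications hold:

    $$(\forall x, y, z \in [0, 1])(z \le \sqrt{x\,y} \Rightarrow 0 \le \varphi_1(x, y, z) \le \varphi_2(x, y, z)) \Rightarrow M^{T_P}_{\varphi_1,\varphi_2} \text{ is } [0, 1]\text{-valued} \tag{20}$$

    $$(\forall x, y, z \in {]0, 1]})(z \le \sqrt{x\,y} \Rightarrow \varphi_1(x, y, z) > 0) \Rightarrow M^{T_P}_{\varphi_1,\varphi_2} \text{ is } T_P\text{-coexclusive} \tag{21}$$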


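    Theorem 4. Let $\varphi_1$ and $\varphi_2$ be $[0, 1]^3 \to \mathbb{R}$ mappings that are increasing in their first and second argument. The following implications hold:

    $$(\forall x, y, z \in [0, 1])(z \le \min(x, y) \Rightarrow 0 \le \varphi_1(x, y, z) \le \varphi_2(x, y, z)) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is } [0, 1]\text{-valued} \tag{22}$$

    $$(\forall x, y, z \in [0, 1])(z < \max(x, y) \wedge z \le \min(x, y) \Rightarrow \varphi_1(x, y, z) < \varphi_2(x, y, z)) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is coreflexive} \tag{23}$$

    $$(\forall x, y \in [0, 1])(\varphi_1(x, y, \min(x, y)) = \varphi_2(x, y, \min(x, y))) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is strong reflexive} \tag{24}$$

    $$(\forall x, y, z \in [0, 1])(z < \min(x, y) \Rightarrow \varphi_1(x, y, z) < \varphi_2(x, y, z)) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is weak coreflexive} \tag{25}$$

    $$(\forall x, y \in [0, 1])(x \le y \Rightarrow \varphi_1(x, y, x) = \varphi_2(x, y, x)) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is inclusive} \tag{26}$$

    $$(\forall x, y, z \in [0, 1])(z < x \wedge z \le y \Rightarrow \varphi_1(x, y, z) < \varphi_2(x, y, z)) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is coinclusive} \tag{27}$$

    $$(\forall x, y, z \in {]0, 1]})(z \le \min(x, y) \Rightarrow \varphi_1(x, y, z) > 0) \Rightarrow M^{T_M}_{\varphi_1,\varphi_2} \text{ is } T_M\text{-coexclusive} \tag{28}$$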


    For this chapter, we restrict ourselves to the fuzzy similarity measures listed in Table 1. As indicated in the second and third column, all of these measures are members of the above-mentioned family. It is not hard to see that the antecedent of (20) is not satisfied for the parameters of $M_1^{T_P}$, $M_2^{T_P}$, $M_3^{T_P}$ and $M_{11}^{T_P}$. Therefore, we omitted the expressions of these measures. Furthermore, note that we used the equality $|A \cup_{T_M^*} B| = |A| + |B| - |A \cap_{T_M} B|$ to shorten some of the expressions.
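    A small worked example may make the failure of the antecedent of (20) concrete. It uses the parameters listed for $M_1$ in Table 1, namely numerator mapping $(x, y, z) \mapsto z$ and denominator mapping $(x, y, z) \mapsto x$; the one-element universe and the membership degrees are chosen purely for illustration and do not come from the chapter.

    $$X = \{u\}, \quad A(u) = \tfrac{1}{2}, \quad B(u) = 1$$

    $$x = \|A \cap_{T_P} A\| = \tfrac{1}{4}, \qquad y = \|B \cap_{T_P} B\| = 1, \qquad z = \|A \cap_{T_P} B\| = \tfrac{1}{2}$$

    $$z = \tfrac{1}{2} \le \sqrt{x\,y} = \tfrac{1}{2}, \quad \text{but} \quad z = \tfrac{1}{2} > x = \tfrac{1}{4}, \quad \text{so} \quad \frac{z}{x} = 2 \notin [0, 1]$$

    Hence the numerator can exceed the denominator even though $z \le \sqrt{x\,y}$ holds, which is why no expression is listed for this measure under the algebraic product.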

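    Using Theorems 1–4, we can prove properties of the considered fuzzy similarity measures. Table 2 indicates which properties can be proven in this way. We refer to [11] for some example proofs. The advantage of the (weak) restrictable fuzzy similarity measures will be explained further in this chapter. For the other properties, we do not elaborate on their practical use. However, it is easy to see that, depending on the intended application, these remaining properties can be important as well.

    Table 1. The considered cardinality-based fuzzy similarity measures

    | Measure | $\varphi_1(x, y, z)$ | $\varphi_2(x, y, z)$ | $\otimes = T_M$ | $\otimes = T_P$ |
    |---|---|---|---|---|
    | $M_1$ | $z$ | $x$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\lvert A \rvert}$ | n/a |
    | $M_2$ | $z$ | $y$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\lvert B \rvert}$ | n/a |
    | $M_3$ | $z$ | $\min(x, y)$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\min(\lvert A \rvert, \lvert B \rvert)}$ | n/a |
    | $M_4$ | $z$ | $\sqrt{x\,y}$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\sqrt{\lvert A \rvert\,\lvert B \rvert}}$ | $\frac{\lvert A \cap_{T_P} B \rvert}{\sqrt{\lvert A \cap_{T_P} A \rvert\,\lvert B \cap_{T_P} B \rvert}}$ |
    | $M_5$ | $z$ | $\frac{x + y}{2}$ | $\frac{2\,\lvert A \cap_{T_M} B \rvert}{\lvert A \rvert + \lvert B \rvert}$ | $\frac{2\,\lvert A \cap_{T_P} B \rvert}{\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert}$ |
    | $M_6$ | $z$ | $\max(x, y)$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\max(\lvert A \rvert, \lvert B \rvert)}$ | $\frac{\lvert A \cap_{T_P} B \rvert}{\max(\lvert A \cap_{T_P} A \rvert, \lvert B \cap_{T_P} B \rvert)}$ |
    | $M_7$ | $z$ | $x + y - z$ | $\frac{\lvert A \cap_{T_M} B \rvert}{\lvert A \cup_{T_M^*} B \rvert}$ | $\frac{\lvert A \cap_{T_P} B \rvert}{\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert - \lvert A \cap_{T_P} B \rvert}$ |
    | $M_8$ | $\min(x, y)$ | $x + y - z$ | $\frac{\min(\lvert A \rvert, \lvert B \rvert)}{\lvert A \cup_{T_M^*} B \rvert}$ | $\frac{\min(\lvert A \cap_{T_P} A \rvert, \lvert B \cap_{T_P} B \rvert)}{\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert - \lvert A \cap_{T_P} B \rvert}$ |
    | $M_9$ | $\sqrt{x\,y}$ | $x + y - z$ | $\frac{\sqrt{\lvert A \rvert\,\lvert B \rvert}}{\lvert A \cup_{T_M^*} B \rvert}$ | $\frac{\sqrt{\lvert A \cap_{T_P} A \rvert\,\lvert B \cap_{T_P} B \rvert}}{\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert - \lvert A \cap_{T_P} B \rvert}$ |
    | $M_{10}$ | $\frac{x + y}{2}$ | $x + y - z$ | $\frac{\lvert A \rvert + \lvert B \rvert}{2\,\lvert A \cup_{T_M^*} B \rvert}$ | $\frac{\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert}{2\,(\lvert A \cap_{T_P} A \rvert + \lvert B \cap_{T_P} B \rvert - \lvert A \cap_{T_P} B \rvert)}$ |
    | $M_{11}$ | $\max(x, y)$ | $x + y - z$ | $\frac{\max(\lvert A \rvert, \lvert B \rvert)}{\lvert A \cup_{T_M^*} B \rvert}$ | n/a |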
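    Some of the properties can be spot-checked numerically for individual measures. The sketch below implements two minimum-based measures from Table 1 (the first and the seventh) directly from their sigma-count expressions and checks inclusiveness and symmetry on one hand-picked pair of fuzzy sets; it is an illustration under these assumptions, not a substitute for the proofs in [11]. The function names `m1_tm` and `m7_tm` are hypothetical.

```python
def m1_tm(A, B):
    """|A ∩ B| / |A| with the minimum as intersection (assumes |A| > 0)."""
    return sum(min(a, b) for a, b in zip(A, B)) / sum(A)


def m7_tm(A, B):
    """|A ∩ B| / |A ∪ B| with minimum and maximum (assumes |A ∪ B| > 0)."""
    intersection = sum(min(a, b) for a, b in zip(A, B))
    union = sum(max(a, b) for a, b in zip(A, B))
    return intersection / union


A = [0.25, 0.5, 0.0]
B = [0.5, 0.5, 0.25]   # A(x) <= B(x) everywhere, so A is included in B

print(m1_tm(A, B))  # 1.0  (consistent with inclusiveness)
print(m1_tm(B, A))  # 0.6  (the measure is not symmetric)
print(m7_tm(A, B), m7_tm(B, A))  # 0.6 0.6  (symmetric)
```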

    5 Fuzzy Audio Similarity Measures

    5.1 In General

    Henceforth, let $\mathbb{F}$ denote the set of all possible audio fragments.

    Definition 18. An audio similarity measure is an $\mathbb{F} \times \mathbb{F} \to \mathbb{R}$ mapping that associates with each pair of audio fragments a real number that represents the similarity between these fragments.


    We use the notation $\mathbb{M}$ for the set of all audio similarity measures. As explained previously in this chapter, an audio similarity measure usually consists of two stages. First, an $\mathbb{F} \to \mathbb{R}^d$ mapping is used to extract a $d$-dimensional feature vector, with $d \in \mathbb{N} \setminus \{0\}$, from each audio fragment, and then the similarity between the two feature vectors is computed by means of an $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ mapping.

    Definition 19. A fuzzy audio similarity measure is a binary fuzzy relation on $\mathbb{F}$, i.e., an $\mathbb{F} \times \mathbb{F} \to [0, 1]$ mapping, that associates with each pair of audio fragments a degree of similarity.

    Thus, $\mathcal{F}(\mathbb{F} \times \mathbb{F})$ is the set of all fuzzy audio similarity measures. Obviously, we have $\mathcal{F}(\mathbb{F} \times \mathbb{F}) \subseteq \mathbb{M}$.

    5.2 Based on SHs and FPs

    Recall that a fuzzy similarity measure is an $\mathcal{F}(X) \times \mathcal{F}(X) \to [0, 1]$ mapping. Hence, if we can identify the feature vectors with fuzzy sets, then a fuzzy similarity measure can be used to implement the similarity measurement stage of a fuzzy audio similarity measure. We use this approach to construct fuzzy audio similarity measures based on SHs and FPs. More precisely, we consider the fuzzy audio similarity measures that compare normalized SHs and FPs using one of the fuzzy similarity measures listed in Table 1.


    Since SHs and FPs consist of values from $[0, +\infty[$, they can be converted to fuzzy sets by means of normalization, i.e., dividing each value by the maximum value. In practice, normalization is not always required. Namely, one can easily verify that normalization is not necessary if the fuzzy similarity measure $M$ satisfies

    $$M(A, B) = M(a\,A, b\,B) \tag{29}$$

    for all $A, B \in \mathcal{F}(X)$ and $a, b \in {]0, +\infty[}$, with $c\,C$, for $(c, C) \in {]0, +\infty[} \times \mathcal{F}(X)$, the $X \to [0, +\infty[$ mapping defined by: $(c\,C)(x) = c\,C(x)$, for all $x \in X$. It can easily be proven that (29) holds for $M_4^{T_P}$. None of the other considered fuzzy similarity measures satisfies (29). However, if the feature vectors have the same maximum value, then it is sufficient that the fuzzy similarity measure $M$ satisfies

    $$M(A, B) = M(a\,A, a\,B) \tag{30}$$

    for all $A, B \in \mathcal{F}(X)$ and $a \in {]0, +\infty[}$. Most of the considered fuzzy similarity measures satisfy (30), but unfortunately it is not often the case that the maximum values of the feature vectors are equal in practice. In particular, this is generally not true for SHs and FPs.
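    The effect of normalization can be illustrated with two small feature vectors. The sketch below assumes flattened feature vectors stored as lists of non-negative numbers; it compares the product-based fourth measure of Table 1 (a cosine-like quotient) with the minimum-based seventh measure, before and after normalization. The names `normalize`, `m4_tp` and `m7_tm` and the sample vectors are hypothetical.

```python
import math


def normalize(vector):
    """Divide each value by the maximum value (assumes a positive maximum)."""
    maximum = max(vector)
    return [v / maximum for v in vector]


def m4_tp(A, B):
    """sum(a*b) / sqrt(sum(a*a) * sum(b*b)); assumes neither vector is all zero."""
    numerator = sum(a * b for a, b in zip(A, B))
    return numerator / math.sqrt(sum(a * a for a in A) * sum(b * b for b in B))


def m7_tm(A, B):
    """sum(min) / sum(max); assumes the vectors are not both all zero."""
    return sum(min(a, b) for a, b in zip(A, B)) / sum(max(a, b) for a, b in zip(A, B))


raw_a = [0.0, 3.0, 6.0]   # maximum 6
raw_b = [2.0, 2.0, 4.0]   # maximum 4

# Independent rescaling of both arguments leaves the product-based measure unchanged.
print(math.isclose(m4_tp(raw_a, raw_b), m4_tp(normalize(raw_a), normalize(raw_b))))  # True

# The minimum-based measure changes, so normalization matters here.
print(m7_tm(raw_a, raw_b), m7_tm(normalize(raw_a), normalize(raw_b)))

# A common scaling factor, however, does not affect it.
print(m7_tm([2 * v for v in raw_a], [2 * v for v in raw_b]) == m7_tm(raw_a, raw_b))  # True
```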


    Restricting Computation Time

    In Fig. 1a, white and black depict zero and the maximum value, respectively.Hence, we identify this SH with a fuzzy set A by interpreting black as oneand white as zero. Since a large portion of the gure is white, suppA willcontain considerably less elements than X. This will be the case for most SHs,because the higher loudness levels are rarely reached. When restrictable fuzzysimilarity measures are used for comparing such fuzzy sets, we can restrictthe computation time. For instance, we would normally calculate

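    $$\frac{\sum_{x \in X} \min(A(x), B(x))}{\sum_{x \in X} A(x)} \tag{31}$$

    to determine the value of $M_1^{T_M}$ for $A,B \in \mathcal{F}(X)$. However, since $M_1^{T_M}$ is left-restrictable, we obtain the same value by calculating

    $$\frac{\sum_{x \in \operatorname{supp} A} \min(A(x), B(x))}{\sum_{x \in \operatorname{supp} A} A(x)} \tag{32}$$

    The latter form requires $|\operatorname{supp} A|$ comparisons and $2\,(|\operatorname{supp} A| - 1)$ additions, while the former form needs $|X| + 2\,(|X| - 1)$ calculations. Hence, the latter form can be calculated substantially faster when $\operatorname{supp} A$ contains considerably fewer elements than $X$.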
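    The saving can be made concrete with a small sketch. It assumes that $M_1^{T_M}(A,B)$ is the quotient of $\sum_x \min(A(x),B(x))$ and $\sum_x A(x)$, and it stores $A$ sparsely as a dictionary holding only the elements of its support; the function names and the sample values are illustrative and do not come from the chapter.

```python
def m1_full(a, b):
    """Sum over the whole universe X (a and b are equally long lists)."""
    denominator = sum(a)
    numerator = sum(min(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def m1_restricted(a_sparse, b):
    """Sum over supp A only; a_sparse maps index -> non-zero membership."""
    denominator = sum(a_sparse.values())
    numerator = sum(min(v, b[i]) for i, v in a_sparse.items())
    return numerator / denominator if denominator else 0.0


# Hypothetical fuzzy sets on a universe of six elements.
A = [0.0, 0.5, 0.0, 1.0, 0.0, 0.0]
B = [0.3, 0.25, 0.9, 0.5, 0.1, 0.0]
A_sparse = {i: v for i, v in enumerate(A) if v > 0}  # supp A = {1, 3}

print(m1_full(A, B), m1_restricted(A_sparse, B))  # both forms agree
```

    Elements outside the support of $A$ contribute zero to both sums, so skipping them changes nothing except the amount of work.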

    Weak restrictable fuzzy similarity measures can also reduce the computation time in practical applications. For instance, when searching for audio fragments that are very similar to a reference fragment by comparing SHs with the weak left-restrictable measure $M_7^{T_M}$, we can first calculate the upper bound $M_7^{T_M}(A/_{\operatorname{supp} A},\, B/_{\operatorname{supp} A})$. Since we need to find high similarities in this case, there is no need to do the extra computations required to determine $M_7^{T_M}(A,B)$ when the upper bound $M_7^{T_M}(A/_{\operatorname{supp} A},\, B/_{\operatorname{supp} A})$ is small. More concretely, we only need to calculate the right term in the numerator and denominator of

    $$\frac{\sum_{x \in \operatorname{supp} A} \min(A(x), B(x)) + \sum_{x \in X \setminus \operatorname{supp} A} \min(A(x), B(x))}{\sum_{x \in \operatorname{supp} A} \max(A(x), B(x)) + \sum_{x \in X \setminus \operatorname{supp} A} \max(A(x), B(x))} \tag{33}$$

    if the quotient of the left terms is large enough. In this way, the computation time can be reduced substantially when there are a lot of audio fragments that are only slightly similar to the reference fragment.
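    The early-exit strategy described above can be sketched as follows. The code assumes that $M_7^{T_M}(A,B)$ is the quotient of $\sum_x \min(A(x),B(x))$ and $\sum_x \max(A(x),B(x))$, stores $A$ sparsely as a dictionary over its support, and uses a hypothetical `threshold` parameter to decide whether the remaining terms are worth computing.

```python
def m7_pruned(a_sparse, b, threshold):
    """Return (value, bound); value is None when the bound is below threshold.

    a_sparse maps index -> non-zero membership of A, b is a list over X.
    """
    # Left terms: sums restricted to supp A.
    numerator = sum(min(v, b[i]) for i, v in a_sparse.items())
    denominator = sum(max(v, b[i]) for i, v in a_sparse.items())
    bound = numerator / denominator if denominator else 0.0
    if bound < threshold:
        return None, bound  # cannot be a close match, so stop early
    # Right terms: outside supp A we have A(x) = 0, hence min = 0 and max = B(x).
    rest = sum(v for i, v in enumerate(b) if i not in a_sparse)
    total = denominator + rest
    return (numerator / total if total else 0.0), bound


# Hypothetical fuzzy sets on a universe of four elements.
A_sparse = {0: 1.0, 1: 0.5}
B = [0.5, 0.5, 1.0, 0.0]

print(m7_pruned(A_sparse, B, threshold=0.5))  # full value is computed
print(m7_pruned(A_sparse, B, threshold=0.8))  # pruned after the bound
```

    The bound is valid because the right-hand terms can only enlarge the denominator, so the quotient of the left terms never underestimates the full value.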

  • 226 K. Bosteels and E.E. Kerre

    6 Experimental Results and Discussion

    6.1 Evaluation

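    We evaluate the performance of a given audio similarity measure by examining the ordering generated by it when we use it to arrange the audio fragments of a test collection according to their similarity with a reference fragment. Formally, we define a test collection as follows:

    Definition 20. A test collection is a couple $(F,\sim)$, with $F \in \mathcal{P}_F(\mathbb{F})$ and $\sim$ an equivalence relation on $F$ modelling "is very similar to".

    We use the notation $\mathbb{T}$ for the set of all possible test collections. Now, suppose that $(F,\sim) \in \mathbb{T}$. For a reference fragment $a \in F$ and an audio similarity measure $M$, we can then use the normalized average rank (NAR) [12] to evaluate the ordering of the elements of $F$ according to their similarity with $a$, generated by $M$:

    Definition 21. The normalized average rank is the $\mathbb{T} \times \mathbb{M} \times \mathbb{F} \to [0, 1]$ mapping NAR given by:

    $$\mathrm{NAR}((F,\sim),M,a) = \frac{\mathrm{eval}(\mathrm{ranks}((F,\sim),M,a))}{|F|} \tag{34}$$

    for all $((F,\sim),M,a) \in \mathbb{T} \times \mathbb{M} \times \mathbb{F}$, with eval the $\mathcal{P}_F(\mathbb{N} \setminus \{0\}) \to \mathbb{R}$ mapping such that, for all $N \in \mathcal{P}_F(\mathbb{N} \setminus \{0\})$,

    $$\mathrm{eval}(N) = \frac{1}{|N|}\left(\sum_{n \in N} n \;-\; \sum_{i=1}^{|N|} i\right) \tag{35}$$

    and ranks the $\mathbb{T} \times \mathbb{M} \times \mathbb{F} \to \mathcal{P}_F(\mathbb{N} \setminus \{0\})$ mapping given by:

    $$\mathrm{ranks}((F,\sim),M,a) = \{\mathrm{rank}_{F,M,a}(b) \mid b \in F \wedge a \sim b\} \tag{36}$$

    for all $((F,\sim),M,a) \in \mathbb{T} \times \mathbb{M} \times \mathbb{F}$, where $\mathrm{rank}_{F,M,a}$ is the $F \to \mathbb{N} \setminus \{0\}$ mapping that associates with each fragment in $F$ its rank number in the ordering according to the similarity with $a$, generated by $M$.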

    The NAR is 0 for perfect performance, and approaches 1 as the performance worsens. For instance, suppose that $F = \{a_1, a_2, b_1, b_2\}$ is a set of audio fragments such that $\{a_1, a_2\}$ and $\{b_1, b_2\}$ are the equivalence classes of very similar fragments, i.e., $\sim$ is the equivalence relation on $F$ that satisfies $a_1 \sim a_2$ and $b_1 \sim b_2$. Now, let $M$ be a fuzzy audio similarity measure that generates the following values:

    $$\begin{array}{c|cccc}
    M & a_1 & a_2 & b_1 & b_2 \\ \hline
    a_1 & 1 & 0.9 & 0.3 & 0.5 \\
    a_2 & 0.9 & 1 & 0.4 & 0.8 \\
    b_1 & 0.3 & 0.4 & 1 & 0.7 \\
    b_2 & 0.5 & 0.8 & 0.7 & 1
    \end{array}$$


    We then obtain the sequence $(a_1, a_2, b_2, b_1)$ if we order the elements of $F$ according to their similarity with $a_1$, i.e., according to the values that $M$ generates for $\{a_1\} \times F = \{(a_1, a_1), (a_1, a_2), (a_1, b_1), (a_1, b_2)\}$. Hence,

    $$\mathrm{ranks}((F,\sim),M,a_1) = \{1, 2\} \tag{37}$$

    and thus

    $$\mathrm{NAR}((F,\sim),M,a_1) = \frac{(1 + 2) - (1 + 2)}{4 \cdot 2} = 0 \tag{38}$$

    The NAR is 0 in this case because the obtained ordering is perfect, i.e., all fragments that are very similar to $a_1$ are placed up front. Similarly, we have $\mathrm{NAR}((F,\sim),M,a_2) = \mathrm{NAR}((F,\sim),M,b_1) = 0$. For $b_2$, however, we get

    $$\mathrm{ranks}((F,\sim),M,b_2) = \{1, 3\} \tag{39}$$

    and hence

    $$\mathrm{NAR}((F,\sim),M,b_2) = \frac{(1 + 3) - (1 + 2)}{4 \cdot 2} = 0.125 \tag{40}$$

    In this case, the NAR is larger than 0 since $a_2$ is placed before $b_1$ when $M$ is used to order the elements of $F$ according to their similarity with $b_2$.

    Since the NAR can vary a lot for different reference audio fragments, we calculate the global NAR (GNAR), which is the arithmetic mean of all NARs:

    Definition 22. The global normalized average rank is the $\mathbb{T} \times \mathbb{M} \to [0, 1]$ mapping GNAR given by:

    $$\mathrm{GNAR}((F,\sim),M) = \frac{1}{|F|} \sum_{a \in F} \mathrm{NAR}((F,\sim),M,a) \tag{41}$$

    for all $((F,\sim),M) \in \mathbb{T} \times \mathbb{M}$.

    The smaller the GNAR, the better the performance. For example, the GNAR for the $F$, $\sim$ and $M$ considered in the above-mentioned example is equal to $(0 + 0 + 0 + 0.125)/4 = 0.03125$. This indicates that, for the audio fragments in $F$, the performance of $M$ is very good, but not perfect.
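    The worked example can be reproduced with a few lines of code. This is a minimal sketch rather than the evaluation code used for the experiments: the names `nar`, `gnar` and `classes` are illustrative, equivalence classes are encoded as labels, and ties in similarity (which do not occur here) are simply left in input order.

```python
def nar(fragments, classes, sim, ref):
    """Normalized average rank of the fragments equivalent to ref."""
    # Order all fragments by decreasing similarity with the reference.
    ordering = sorted(fragments, key=lambda b: sim[ref][b], reverse=True)
    ranks = [pos for pos, b in enumerate(ordering, start=1)
             if classes[b] == classes[ref]]
    n = len(ranks)
    # eval(N): mean excess of the observed ranks over the ideal ranks 1..n.
    eval_value = (sum(ranks) - sum(range(1, n + 1))) / n
    return eval_value / len(fragments)


def gnar(fragments, classes, sim):
    """Arithmetic mean of the NAR over all reference fragments."""
    return sum(nar(fragments, classes, sim, a) for a in fragments) / len(fragments)


fragments = ["a1", "a2", "b1", "b2"]
classes = {"a1": "a", "a2": "a", "b1": "b", "b2": "b"}
sim = {
    "a1": {"a1": 1.0, "a2": 0.9, "b1": 0.3, "b2": 0.5},
    "a2": {"a1": 0.9, "a2": 1.0, "b1": 0.4, "b2": 0.8},
    "b1": {"a1": 0.3, "a2": 0.4, "b1": 1.0, "b2": 0.7},
    "b2": {"a1": 0.5, "a2": 0.8, "b1": 0.7, "b2": 1.0},
}

print([nar(fragments, classes, sim, a) for a in fragments])  # [0.0, 0.0, 0.0, 0.125]
print(gnar(fragments, classes, sim))  # 0.03125
```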

    6.2 Test Collection

    We used the BEPOP test collection for this chapter.³ This collection consists of samples of 128 songs that recently appeared in a Belgian hitlist. We extracted three fragments of nine seconds from each sample. Fragments of the same sample (and hence the same song) are considered very similar, i.e., $a \sim b$ holds for two audio fragments $a$ and $b$ if $a$ and $b$ are fragments from the same sample.

    ³ http://users.ugent.be/klbostee/bepop


    6.3 Results

    Figure 3 shows the results of our experiments. We compared the considered fuzzy audio similarity measures with the Euclidean distance $d$ between the SHs or FPs, interpreted as 1,800-dimensional vectors. Moreover, we also evaluated the performance of the Euclidean distance between normalized SHs or FPs, referred to below as the normalized Euclidean distance. The difference between the performance of $d$ and the normalized Euclidean distance turns out to be very small: the normalized variant performs slightly worse. Hence, we do not gain anything by normalizing the SHs or FPs before taking the Euclidean distance. However, normalized SHs or FPs clearly lead to better results when we compare them with $M_4^{T_P}$, and the performance of some of the remaining fuzzy similarity measures is similar to the performance of $d$.

    Overall, we see that FPs tend to perform better than SHs. A possible explanation for this observation is that SHs contain less information, since the higher loudness levels are rarely reached. Also, rhythm might be more useful than timbre to discriminate the songs in the BEPOP collection. Concerning the choice of scales, there does not appear to be an overall tendency. For $M_4^{T_P}$, however, it is clear that the Mel spectrogram leads to slightly better performance than the sonogram.

    To conclude this section, we explain why using $T_P$ instead of $T_M$ appears to magnify the performance, i.e., $T_P$ leads to better performance when $T_M$ performs well, and even worse performance when $T_M$ performs badly. This observation can be attributed to the fact that $T_M$ is noninteractive. For instance, consider the fuzzy sets shown in Fig. 2. In this figure, $A$, $B$ and $C$ are normalized FPs. $A$ and $B$ were both derived from a fragment of "You're Beautiful" by James Blunt, while $C$ corresponds with a fragment of "Rudebox" by Robbie Williams. Hence, $A$ and $B$ are more similar than $A$ and $C$. Because of the noninteractivity of $T_M$, however, we have $|A \cap_{T_M} B| \approx |A \cap_{T_M} C|$, and thus $M_4^{T_M}(A,B) < M_4^{T_M}(A,C)$ since $|A| = |A|$ and $|B| > |C|$. Hence, $M_4^{T_M}$ gives counterintuitive results in this case because there is practically no difference

    Fig. 2. Example that illustrates the difference between $M_4^{T_M}$ and $M_4^{T_P}$. White depicts 0, and black represents 1


    between $A \cap_{T_M} B$ and $A \cap_{T_M} C$, as a consequence of the noninteractivity of $T_M$. For $T_P$, however, we can quite clearly notice a difference between $A \cap_{T_P} B$ and $A \cap_{T_P} C$ in Fig. 2. In fact, we have $|A \cap_{T_P} B| > |A \cap_{T_P} C|$. This compensates for $|B \cap_{T_P} B| > |C \cap_{T_P} C|$, so that $M_4^{T_P}(A,B) > M_4^{T_P}(A,C)$.
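    The effect of noninteractivity can be imitated with toy data. The sketch below assumes that $M_4^{T}(A,B)$ is the quotient of $|A \cap_T B|$ and $\sqrt{|A \cap_T A| \cdot |B \cap_T B|}$, the form suggested by the discussion above. The vectors are hand-picked illustrations and not the FPs of Fig. 2: `B` has the same support as `A` but larger values, whereas `C` copies `A` and adds mass where `A` is zero.

```python
import math


def t_min(x, y):
    """Minimum t-norm T_M: the larger argument has no influence."""
    return min(x, y)


def t_prod(x, y):
    """Product t-norm T_P: both arguments influence the result."""
    return x * y


def card_intersection(a, b, t_norm):
    """Sigma-count |A cap_T B| of the T-intersection."""
    return sum(t_norm(x, y) for x, y in zip(a, b))


def m4(a, b, t_norm):
    """Assumed form of M_4 for a given t-norm."""
    numerator = card_intersection(a, b, t_norm)
    denominator = math.sqrt(card_intersection(a, a, t_norm)
                            * card_intersection(b, b, t_norm))
    return numerator / denominator if denominator else 0.0


A = [1.0, 0.5, 0.5, 0.0]
B = [1.0, 0.8, 0.8, 0.0]  # same support as A, larger values
C = [1.0, 0.5, 0.5, 0.4]  # equals A on supp A, extra mass elsewhere

# T_M cannot tell the two intersections apart, so only |B| > |C| matters.
print(card_intersection(A, B, t_min), card_intersection(A, C, t_min))
print(m4(A, B, t_min), m4(A, C, t_min))    # C is ranked above B
print(m4(A, B, t_prod), m4(A, C, t_prod))  # B is ranked above C
```

    With $T_M$ the numerators coincide and the larger cardinality of `B` alone lowers its score, while with $T_P$ the larger values of `B` on the support of `A` also raise the numerator, so the ordering is reversed.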

    7 Conclusion

    The BEPOP test collection is quite small, and hence we have to be careful when we base conclusions on it. Nevertheless, our experiments do indicate that fuzzy similarity measures can perform as well as, or even better than, the Euclidean distance for comparing SHs or FPs. In particular, we noticed that $M_4^{T_P}$ is very suitable for this task. Moreover, this measure does not require normalization, and its computation time can be restricted in certain practical applications since it is weak left- and right-restrictable.

    Actually, it is not that surprising that $M_4^{T_P}$ performs well. After all, one can easily verify that it corresponds with the cosine similarity measure, which has already been used successfully outside the fuzzy framework for comparing other types of feature vectors (e.g. [2]). However, we explained in the previous section that $M_4^{T_P}$ can be regarded as an improved version of $M_4^{T_M}$, and that other fuzzy similarity measures can be improved in the same way. This general insight can be considered to be more important than the absolute performance of the constructed audio similarity measures.

    8 Future Work

    We have only scratched the surface of the extensive range of possibilities that arise when audio feature vectors are identified with fuzzy sets. Obviously, investigating the use of other feature vectors and other fuzzy similarity measures is a possible direction of future research. Furthermore, it would be very interesting to examine the influence of the properties of the fuzzy similarity measures on the performance of the corresponding audio similarity measures. In any case, it should be worthwhile to conduct more elaborate experiments to analyse the performance of the obtained fuzzy audio similarity measures.


    References

    1. Aucouturier J J, Pachet F (2002) Music similarity measures: What's the use? In: Proceedings of the ISMIR International Conference on Music Information Retrieval

    2. Cooper M, Foote J (2002) Automatic music summarization via similarity analysis. In: Proceedings of the ISMIR International Conference on Music Information Retrieval

    3. Pampalk E (2006) Computational models of music similarity and their application in music information retrieval. PhD thesis, Vienna University of Technology

    4. Pampalk E, Dixon S, Widmer G (2003) Exploring music collections by browsing different views. In: Proceedings of the ISMIR International Conference on Music Information Retrieval

    5. Pampalk E, Rauber A, Merkl D (2002) Content-based organization and visualization of music archives. In: Proceedings of the ACM International Conference on Multimedia, 570–579

    6. Rauber A, Pampalk E, Merkl D (2003) Journal of New Music Research 32:193–210

    7. Logan B, Salomon A (2001) A music similarity function based on signal analysis. In: Proceedings of the International Conference on Multimedia and Expo, 745–748

    8. Mandel M, Ellis D (2005) Song-level features and support vector machines for music classification. In: Proceedings of the ISMIR International Conference on Music Information Retrieval

    9. Pampalk E, Dixon S, Widmer G (2003) On the evaluation of perceptual similarity measures for music. In: Proceedings of the International Conference on Digital Audio Effects, 7–12

    10. Pampalk E (2003) A Matlab toolbox to compute music similarity from audio. In: Proceedings of the ISMIR International Conference on Music Information Retrieval

    11. Bosteels K, Kerre E E (2007) Fuzzy Sets and Systems 158(22):2466–2479

    12. Müller H, Müller W, McG Squire D, Marchand-Maillet S, Pun T (2001) Pattern Recognition Letters 22:593–601

  • Fuzzy Techniques for Text Localisation in Images

    Przemyslaw Gorecki1, Laura Caponetti2, and Ciro Castiello2

    1 Department of Mathematics and Computer Science, University of Warmia and Mazury, ul. Oczapowskiego 2, 10-719 Olsztyn, Poland, pgorecki@matman.uwm.edu.pl

    2 Department of Computer Science, University of Bari, via E. Orabona, 4-70125 Bari, Italy, laura@di.uniba.it, castiello@di.uniba.it

    Summary. Text information extraction represents a fundamental issue in the context of digital image processing. Inside this wide area of research, a number of specific tasks can be identified, ranging from text detection to text recognition. In this chapter, we deal with the particular problem of text localisation, which aims at determining the exact location where the text is situated inside a document image. The strict connection between text localisation and image segmentation is highlighted in the chapter and a review of methods for image segmentation is proposed. In particular, the benefits coming from the employment of fuzzy and neuro-fuzzy techniques in this field are assessed, thus indicating a way to combine Computational Intelligence methods and document image analysis. Three particular methods based on image segmentation are presented to show different applications of fuzzy and neuro-fuzzy techniques in the context of text localisation.

    1 Introduction

    Text information represents a very important component among the contents of a digital image. This kind of information is related to the category usually referred to as semantic content. By contrast with perceptual content, related to low-level characteristics including colour, intensity or texture, semantic content involves recognition of components, such as text, objects or graphics inside a document image [1–3]. The importance of obtaining text information by means of image analysis is straightforward. In fact, text can be used to describe the content of a document image, can be converted into electronic formats (for storage and archiving purposes), and can be exploited to ultimately understand documents, thus enabling a plethora of applications ranging from document indexing to information extraction and automatic annotation of documents [4–6]. Additionally, with the increasing use of web documents, a lot of multimedia content is available having different page representation forms, which do not lend themselves easily to automatic analysis. Text stands as the most appropriate medium for allowing a suitable analysis of such contexts, with additional benefits deriving from possible conversions into other multimedia modalities (such as voice signal), or representations in natural language of the web page contents. The recognition of text in images is a step towards achieving such a representation [7].

    P. Gorecki et al.: Fuzzy Techniques for Text Localisation in Images, Studies in Computational Intelligence (SCI) 96, 233–270 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    The presence of text inside a digital image can be characterised by different properties: text size, alignment, spacing and colour. In particular, text can exhibit varying size, since text dimensions cannot be assumed a priori. Also, text alignment and text spacing are relevant properties that can vary a document's appearance in several ways, and assumptions about horizontal alignment of text can be made only when specific contexts are investigated. Usually text characters tend to have the same (or similar) colours inside an image; however, the chromatic visualisation may represent a fundamental property, especially when contrasting colours are employed to make text stand out from other image regions.

    Automatic methods for text information extraction have been investigated in a comprehensive way, in order to define different mechanisms that, starting from a digital image, could ultimately derive plain text to be stored or processed. By loosely referring to [8], we can define the following steps corresponding to the sequential sub-problems which characterise the general text information extraction task:

    Text Detection and Localisation. In some circumstances there is no certainty about the presence of text in a digital image, therefore the text detection step is devoted to the process of determining whether a text region is present or not inside the image under analysis. In this phase no proper text information is derived, but only a boolean response to a detection query. This is common when no a priori knowledge about the characteristics of an image is available. Once the presence of text inside an image has been assessed, the next step is devoted to determining the exact location where the text is situated. This phase is often combined with different techniques purposely related to the problem of image segmentation, thus configuring text regions as specific components to be isolated in digital images.

    Text Tracking. Text tracking represents a support activity correlated to the previously described step of text localisation whenever the task of text information extraction is performed over motion images (such as videos). Even if this kind of process has been frequently overlooked in the literature, it could also prove useful to verify the results of the text detection and localisation steps or to shorten their processing times.

    Text Recognition and Understanding. Text recognition represents the ultimate step when analysing a digital image with the aim of deriving plain text to be stored or processed. This phase is commonly carried out by means of specific Optical Character Recognition (OCR) technologies. Moreover, text understanding aims to classify text into logical elements, such as headings, paragraphs, and so on.


    In this chapter, we address the localisation step; the interested reader is referred to a number of papers directly devoted to the analysis of the other sub-problems [9–15]. In particular, the additional contribution of this chapter consists in introducing novel text localisation approaches, based on fuzzy segmentation techniques.

    When dealing with text localisation we are particularly concerned with the problem of digital image segmentation. The amount and complexity of information in the images, together with the process of image digitalisation, lead to a large amount of uncertainty in the image segmentation process. The adoption of the fuzzy paradigm is desirable in image processing because of the uncertainty and imprecision present in images, due to noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with the imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows the knowledge about the given problem to be represented in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information.

    The rest of the chapter is organised as follows. Section 2 is devoted to the presentation of a brief review of methods for image segmentation, proposing different lines of categorisation. Section 3 introduces some concepts related to fuzzy and neuro-fuzzy techniques, discussing their usefulness in the field of digital image processing. Specifically, the particular model of a neuro-fuzzy system is illustrated: its formalisation is useful for the subsequent presentation carried out in Sect. 4, where three particular text localisation approaches are reported for the sake of illustration. In Sect. 5 the outcomes of experimental results are reported and discussed. Section 6 closes the chapter with some concluding remarks.

    2 A Categorisation of Image Segmentation Approaches

    Image segmentation is widely acknowledged to play a crucial role in many computer vision applications and its relevance in the context of the text localisation process has already been mentioned. In this section we discuss this technique in the general field of document image analysis.

    Image segmentation represents the first step of document image analysis, with the objective of partitioning a document image into some regions of interest. Generally, in this context, image segmentation is also referred to as page segmentation. High level computer vision tasks, related to text information extraction, often utilise information about regions extracted from document pages. In this sense, the final purpose of page segmentation is to classify different regions in order to discriminate between text and non-text areas.¹ Moreover, image segmentation is critical, because segmentation results will affect all subsequent steps of image analysis. In recent years image segmentation techniques have been variously applied for the analysis of different types of documents, with the aim of text information extraction [16–21].

    ¹ Non-text regions may be distinguished as graphics, pictures, background, and so on (in accordance with the requirements of the specific problem context).

    Closely related to image segmentation is the problem of feature extraction. The goal is to extract the most salient characteristics of an image for the purpose of its segmentation: an effective set of features is one of the requirements for successful image segmentation. Information in the image, coded directly in pixel intensities, is highly redundant: the major problem here is the number of variables involved. Direct transformation of an image $f(x, y)$ of size $M \times N$ to a point in an $(M \cdot N)$-dimensional space is impractical, due to the number of dimensions involved. To solve this problem, the image representation must be simplified by minimising the number of dimensions needed to describe the image or some part of it. Therefore, a set of features is extracted from a region of interest in the image. It is common in the literature to distinguish between natural features, defined by the visual appearance of the image (e.g. intensity of a region), and artificial features, such as intensity histograms, frequency spectra, or co-occurrence matrices [22]. Moreover, first-order statistical features, second-order statistics, and higher-order statistics can be distinguished, depending on the number of points defining the local feature [23, 24]. In the first case, features convey information about intensity distributions, while in the second case, information about pixel pairs is exploited in order to take into account spatial information of the distribution. In the third case, more than two pixels are considered. The second-order and higher-order features are especially useful in describing texture, because they can capture relations in the repeating patterns that define the visual appearance of a texture.

    There is no single segmentation method that provides acceptable results for every type of image. General methods exist, but those which are designed for particular images often achieve better performance by utilising prior knowledge about the problem. For our purposes, we discuss segmentation methods by considering two distinct lines of classification (a diagram of the proposed categorisation is reported in Fig. 1).

    Fig. 1. The categorisation of the image segmentation approaches

    On the one hand, by referring to the working mechanism of the segmentation approaches, it is possible to distinguish three classes: top-down approaches, bottom-up approaches and hybrid approaches. Top-down algorithms start from the whole document image and iteratively subdivide it into smaller regions (blocks). The subdivision is based on a homogeneity criterion: the splitting procedure stops when the criterion is met and blocks obtained at this stage constitute the final segmentation result. Some examples of top-down algorithms are reported in [25, 26]. Bottom-up algorithms start from document image pixels and cluster the pixels into connected components (such as characters). The procedure can be iterated giving rise to a growing process which adjoins unconnected adjacent components, in order to cluster higher-order components (such as words, lines, document zones). Typical bottom-up algorithms can be found in [27–30]. Hybrid algorithms can be regarded as a mix of the previous approaches, thus configuring a procedure which involves both splitting and merging phases. Hybrid algorithms have been proposed in [31–33].

    The second line of classification to categorise segmentation approaches relies on the features utilised during the process. Methods can be categorised into region-based methods, edge-based methods and texture-based methods. In the first case properties such as intensity or colour are used to derive a set of features describing regions. Edge-based and texture-based methods, instead, derive a set of local features, concerning not only the analysis of a single pixel, but also its neighbourhood. In particular, the observation that image text regions have textural properties different from background or graphics represents the foundation of texture-based methods. In the following sections we discuss the above reported segmentation methods in more detail.

    2.1 Region-Based Methods

    Region-based methods for image segmentation use the colour or grey scale properties in a region; when text regions are to be detected, their differences with the corresponding properties of the background can be highlighted for the purpose of text localisation.

    The key for region-based segmentation consists in firstly devising suitable methods for partitioning an image into a number of connected components, according to some specific homogeneity criteria to be applied during the image feature analysis. Once the initial subdivision of the image into a grid of connected regions has been obtained, an iterative grouping process of similar regions is started in order to update the partition of the image. In this way, it is possible to create a final segmentation of regions which are meant to be purposely classified. It should be observed that the term "grouping" is used here in a loose sense. We intend to address a process which could originate an incremental or decremental assemblage of regions, with reference to region growing (bottom-up) methods, region splitting (top-down) methods and split-and-merge (hybrid) methods.


    The analysis of the image features can be performed on the basis of different techniques: among them, thresholding represents one of the simplest methods for segmentation. In some images, an object can be easily separated from the background if the intensity levels of the object fall outside the range of intensity levels of the background. This represents a perfect case for applying a thresholding approach. Each pixel of the input image $f(x, y)$ is compared with the threshold $t$ in order to produce the segmented image $l(x, y)$:

    $$l(x, y) = \begin{cases} 1 & \text{if } f(x, y) > t \text{ (object)}, \\ 0 & \text{if } f(x, y) \le t \text{ (background)}. \end{cases} \tag{1}$$

    The selection of an appropriate threshold value is essential in this technique. Many authors have proposed to find the threshold value by means of an image histogram shape analysis [34–37]. Global thresholding techniques use a fixed threshold for all pixels in the image and therefore work well only if the intensity histograms of the objects and background are well separated. Hence, this kind of technique cannot deal with images containing, for example, a strong illumination gradient. On the other hand, local adaptive thresholding selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood. This allows for thresholding of an image whose global intensity histogram does not contain distinctive peaks [38]. The thresholding approach has been successfully applied in many image segmentation problems with the goal of text localisation [39–41].
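    The difference between global and local adaptive thresholding can be illustrated with a small sketch. The local rule used here (threshold at the mid-point of the neighbourhood's intensity range, with a minimum-contrast guard in the spirit of Bernsen-type methods) is only one possible choice and is an assumption of this example; the parameters `radius` and `min_contrast` and the tiny test image are hypothetical.

```python
def global_threshold(image, t):
    """Eq. (1): label a pixel 1 (object) if its intensity exceeds t, else 0."""
    return [[1 if value > t else 0 for value in row] for row in image]


def local_threshold(image, radius=1, min_contrast=30):
    """Per-pixel threshold taken from the intensity range of a square window."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Window clipped at the image borders.
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            low, high = min(window), max(window)
            # Low-contrast windows are treated as background (assumption).
            if high - low >= min_contrast and image[y][x] > (low + high) / 2:
                labels[y][x] = 1
    return labels


# One-row image: background ramps from 10 to 80, objects (+60) at x = 1 and x = 6.
image = [[10, 80, 30, 40, 50, 60, 130, 80]]

print(global_threshold(image, 79))  # also marks the bright background at x = 7
print(local_threshold(image))       # [[0, 1, 0, 0, 0, 0, 1, 0]]
```

    In this example the first object and the brightest background pixel share the same intensity, so no single global threshold can separate them, whereas the neighbourhood-based rule recovers both objects.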

    Clustering can be seen as a generalisation of the thresholding technique. In fact, it allows for partitioning data into more than two clusters, dealing with a space of higher dimensionality than thresholding, where data are one-dimensional. Similarly to thresholding, clustering is performed in the image feature space, and it aims at finding structures in the collection of data, so that data can be classified into different groups (clusters). More precisely, data are partitioned into different subsets and data in each subset are similar in some way. During the clustering process, structures in data are discovered without any a priori knowledge and without providing an explanation or interpretation of why they exist [42]. Clustering techniques for image segmentation have been adopted for the purpose of text localisation [43–45].
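    As a simple illustration of clustering as a generalisation of thresholding, the sketch below partitions pixel intensities into $k$ groups with a basic k-means loop. The chapter does not prescribe a particular clustering algorithm, so k-means is an assumption of this example, as are the evenly spaced initial centroids and the sample intensities; for multi-dimensional feature vectors the absolute difference would be replaced by a vector distance.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar features into k groups; returns (labels, centroids)."""
    low, high = min(values), max(values)
    # Deterministic initialisation: centroids spread evenly over the value range.
    if k > 1:
        centroids = [low + (high - low) * c / (k - 1) for c in range(k)]
    else:
        centroids = [sum(values) / len(values)]

    def assign():
        # Each value goes to the nearest centroid.
        return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

    for _ in range(iterations):
        labels = assign()
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return assign(), centroids


# Hypothetical pixel intensities forming three groups (dark, mid, bright).
intensities = [10, 12, 11, 100, 102, 200, 198]
labels, centroids = kmeans_1d(intensities, k=3)
print(labels)     # [0, 0, 0, 1, 1, 2, 2]
print(centroids)  # [11.0, 101.0, 199.0]
```

    With `k=2` the procedure reduces to choosing a single threshold between the two centroids, which is the sense in which thresholding is a special case.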

    2.2 Edge-Based Methods

    Edge-based techniques, rather than finding regions by adopting a grouping process, aim at identifying explicit or implicit boundaries between regions.

    Edge-based methods represent the earliest segmentation approaches and rely on the process of edge detection. The goal of edge detection is to localise the points in the image where abrupt changes in intensity take place. In document images, edges may appear at discontinuity points between the text and the background. The simplest mechanism to detect edges is the differential detection approach. As the images are two-dimensional, the gradient $\nabla f$ is calculated from the partial derivatives of the image $f(x, y)$:


    $$\nabla f(x, y) = \left[ \frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y} \right]. \tag{2}$$

    The computations of the partial derivatives are usually realised by convolving the image with a given filter, which estimates the gradient. The maps of edge points obtained at the end of this process can subsequently be utilised by an edge tracking technique, so that the contours of different regions may be highlighted inside the image. Generally, the Canny operator, one of the most powerful edge filters, can be applied to detect edge points in document images [46].
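    A gradient estimate of this kind can be sketched with the Sobel masks, which are one common choice of derivative filter and are used here purely as an example; the chapter does not commit to a specific filter at this point. The helper names and the synthetic image are illustrative, and border pixels are simply left at zero.

```python
import math

# Sobel masks estimating the horizontal and vertical partial derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_mask(image, mask):
    """Slide a 3x3 mask over the image (correlation form; borders stay 0).

    True convolution would flip the mask, which only changes the sign of the
    Sobel response and therefore not the gradient magnitude.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = sum(mask[j][i] * image[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3))
    return out


def gradient_magnitude(image):
    """Magnitude of the estimated gradient at every pixel."""
    gx = apply_mask(image, SOBEL_X)
    gy = apply_mask(image, SOBEL_Y)
    return [[math.hypot(gx[y][x], gy[y][x]) for x in range(len(image[0]))]
            for y in range(len(image))]


# Synthetic image with a vertical step edge between columns 2 and 3.
image = [[0, 0, 0, 10, 10] for _ in range(4)]
magnitude = gradient_magnitude(image)
print(magnitude[1])  # [0.0, 0.0, 40.0, 40.0, 0.0]
```

    Thresholding the magnitude map would give a simple map of edge points; the Canny operator additionally smooths the image, thins the edges and links them with hysteresis.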

    In the case of text localisation, edge-based methods aim at exploiting the high contrast between the text and the background. The edges of the text boundary are identified and merged, and then several heuristics are used to filter out the non-text regions [47–49].

    2.3 Texture-Based Methods

    Texture-based methods consider a document image as a composite of textures of different classes. With this approach various texture segmentation and classification techniques can be used directly or with some modifications. Some texture segmentation approaches apply splitting and merging or clustering methods to the feature vectors computed for the image and describing its texture information. When a document image is considered as texture, text regions are assumed to have texture features different from the non-text ones. Text regions are modelled as regular periodic textures, because they contain text lines with the same orientation. Also their interline spacings are approximately the same. Non-text regions, instead, correspond to irregular textures. Generally, the problem is how to separate two or more different texture classes. Techniques based on Gabor filters, wavelets, FFT and spatial variance can be used to detect the textural properties of an image text region [50–52]. In the following, we describe two fundamental approaches: Gabor filtering and multi-scale techniques.

    Gabor Filtering

    Gabor filtering is a classical approach to describe textural properties of an image. A two-dimensional Gabor filter is a complex sinusoid (with a wavelength $\lambda$ and a phase offset $\varphi$) modulated by a two-dimensional Gaussian function (with an aspect ratio of $\gamma$). The Gabor filter, which has an orientation $\theta$, is defined as follows:

    $$G(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \frac{x'}{\lambda} + \varphi\right), \tag{3}$$

    where $x' = x \cos\theta + y \sin\theta$ and $y' = -x \sin\theta + y \cos\theta$.
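    A Gabor kernel of this form can be sampled on a discrete grid as follows. The sketch assumes the standard real-valued Gabor expression, in which $\sigma$ is the standard deviation of the Gaussian envelope; the function names, the kernel size and the parameter values are illustrative choices and are not taken from the chapter.

```python
import math


def gabor(x, y, wavelength, theta, phase, sigma, gamma):
    """Real-valued Gabor function evaluated at the point (x, y)."""
    # Rotate the coordinates by the filter orientation theta.
    xr = x * math.cos(theta) + y * math.sin(theta)
    yr = -x * math.sin(theta) + y * math.cos(theta)
    envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2.0 * sigma * sigma))
    carrier = math.cos(2.0 * math.pi * xr / wavelength + phase)
    return envelope * carrier


def gabor_kernel(size, wavelength, theta, phase=0.0, sigma=2.0, gamma=0.5):
    """Sample the Gabor function on a size x size grid centred at the origin."""
    half = size // 2
    return [[gabor(x, y, wavelength, theta, phase, sigma, gamma)
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]


# A small filter bank: one wavelength, four orientations (hypothetical values).
bank = [gabor_kernel(7, wavelength=4.0, theta=k * math.pi / 4) for k in range(4)]
print(len(bank), len(bank[0]), len(bank[0][0]))  # 4 7 7
print(bank[0][3][3])  # centre of the kernel: 1.0 for zero phase
```

    Convolving an image with each kernel of such a bank yields one filtered image per orientation and frequency, from which local texture features can be computed.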


    In the context of text extraction, a filter bank consisting of several orientation-selective 2-D Gabor filters can be used to detect texture features of text and non-text components. As an illustrative example, in [53] the Gabor transform with $m$ different spatial frequencies and $p$ different orientations is applied to the input image, producing $mp$ filtered images. A texture feature is computed as the mean value in small overlapping windows centred at each pixel. The values of each pixel in $n$ feature images form an $n$-dimensional feature vector. These vectors are grouped into $K$ clusters using a squared-error clustering algorithm.

    Multi-Scale Techniques

    One problem associated with document texture based approaches is due toboth large intra-class and inter-class variations in textural features. To solvethis problem multi-scale analysis and features extraction at dierent scaleshave been introduced by some authors [54,55]. In [56], Wavelet decompositionis used to dene local energy variations in the image at several scales. Binaryimage, which is obtained by thresholding the local energy variation, is analysedby connected component-based ltering using geometric attributes such as sizeand aspect ratio. All the text regions, which are detected at several scales, aremerged to give the nal result.

    Wavelet packet analysis is an important generalisation of Wavelet analysis [57, 58]. Wavelet packet functions are localisable in space, like Wavelet functions, but offer more flexibility in the decomposition of signals. Wavelet packet approximators are based on translated and scaled Wavelet packet functions $W_{j,b,k}$, which are generated from the base function [59] according to the following equation:

    $$W_{j,b,k}(t) = 2^{j/2}\, W_b\big(2^{j}(t - k)\big), \tag{4}$$

    where $j$ is the resolution level, $W_b$ is the Wavelet packet function generated by scaling and translating a mother Wavelet function, $b$ is the number of oscillations (zero crossings) of $W_b$ and $k$ is the translation shift. In Wavelet packet analysis, a signal $x(t)$ is represented as a sum of orthogonal Wavelet packet functions $W_{j,b,k}(t)$ at different scales, oscillations and locations:

    $$x(t) = \sum_{j} \sum_{b} \sum_{k} w_{j,b,k}\, W_{j,b,k}(t), \tag{5}$$

    where each $w_{j,b,k}$ is a Wavelet packet coefficient. To compute the Wavelet packet coefficients, a fast splitting algorithm [60] is used, which is an adaptation of the pyramid algorithm [61] for the discrete Wavelet transform. The splitting algorithm differs from the pyramid algorithm in that both the low-pass (L) and high-pass (H) filters are applied to the detail coefficients, in addition to the approximation coefficients, at each stage of the algorithm. Moreover, the splitting algorithm retains all the coefficients, including those at intermediate filtering stages.
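    The difference between the splitting algorithm and the pyramid algorithm can be illustrated on a one-dimensional signal. The sketch below assumes the Haar filter pair and a signal whose length is divisible by $2^{levels}$; both choices, and the function names, are illustrative assumptions rather than the filters used in the chapter. Every band, approximation and detail alike, is split again, and all intermediate levels are retained.

```python
import math

def haar_split(band):
    """One filtering stage: return the (low-pass, high-pass) halves of an even-length band."""
    s = 1.0 / math.sqrt(2.0)
    low = [(band[i] + band[i + 1]) * s for i in range(0, len(band), 2)]
    high = [(band[i] - band[i + 1]) * s for i in range(0, len(band), 2)]
    return low, high

def wavelet_packet_tree(signal, levels):
    """Return a list of levels; level j holds 2**j coefficient bands.

    Unlike the pyramid algorithm, the detail bands are split as well,
    and the coefficients of every intermediate level are kept.
    """
    tree = [[list(signal)]]
    for _ in range(levels):
        next_level = []
        for band in tree[-1]:
            low, high = haar_split(band)
            next_level.extend([low, high])
        tree.append(next_level)
    return tree

tree = wavelet_packet_tree([1, 2, 3, 4, 5, 6, 7, 8], levels=2)
```

    Because the Haar pair is orthonormal, the total energy of the coefficients is the same at every level of the tree. For images, the same scheme applied along rows and columns produces four children (LL, LH, HL, HH) per node.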

  • Fuzzy Techniques for Text Localisation in Images 241

    The Wavelet packet decomposition process can be represented with a quadtree in which the root node is assigned the highest-scale coefficients, that is, the original image itself, while the leaves represent the outputs of the LL, LH, HL and HH filters. Assuming that similar regions of an image have similar frequency characteristics, we infer that these characteristics are captured by some nodes of the quadtree. As a consequence, the proper selection of quadtree nodes should allow for the localisation of similar regions in the image. Learning-based methods are proposed for the automatic selection of nodes describing text or background, as we will illustrate in Sect. 4.3.

    3 Fuzzy Techniques in Image Segmentation

    In the previous section, we have discussed different techniques for image segmentation. Some of the feature extraction methods and most of the algorithms are based on crisp relations, comparisons and thresholding. Such constraints are not well suited to cope with the ambiguity and imprecision present in images, which are very often degraded by noise coming from various sources such as imperfect capturing devices, image digitisation and sampling. Fuzzy techniques provide a mathematical tool to deal with such imprecision and ambiguity in an elegant and efficient way, making it possible to eliminate some of the drawbacks of classical segmentation algorithms. Additionally, the hybrid approach based on the integration of fuzzy logic and neural networks has proved very fruitful. This hybridisation strategy combines the benefits of both methods while eliminating their drawbacks. Neuro-fuzzy networks can be trained in a similar fashion to classical neural networks, but they are also capable of explaining the decision process by representing the knowledge in terms of fuzzy rules. Moreover, the rules can be discovered automatically from data and their parameters can be easily fine-tuned in order to maximise the classification accuracy of the system.

    Neuro-fuzzy hybridisation belongs to the research field of Computational Intelligence, an emerging area in the field of intelligent systems development. This paradigm results from a partnership of different methodologies: Neural Computation, Fuzzy Logic and Evolutionary Programming. Such a consortium is employed to cope with the imprecision of real-world applications, allowing the achievement of robustness, low solution cost and a better rapport with reality [62, 63]. In this section, we introduce the basics of fuzzy theory and neuro-fuzzy hybridisation, while discussing their relevance and application in the context of image analysis.

    3.1 General Theory of Fuzzy Sets

    The incentive for the development of fuzzy logic originates from the observation that people do not require precise, numerical information in order to describe events or facts; rather, they do so using imprecise and fuzzy linguistic terms. Yet they are able to draw the right conclusions from fuzzy information. The theory of fuzzy sets, underpinning the mechanisms of fuzzy logic, was introduced to deal mathematically with the imprecise or vague information that is present in everyday life [64].

    In bi-valued logic, any relation can be either true or false, as determined by crisp criteria of membership. For example, it is easy to determine precisely whether a variable $x$ is greater than a certain number. On the other hand, evaluating whether $x$ is "much greater" than a certain number is ambiguous. In the same way, when looking at a digital document image, we can say that the background is bright and the letters are dark. We are able to identify these classes despite the lack of precise definitions for the words "bright" and "dark": many objects do not have clear criteria of membership. Fuzzy logic handles such situations by introducing continuous intermediate states between true and false. This also makes it possible to represent numerical variables in terms of linguistic labels. The means for dealing with such linguistic imprecision is the concept of a fuzzy set, which permits a gradual degree of membership of an object in relation to a set.

    Let $X$ denote a universe of discourse, or space of points, with its elements denoted as $x$. A fuzzy set $A$ is defined as a set of ordered pairs:

    $$A = \{(x, \mu_A(x)) \mid x \in X\}, \tag{6}$$

    where $\mu_A(x)$ is the membership function of $A$:

    $$\mu_A : X \rightarrow [0, 1], \tag{7}$$

    representing the degree of membership of $x$ in $A$. A single pair $(x, \mu(x))$ is called a fuzzy singleton; thus a fuzzy set can be defined as the union of its singletons. Based on the above definitions, an ordinary set can be derived by imposing the crisp membership condition $\mu_A(x) \in \{0, 1\}$. Graphical examples of crisp and fuzzy sets are shown in Fig. 2.

    Analogously, it is possible to extend operators of ordinary sets to their fuzzy counterparts, giving rise to fuzzy extensions of relations, definitions and so on [65, 66]. In the following, we shall review different fuzzy image features, which are employed in the field of digital image processing. Moreover, we are interested in dealing with the peculiar aspects of fuzzy clustering and the definition of fuzzy and neuro-fuzzy systems.

    Fig. 2. An example of a crisp set and a fuzzy set with Gaussian membership function

    3.2 Fuzzy Image Features

    An $M \times N$ image $f(x, y)$ can be represented as an array of fuzzy singletons denoting pixel grey-level intensities. However, due to the imprecise image formation process, it is more convenient to treat the pixel intensity (or some other image feature, such as edge intensity) as a fuzzy number with a non-singleton membership function, rather than as a crisp number (corresponding to the fuzzy singleton).

    A fuzzy number is a fuzzy set defining a fuzzy interval for a real number, with a membership function that is piecewise continuous. One way of expressing fuzzy numbers is by means of triangular fuzzy sets. A triangular fuzzy number is defined as $A = (a_1, a_2, a_3)$, where $a_1 \le a_2 \le a_3$ are the numbers describing the shape of a triangular membership function:

    $$\mu_A(x) = \begin{cases} 0 & x < a_1, \\ \dfrac{x - a_1}{a_2 - a_1} & a_1 \le x < a_2, \\ 1 & x = a_2, \\ \dfrac{a_3 - x}{a_3 - a_2} & a_2 < x \le a_3, \\ 0 & x > a_3. \end{cases} \tag{8}$$
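    The piecewise definition of the triangular membership function translates directly into code. The following is a minimal sketch; the function name is illustrative.

```python
def triangular_membership(x, a1, a2, a3):
    """Membership degree of x in the triangular fuzzy number (a1, a2, a3), a1 <= a2 <= a3."""
    if x < a1 or x > a3:
        return 0.0
    if x == a2:
        return 1.0
    if x < a2:
        # Rising edge: here a1 <= x < a2, so a2 - a1 > 0.
        return (x - a1) / (a2 - a1)
    # Falling edge: here a2 < x <= a3, so a3 - a2 > 0.
    return (a3 - x) / (a3 - a2)
```

    Testing the case $x = a_2$ before the two edges keeps the function well defined for degenerate triangles in which $a_1 = a_2$ or $a_2 = a_3$.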


    Fuzzy numbers can be applied to incorporate imprecision into image statistics (e.g. histograms). This improves the robustness of such features to noise, which is especially important when the image statistics are derived from small regions, so that the number of observations is small.

    Fuzzy Histogram

    A crisp histogram represents the distribution of pixel intensities in the image over a certain number of bins; hence it reports the probability of observing a pixel with a given intensity. In order to obtain the histogram, each pixel in the image is accumulated in the bin corresponding to its intensity value. In this way, for an image containing $n$ pixels, a histogram representation $H = \{h(1), h(2), \ldots, h(b)\}$ comprising $b$ bins can be obtained. Therefore $h(i) = n_i/n$ denotes the probability that a pixel belongs to the $i$-th intensity bin, where $n_i$ is the number of pixels in the $i$-th bin. However, as the measurements of the intensities are imprecise, each accumulated intensity should also affect the nearby bins, introducing fuzziness into the histogram. The value of each bin in a fuzzy histogram represents the typicality of the pixel within the image rather than its probability. The fuzzy histogram can be defined as $FH = \{fh(1), \ldots, fh(b)\}$, where $fh(i)$ is expressed as follows:

    $$fh(i) = \sum_{j=1}^{n} \mu_j(i), \qquad i = 1, \ldots, b, \tag{9}$$

    where $b$ is the number of bins (corresponding to the number of intensity levels), $n$ is the number of pixels in the image and $\mu_j(i)$ is the membership degree of the intensity level of the $j$-th pixel with respect to the $i$-th bin. Therefore, $\mu_j(i)$ denotes the membership function of a fuzzy number related to the value of the pixel intensity. The value $fh(i)$ can be expressed as the linear convolution between the conventional histogram and the filtering kernel provided by the function $\mu_j(i)$. This approach is possible if all fuzzy numbers have membership functions of the same shape. Hence, the membership function $\mu_l$ of a fuzzy number corresponding to a crisp intensity level $l$ can be expressed as $\mu_l(x) = \mu(x - l)$, where $\mu$ denotes the general membership function, common to all fuzzy numbers accumulated in the histogram. By treating $\mu$ as a convolution kernel, the fuzzy histogram $FH = \{fh(1), \ldots, fh(b)\}$ is obtained by smoothing the crisp histogram as follows:

    $$fh(i) = (h * \mu)(i) = \sum_{l} h(i + l)\, \mu(l), \qquad i = 1, \ldots, b, \tag{10}$$

    where $h(i)$ denotes the $i$-th bin of a crisp histogram.

    In [67] such a smoothing-based approach, where the influence of neighbouring bins is expressed by triangular membership functions, has been used to extract fuzzy histograms of grey images.
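    The smoothing of a crisp histogram by a membership kernel can be sketched as follows. This is an illustration under stated assumptions: bins are indexed from 0 rather than 1, bins outside the histogram are treated as empty, and the three-tap triangular kernel is a hypothetical choice, not the one used in [67].

```python
def crisp_histogram(pixels, bins):
    """Normalised histogram h(i) = n_i / n of integer intensities in 0..bins-1."""
    h = [0.0] * bins
    for p in pixels:
        h[p] += 1.0 / len(pixels)
    return h

def fuzzy_histogram(h, kernel):
    """Smooth h with a membership kernel: fh(i) = sum over l of h(i + l) * mu(l).

    kernel maps each offset l to the membership degree mu(l);
    bins that fall outside the histogram contribute nothing.
    """
    b = len(h)
    return [sum(h[i + l] * mu for l, mu in kernel.items() if 0 <= i + l < b)
            for i in range(b)]

# Hypothetical symmetric triangular kernel spanning one neighbour on each side.
triangular_kernel = {-1: 0.5, 0: 1.0, 1: 0.5}

h = crisp_histogram([2, 2, 2, 2], bins=5)
fh = fuzzy_histogram(h, triangular_kernel)
```

    In this example every pixel has intensity 2, so the crisp histogram has a single non-zero bin, while the fuzzy histogram also assigns a degree of 0.5 to the two neighbouring bins.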

    Fuzzy Co-occurrence Matrix

    The fuzzy co-occurrence matrix is another example of the fuzzification of a crisp feature measure. Being a second-order statistic, it is often employed for measuring the texture features of images. The idea of the classical co-occurrence matrix is to accumulate in the matrix $C$ the co-occurrences of the intensity values $i = f(x_i, y_i)$ and $j = f(x_j, y_j)$ of the pixels $(x_i, y_i)$ and $(x_j, y_j)$, given the spatial offset $(\Delta x, \Delta y)$ separating the pixels. Therefore, each spatial co-occurrence of the intensities $i$ and $j$ is accumulated in the bin $C(i, j)$ of the matrix by increasing the value of the bin by one.

    In the case of the fuzzy co-occurrence matrix $F$, the intensity values of the pixels $(x_i, y_i)$ and $(x_j, y_j)$ are represented by fuzzy numbers with membership functions $\mu_i(x)$ and $\mu_j(x)$. Thus, not only the bin $(i, j)$ should be incremented, but also its neighbouring bins. The amount of the increment $\Delta F(k, l)$ for the bin $F(k, l)$ depends on the fulfilment degrees of the membership functions $\mu_i(k)$ and $\mu_j(l)$, and is calculated as follows:

    $$\Delta F(k, l) = \mu_i(k)\, \mu_j(l). \tag{11}$$


    Similarly to the fuzzy histogram, a fuzzy co-occurrence matrix can be obtained from a crisp co-occurrence matrix by means of the convolution operator. However, as the matrix is two-dimensional, the convolution is performed first along its rows, and then along its columns.
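    As an alternative to the convolution route, the fuzzy co-occurrence matrix can also be accumulated directly, adding the product of the two membership degrees to every bin for each pixel pair. The sketch below takes this direct route; the function names, the argument order of `mu` and the triangular membership with a width of two levels are illustrative assumptions.

```python
def triangular(centre, level, width=2):
    """Membership degree of an intensity level in a fuzzy number centred at `centre`."""
    return max(0.0, 1.0 - abs(level - centre) / width)

def fuzzy_cooccurrence(image, dx, dy, levels, mu=triangular):
    """Accumulate F(k, l) += mu_i(k) * mu_j(l) for every pixel pair separated by (dx, dy)."""
    F = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < cols and 0 <= y2 < rows:
                i, j = image[y][x], image[y2][x2]
                for k in range(levels):
                    for l in range(levels):
                        F[k][l] += mu(i, k) * mu(j, l)
    return F

def crisp(centre, level):
    """Crisp membership: recovers the classical co-occurrence matrix."""
    return 1.0 if level == centre else 0.0

image = [[0, 1], [0, 1]]
C = fuzzy_cooccurrence(image, dx=1, dy=0, levels=2, mu=crisp)
F = fuzzy_cooccurrence(image, dx=1, dy=0, levels=2)
```

    With the crisp membership only the bin $(0, 1)$ is incremented, whereas the triangular membership spreads part of each increment to the neighbouring bins.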

    3.3 Fuzzy Systems

    Fuzzy systems are designed to cope with the imprecision of the input and output variables by defining fuzzy numbers and fuzzy sets that can be expressed by linguistic variables. The working scheme of a fuzzy system is based on a particular inference mechanism where the variables involved are characterised by a number of fuzzy sets with meaningful labels. For example, a pixel grey value can be described using the {"bright", "grey", "dark"} fuzzy sets, an edge can be characterised by the {"weak", "strong"} fuzzy sets, and so on.

    In detail, each fuzzy system is designed to tackle a decision problem by means of a set of $N$ fuzzy rules, called the fuzzy rule base $R$. The rules incorporate a number of fuzzy sets whose membership functions are usually designed by experts in the field of the problem at hand. The $j$-th fuzzy rule in a fuzzy rule base $R$ has the general form:

    $$R_j: \text{If } x_1 \text{ is } A^j_1 \text{ and } x_2 \text{ is } A^j_2 \text{ and } \ldots\ x_n \text{ is } A^j_n \text{ then } y \text{ is } B^j, \qquad j = 1, 2, \ldots, N, \tag{12}$$

    where $x = (x_1, x_2, \ldots, x_n)$ is an input vector, $y$ is an output value and $A^j_i$ and $B^j$ are fuzzy sets.

    The overall process of fuzzy inference is articulated in consecutive steps [68]. In order to infer the output from a crisp input, the first step is to fuzzify the input values. This is achieved by evaluating the degree of membership of each input in each of the fuzzy sets describing the corresponding variable. In this way, an expression for the relation of the $j$-th rule can be found. By interpreting the rule implication with a conjunction-based representation², it is possible to express the relation of the $j$-th rule as follows:

    $$\mu_{R_j}(x_1, x_2, \ldots, x_n, y) = \mu_{A^j_1}(x_1) \wedge \mu_{A^j_2}(x_2) \wedge \ldots \wedge \mu_{A^j_n}(x_n) \wedge \mu_{B^j}(y), \tag{13}$$

    where $\wedge$ denotes the operator generalising the fuzzy AND connective. The aggregation of all fuzzy rules in the rule base is achieved by:

    $$\mu_R(x_1, x_2, \ldots, x_n, y) = \bigvee_{j=1}^{N} \mu_{R_j}(x_1, x_2, \ldots, x_n, y), \tag{14}$$

    where $\vee$ is the operator generalising the fuzzy OR connective, and $\mu_R$ is a membership function characterising a fuzzy output variable.

    ² This kind of interpretation for an IF-THEN rule assimilates the rule to the Cartesian product of the input/output variable space. Such an interpretation is commonly adopted, as in the cases of the Mamdani [69] and Takagi-Sugeno-Kang (TSK) models [70], but it does not represent the only kind of semantics for fuzzy rules [71].

    The last step of the process is defuzzification, which assigns an appropriate crisp value to the fuzzy set $R$ described by the membership function (14), so that a crisp output value is provided at the end of the inference process. For selecting this value, different defuzzification operators can be employed [72], among them the centre of area (evaluating the centroid of the fuzzy output membership function) and the mean, smallest or largest of maxima (evaluating, respectively, the mean, the smallest or the largest of all points at which the membership function attains its maximum).

    No standard techniques are available for transforming human knowledge into a set of rules and membership functions. Usually, the first step is to identify and name the system inputs and outputs. Then, their value ranges should be specified and a fuzzy partition of each input and output should be made. The final step is the construction of the rule base and the specification of the membership functions for the fuzzy sets.

    As an illustrative example, we show how fuzzy systems can be employed to obtain a simple process of text information extraction. Let us consider a decision task based on the classification of small image blocks as text or background. By examining the blocks extracted from the image, it can be observed that the background is usually bright, with little or no variation in grey-scale. On the other hand, a text block either exhibits high variation in grey-scale, as it contains black text pixels and white background pixels, or it is black with small grey-scale variance (in the case of larger heading fonts). These observations make it possible to formulate a set of rules containing linguistic variables, employing features such as the mean and the standard deviation of pixel values:

    R1: IF mean is dark AND std. dev. is low THEN background is low.
    R2: IF mean is dark AND std. dev. is high THEN background is low.
    R3: IF mean is grey AND std. dev. is high THEN background is low.
    R4: IF mean is white AND std. dev. is low THEN background is high.

    The foregoing simple rules allow us to infer the membership degree $b_i$ of the $i$-th block to the background class, while the membership degree $t_i$ to the text class can be obtained simply as $t_i = 1 - b_i$.

    In order to obtain the segmentation of a document image, the image should be partitioned into a regular grid of small blocks (e.g. of size $4 \times 4$ or $8 \times 8$ pixels, depending on the size of the image). Subsequently, the fuzzy rules are evaluated based on the features of each block. Figure 3 illustrates the sets of membership functions defined for the input values. Figure 4 illustrates the inference process for a sample input: each row corresponds to one of the rules in the rule base previously described, with two input membership functions and one output membership function. Degrees of membership (vertical lines) are calculated based on illustrative crisp inputs (mean = 193, std. dev. = 32). The activation function of each rule is calculated by adopting the min function, according to (13). Finally, all activation functions are aggregated using the max function, according to (14). The crisp value (equal to 0.714, as shown in Fig. 4) is calculated by defuzzifying the output value, employing the centre of area method. Results obtained by employing this approach on a sample document image are presented in Fig. 5.

    Fig. 3. Membership functions of the variables mean (a) and std. dev. (b) employed for segmentation of document images

    Fig. 4. Fuzzy inference process performed over illustrative input values
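    The inference procedure described above (min for rule activation, max for aggregation, centre of area for defuzzification) can be sketched in a few lines. The membership functions below are hypothetical stand-ins, since the shapes plotted in Fig. 3 are not reproduced in the text; consequently this sketch is not expected to return the value 0.714 reported for the chapter's example.

```python
def ramp_down(x, lo, hi):
    """1 below lo, 0 above hi, linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)

def ramp_up(x, lo, hi):
    return 1.0 - ramp_down(x, lo, hi)

def triangle(x, a, b, c):
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

# Hypothetical membership functions (assumed shapes and breakpoints).
MEAN = {"dark": lambda v: ramp_down(v, 50, 120),
        "grey": lambda v: triangle(v, 60, 128, 200),
        "white": lambda v: ramp_up(v, 130, 220)}
STD = {"low": lambda v: ramp_down(v, 10, 40),
       "high": lambda v: ramp_up(v, 20, 60)}
BACKGROUND = {"low": lambda y: 1.0 - y, "high": lambda y: y}

# Rules R1-R4: (mean label, std. dev. label, background label).
RULES = [("dark", "low", "low"), ("dark", "high", "low"),
         ("grey", "high", "low"), ("white", "low", "high")]

def background_degree(mean, std, steps=100):
    """Centre-of-area output of the rule base, or None when no rule fires."""
    # Rule activation: min of the two antecedent degrees (fuzzy AND).
    activations = [(min(MEAN[m](mean), STD[s](std)), out) for m, s, out in RULES]
    num = den = 0.0
    for i in range(steps + 1):
        y = i / steps
        # Clip each consequent by its activation (min), aggregate rules with max.
        agg = max(min(a, BACKGROUND[out](y)) for a, out in activations)
        num += y * agg
        den += agg
    return num / den if den > 0 else None
```

    A bright, uniform block such as `background_degree(230, 5)` yields a background degree above 0.5, while a dark, uniform block such as `background_degree(30, 5)` yields a degree below 0.5. The four rules do not cover every combination of labels, so some inputs activate no rule at all.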

    3.4 Fuzzy C-Means Clustering

    Traditional clustering approaches generate partitions where each pattern belongs to one and only one cluster. Fuzzy clustering extends this notion using the concept of membership function. In this way, the output of this kind of fuzzy algorithm is a clustering but not a partition. The Fuzzy C-Means method of clustering was developed by Dunn in [73] and improved by Bezdek in [74], and it is frequently used in data clustering problems. Fuzzy C-Means (FCM) is a partitional method derived from K-Means clustering [75]. The main difference between FCM and K-Means is that the former allows one piece of data to belong to many clusters with certain membership degrees. In other words, the partitioning of the data is fuzzy rather than crisp.

    Fig. 5. Document image segmentation with employment of a fuzzy system. Original document image (a), obtained segmentation (b)

    Given the number of clusters $m$, the distance metric $d(x, y)$ and an objective function $J$, the goal is to assign the samples $\{x_i\}_{i=1}^{k}$ to clusters. In particular, the Fuzzy C-Means algorithm is based on the minimisation of the following objective function:

    $$J_s = \sum_{j=1}^{m} \sum_{i=1}^{k} (u_{ij})^{s}\, d(x_i, c_j)^2, \qquad 1 < s\ \ldots$$


    … by low membership values of the same cluster, or adjacent high membership values of different clusters. Examples of such penalised objective functions were proposed in [76].

    The Fuzzy C-Means method has been applied in a variety of image segmentation problems, such as medical imaging [77] or remote sensing [78].
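    For illustration, the textbook form of Fuzzy C-Means alternates two updates until the memberships stabilise. The sketch below follows the standard formulation with Euclidean distance and a fixed number of iterations; these update rules are the commonly used ones and are stated here as an assumption, and the function names and the caller-supplied initial centres are illustrative.

```python
import math

def memberships(samples, centres, s):
    """Standard membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (s - 1))."""
    u = []
    for x in samples:
        # Clamp distances to avoid division by zero when a sample sits on a centre.
        d = [max(math.dist(x, c), 1e-12) for c in centres]
        u.append([1.0 / sum((dj / dl) ** (2.0 / (s - 1.0)) for dl in d) for dj in d])
    return u

def fuzzy_c_means(samples, centres, s=2.0, iterations=50):
    """samples and centres are lists of equal-length tuples; s > 1 is the fuzzifier."""
    dims = len(samples[0])
    for _ in range(iterations):
        u = memberships(samples, centres, s)
        # Centre update: mean of the samples weighted by u_ij ** s.
        new_centres = []
        for j in range(len(centres)):
            w = [row[j] ** s for row in u]
            total = sum(w)
            new_centres.append(tuple(
                sum(wi * x[dim] for wi, x in zip(w, samples)) / total
                for dim in range(dims)))
        centres = new_centres
    return memberships(samples, centres, s), centres

samples = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
u, centres = fuzzy_c_means(samples, centres=[(0.0,), (5.0,)])
```

    Each row of `u` sums to one, so every sample distributes a unit of membership across the clusters instead of being assigned to a single cluster.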

    3.5 Neuro-Fuzzy Systems

    The integration of fuzzy logic and neural networks boasts a consolidated presence in the scientific literature [79–83]. The motivations behind the success of this kind of combination can be easily assessed by referring to the issues introduced in the previous section. In fact, by means of fuzzy logic it is possible to facilitate the understanding of decision processes and to provide a natural way of interpreting linguistic rules. On the other hand, rules in fuzzy systems cannot be acquired automatically. The process of designing rules and membership functions is always human-driven and proves difficult, especially in the case of complex systems. Additionally, tuning the fuzzy membership functions representing linguistic labels is a very time-consuming process, but it is essential if accuracy is a matter of concern [84].

    Neural networks are characterised by somewhat opposite properties. They have the ability to generalise and to learn from data, obtaining knowledge to deal with previously unseen patterns. The learning process is relatively slow for large sets of training data, and additional information about the problem cannot be integrated into the learning procedure in order to simplify it and speed up the computation. A trained neural network can classify patterns accurately, but the decision process is obscure to the user. In fact, information is encoded in the connections between the neurons; therefore the extraction of structural knowledge from the neural network is very difficult.

    Neuro-fuzzy systems make it possible to extract fuzzy rules from data during the knowledge discovery process. Moreover, the membership functions inside each rule can be easily tuned, based on the information embedded in the data. For both tasks, expert intervention can be avoided by resorting to neural learning, which requires a training set $T$ of $t$ samples. In particular, the $i$-th sample in the training set is a pair of input/output vectors $(\mathbf{x}_i, \mathbf{y}_i)$, therefore $T = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_t, \mathbf{y}_t)\}$. In the case of classification problems, the input vector $\mathbf{x}_i$ is an $m$-dimensional vector containing the $m$ measurements of the input features, while the output vector $\mathbf{y}_i$ is an $n$-dimensional binary vector codifying the membership of $\mathbf{x}_i$ in each of the $n$ classes (i.e., $\mathbf{y}_i$ is one of the standard basis vectors of $\mathbb{R}^n$).

    In the following, we are going to introduce the peculiar scheme of a neuro-fuzzy model, whose application in text localisation problems will be detailedin the next section.


    A Peculiar Scheme for a Neuro-Fuzzy System

    The fuzzy component of the neuro-fuzzy system is represented by a particular fuzzy inference mechanism whose general scheme is comparable to the Takagi-Sugeno-Kang (TSK) fuzzy inference method [70]. The fuzzy rule base is composed of $K$ fuzzy rules, where the $k$-th rule is expressed in the form:

    $$R_k: \text{If } x_1 \text{ is } A^{(k)}_1 \text{ and } \ldots \text{ and } x_m \text{ is } A^{(k)}_m \text{ then } y_1 \text{ is } b^{(k)}_1 \text{ and } \ldots \text{ and } y_n \text{ is } b^{(k)}_n,$$


    where $\mathbf{x} = (x_1, \ldots, x_m)$ is the input vector, $\mathbf{y} = (y_1, \ldots, y_n)$ is the output vector, $(A^{(k)}_1, \ldots, A^{(k)}_m)$ are fuzzy sets defined over the elements of the input vector $\mathbf{x}$, and $(b^{(k)}_1, \ldots, b^{(k)}_n)$ are fuzzy singletons defined over the elements of the output vector $\mathbf{y}$. Each of the fuzzy sets $A^{(k)}_i$ is defined in terms of a Gaussian membership function $\mu^{(k)}_i$:

    $$\mu^{(k)}_i(x_i) = \exp\left(-\frac{\big(x_i - c^{(k)}_i\big)^2}{2\big(\sigma^{(k)}_i\big)^2}\right), \tag{18}$$

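    where $c^{(k)}_i$ is the centre and $\sigma^{(k)}_i$ is the width of the Gaussian function. The rule fulfilment degree of the $k$-th rule is evaluated using the formula:

    $$\mu^{(k)}(\mathbf{x}) = \prod_{i=1}^{m} \mu^{(k)}_i(x_i), \tag{19}$$

    where the product function is employed to interpret the AND connective. The final output of the fuzzy model can be expressed as:

    $$y_j = \frac{\sum_{k=1}^{K} \mu^{(k)}(\mathbf{x})\, b^{(k)}_j}{\sum_{k=1}^{K} \mu^{(k)}(\mathbf{x})}, \qquad j = 1, \ldots, n. \tag{20}$$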

    In classification tasks, the elements of the output vector $\mathbf{y}$ express, in the range $[0, 1]$, the membership degrees of the input pattern for each of the classes. In order to obtain a binary output vector $\mathbf{y}' = \{y'_j\}_{j=1}^{n}$, the defuzzification of the output vector $\mathbf{y}$ is performed as follows:

    $$y'_j = \begin{cases} 1 & \text{if } y_j = \max(\mathbf{y}), \\ 0 & \text{otherwise.} \end{cases} \tag{21}$$

    By means of (21), the input pattern is classified in accordance with the highest membership degree.
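    The inference scheme of this fuzzy model (Gaussian membership degrees, product conjunction, singleton-weighted output and winner-takes-all defuzzification) can be sketched as below. The function names and the rule data structure are illustrative, and the scaling of the Gaussian exponent by $2\sigma^2$ is an assumption of this sketch. Where several outputs tie for the maximum, the sketch marks only the first one.

```python
import math

def rule_activation(x, centres, widths):
    """Fulfilment degree of one rule: product of Gaussian membership degrees."""
    activation = 1.0
    for xi, c, sigma in zip(x, centres, widths):
        # Assumed Gaussian form: exp(-(x - c)^2 / (2 * sigma^2)).
        activation *= math.exp(-((xi - c) ** 2) / (2.0 * sigma ** 2))
    return activation

def infer(x, rules):
    """rules is a list of (centres, widths, singletons) triples, one per fuzzy rule.

    Returns the output vector y and its binary (defuzzified) counterpart.
    """
    activations = [rule_activation(x, c, w) for c, w, _ in rules]
    total = sum(activations)  # assumed non-zero, i.e. x is not far from all rules
    n = len(rules[0][2])
    y = [sum(a * b[j] for a, (_, _, b) in zip(activations, rules)) / total
         for j in range(n)]
    winner = max(range(n), key=lambda j: y[j])
    y_binary = [1 if j == winner else 0 for j in range(n)]
    return y, y_binary

# Hypothetical rule base with m = 2 inputs, K = 2 rules and n = 2 classes.
rules = [((0.0, 0.0), (1.0, 1.0), (1.0, 0.0)),
         ((4.0, 4.0), (1.0, 1.0), (0.0, 1.0))]
y, y_binary = infer((0.2, -0.1), rules)
```

    In the neural interpretation of such a model, the centres and widths play the role of the membership-layer parameters, while the singletons act as the weights feeding the output layer.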

    The neural component of the neuro-fuzzy system is represented by a particular neural network which reflects in its topology the structure of the previously presented fuzzy inference system. The network is composed of four layers with the following characteristics:


    Layer 1 provides the crisp input vector $\mathbf{x} = (x_1, \ldots, x_m)$ to the network. This layer does not perform any calculation and the input vector values are simply passed to the second layer.

    Layer 2 realises a fuzzification of the input variables. Units in this layer are organised into $K$ distinct groups. Each group is associated with one of the fuzzy rules, and it is composed of $m$ units, corresponding to the $m$ fuzzy sets in the fuzzy rule. The $i$-th unit in the $k$-th group, connected with the $i$-th neuron in layer 1, evaluates the Gaussian membership degree of the fuzzy set $A^{(k)}_i$, according to (18).

    Layer 3 is composed of $K$ units. Each of them performs the precondition matching of one of the rules and reports its fulfilment degree, in accordance with (19). The $i$-th unit in this layer is connected with all units in the $i$-th group of layer 2.

    Layer 4 supplies the final output vector $\mathbf{y}$ and is composed of $n$ units. The $i$-th unit in this layer evaluates the element $y_i$, according to (20). In particular, the fulfilment degrees of the rules are weighted by the fuzzy singletons, which are encoded as the values of the connection weights between layer 3 and layer 4.

    Figure 6 depicts the structure of the above described neuro-fuzzy network, with reference to a neuro-fuzzy system with two inputs, three rules and two outputs.

    Fig. 6. Structure of the neuro-fuzzy network coupled with a neuro-fuzzy system exhibiting two inputs, three rules and two outputs ($m = 2$, $K = 3$, $n = 2$)


    As regards the learning procedure of the neuro-fuzzy network, two distinct steps are involved. The first is devoted to discovering the initial structure of the neuro-fuzzy network. Subsequently, the parameters of the fuzzy rules are refined, so that the overall classification accuracy is improved. During the first step, a clustering of the input data is performed by an unsupervised learning process of the neuro-fuzzy network: each cluster corresponds to one of the nodes in the rule layer of the neuro-fuzzy network. The clustering process is able to derive the proper number of clusters. In fact, a rival penalised mechanism is employed to adaptively determine the suitable structure of the network and therefore the number of fuzzy rules (starting from a guessed number). In this way, initial knowledge is extracted from data and expressed in the form of a rule base. The obtained knowledge is then refined during the second step, where a supervised learning process of the neuro-fuzzy network is accomplished (based on a gradient descent technique), in order to attune the parameters of the fuzzy rule base to the numerical data. For the sake of conciseness, we omit further mathematical details concerning the learning algorithms, referring the reader to [85].

    4 Text Localisation: Illustrative Applications

    As previously stated, the different techniques for image segmentation present some drawbacks. Classical top-down approaches, based on run-length encoding and projection profiles, are sensitive to skewed text and perform well only with highly structured page layouts. Bottom-up approaches, in contrast, are sensitive to font size, scanning resolution, and interline and inter-character spacing.

    To overcome these problems, the employment of Computational Intelligence methods can be beneficial. Here we detail some of our experiments with fuzzy and neuro-fuzzy techniques. With reference to the classification directions proposed in this chapter, the first approach we are going to introduce can be classified as a region-based approach, which stands as a preliminary naive formulation of our research activity [86]. The image regions involved are classified as text or graphic regions on the basis of their appearance (regularity) and shape. The classification process is realised by employing the peculiar neuro-fuzzy model described in Sect. 3.5.

    The second approach is somewhat more involved and is related to a multi-resolution segmentation scheme, belonging to the category of edge-based bottom-up approaches [87]. Here pixels are classified as text, graphics, or background, in accordance with their grey-level intensity and edge strength values, extracted from different resolution levels. In order to improve the segmentation results obtained from the initial pixel-level classification phase, a region-level analysis phase is performed. Both steps, namely pixel-level analysis and region-level analysis, are realised by employing the already mentioned neuro-fuzzy methodology.


    The third approach, an example of a texture-based bottom-up approach, is based on a more sophisticated tool for multi-resolution analysis, the Discrete Wavelet Packet Transform [88]. To discriminate between text and non-text regions, the image is transformed into a Wavelet packet analysis tree. Subsequently, the feature image exploited for the segmentation of text and non-text regions is obtained from some of the nodes selected from the quadtree. The most discriminative nodes are derived using an optimality criterion and a genetic algorithm. Finally, the obtained feature image is segmented by means of Fuzzy C-Means clustering.

    All the proposed segmentation approaches have been evaluated using the Document Image Database available from the University of Oulu [89]. This database includes 233 images of articles, scanned from magazines and newspapers, books and manuals. The images vary both in quality and contents: some of them contain text paragraphs only (with Latin and Cyrillic fonts of different sizes), while others contain mixtures of text, pictures, photos, graphs and charts. Moreover, not all the documents are characterised by a regular (Manhattan) page layout.

    4.1 Text Region Classification by a Neuro-Fuzzy Approach

    The idea at this stage is to exploit a neuro-fuzzy classifier to label the different regions composing a document image. The work assumes that a database of segmented images is available, from which it is possible to extract a set of numerical features.

    The first step is a feature extraction process and consists in detecting the skew angle of each region as the dominant orientation of the straight lines passing through that region. Inside the text regions, which are composed of characters and words, the direction of the text lines will be highly regular. This regularity can be captured by means of the Hough transform [22, 90–92]. In particular, the skew angle is detected as the angle for which the Hough transform of a specific region has the maximum value.

    The retrieved skew angle is used to obtain the projection profile of the document region. The profile is calculated by accumulating pixel values in the region along its skew angle, so that the one-dimensional projection vector $v_p$ is obtained. The elements of $v_p$ encode information about the spatial structure of the analysed region. For a text region, $v_p$ should have a regular, high-frequency, sinusoidal-like shape, with peaks and valleys corresponding to the text lines and the interline spacings, respectively. In contrast, such regularities cannot be observed when a graphics region is considered. To measure the regularity of the $v_p$ vector, Power Spectral Density (PSD) [22] analysis is performed. For large paragraphs of text, the PSD coefficients show a significant peak around the frequency value corresponding approximately to the number of text lines in the region. For graphic regions, instead, the spectrum presents only a few peaks (one or two) around the lowest frequency values. A vector $v_{psd}$ of PSD coefficients is calculated as follows:

    $$ v_{psd} = |FT(v_p)|^2, \tag{22} $$

    where $FT(\cdot)$ denotes the Fourier Transform [93]. An illustrative projection profile and its PSD spectrum for a sample text region are presented in Fig. 7.

    Fig. 7. A region of a document image (a), its projection profile calculated for a skew angle of 90 degrees (b) and the PSD spectrum of the profile (c)
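As a rough illustration of Eq. (22), the sketch below builds a synthetic region with eight evenly spaced text lines and checks where its PSD peaks. It assumes the region has already been rotated so that the text lines are horizontal, and it subtracts the mean of the profile before the transform (an assumption not stated in the text, made so that the constant term does not dominate). The function names are hypothetical.

```python
import numpy as np

def projection_profile(region):
    """Accumulate pixel values along each row (text lines assumed horizontal)."""
    return np.asarray(region, dtype=float).sum(axis=1)

def power_spectrum(v_p):
    """PSD coefficients v_psd = |FT(v_p)|^2, following Eq. (22)."""
    centred = v_p - v_p.mean()          # assumption: discard the constant term
    return np.abs(np.fft.rfft(centred)) ** 2

# Synthetic "text" region: 8 lines, each 4 rows of ink followed by 4 blank rows.
region = np.zeros((64, 32))
region[np.arange(64) % 8 < 4, :] = 1.0
v_psd = power_spectrum(projection_profile(region))
peak = int(np.argmax(v_psd))            # index of the dominant frequency
```

For this synthetic region the dominant frequency index equals the number of text lines, which is the behaviour the paragraph above describes for large text paragraphs.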

    Generally, the number of components of the PSD spectrum vector $v_{psd}$ is too large to be used directly as a feature vector for the classification task. In order to reduce the dimensionality of $v_{psd}$, it can be divided into a number of intervals. In particular, we considered intervals of different lengths, corresponding to the Fibonacci sequence scaled by a multiplying factor of two (i.e., 2, 4, 6, 10, 16, 26, 42, ...). In this way, we are able to preserve and exploit most of the information accumulated in the first part of the PSD spectrum. For each interval, the maximum value of $v_{psd}$ is derived, and the obtained maxima (normalised with respect to the highest one) represent the first seven components of the feature vector $v_f$, which will be employed in the subsequent region classification stage. To increase classification accuracy, statistical information concerning the connectivity of the analysed region is also extracted, thus extending the feature vector beyond the seven PSD components. At the end of the overall feature extraction process, every region of the segmented document image is represented as a feature vector $v_f$ with ten elements, which are used for classification.
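The interval-based reduction can be sketched as follows. This is only an illustration under stated assumptions: the intervals are taken to be consecutive and to start at the first PSD coefficient, the spectrum is assumed to have at least 106 coefficients and a non-zero maximum, and the names `fibonacci_lengths` and `psd_features` are hypothetical. The three connectivity features are not covered.

```python
import numpy as np

def fibonacci_lengths(n, factor=2):
    """First n interval lengths: Fibonacci numbers 1, 2, 3, 5, ... times factor."""
    a, b = 1, 2
    lengths = []
    for _ in range(n):
        lengths.append(factor * a)
        a, b = b, a + b
    return lengths

def psd_features(v_psd, n_intervals=7):
    """Maximum of v_psd in each interval, normalised by the largest maximum."""
    edges = np.concatenate(([0], np.cumsum(fibonacci_lengths(n_intervals))))
    maxima = np.array([v_psd[s:e].max() for s, e in zip(edges[:-1], edges[1:])])
    return maxima / maxima.max()

# Example with a monotonically increasing toy spectrum of 106 coefficients.
features = psd_features(np.arange(1.0, 107.0))
```

Because the interval lengths grow, the low-frequency part of the spectrum is sampled more finely than the high-frequency part.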

    The final step is the classification of the regions described in terms of the feature vector $v_f$. This classification has been performed by means of the neuro-fuzzy system introduced in Sect. 3.5. In the course of the experimental session concerning the image region classification, the input vector $x$ involved in the fuzzy inference model corresponds to the ten-dimensional feature vector $v_f$ derived during the feature extraction process. The output vector $y$ is related to the classes of the classification task (i.e., textual and graphical regions).

    The overall algorithm can be summarised as follows:

    For each region:

    1. Calculate the skew angle by means of the Hough transform.
    2. Obtain the projection profile $v_p$ of the region along its skew angle.
    3. Calculate $v_{psd}$ from $v_p$.
    4. Obtain $v_f$ by dividing $v_{psd}$ into intervals.
    5. Classify the region as text or graphics on the basis of $v_f$ by means of the neuro-fuzzy inference.

    4.2 Text Localisation by a Neuro-Fuzzy Segmentation

    The idea at this stage consists in exploiting a neuro-fuzzy classifier for achieving both the segmentation of a document image and the final labelling of the derived regions. The described work is related to an edge-based approach for document segmentation, aiming at the identification of text, graphic and background regions. The overall methodology is based on the execution of two successive steps, working at different levels, configuring a bottom-up approach. In particular, an edge-based pre-processing step concerns a pixel level analysis, devoted to a preliminary classification of each image pixel into one of the previously described general classes. From the results of this phase, coherent regions are obtained by a merging procedure. To refine the obtained segmentation, an additional post-processing is performed at region level, on the basis of shape regularity and skew angle analysis. This post-processing phase is beneficial for obtaining a final accurate segmentation of the document image. The peculiarity of the proposed approach relies on the employment of the neuro-fuzzy system both in the pre-processing pixel level analysis and in the post-processing region level refinement.

    Low-Level Pixel Analysis

    The aim of the low-level pixel analysis is to classify each pixel of a document image $f(x, y)$ into the text, background or graphic category, according to its grey level and edge strength values. When extracting features from image data, the type of information that can be obtained may be strongly dependent on the scales at which the feature detectors are applied [94]. This can be perceptually verified with ease: when an image is viewed from near to far, the edge strength of a pixel generally decreases, but the relative decreasing rates for contour, regular and texture points are different. Moving from this kind of observation, we followed a multi-scale analysis of the image: assuming that an image $f(x, y)$ is given, let $R$ be the number of scale representations considered for our analysis. In this way, a set of images $\{f^{(1)}(x, y), \ldots, f^{(R)}(x, y)\}$ is involved and an edge map $e(x, y)$ can be obtained from each image by means of the Sobel operator [22]. Since the information extracted from image data is strongly dependent on the image scale at which the feature detectors are applied, we have represented the images $f(x, y)$ and $e(x, y)$ as Gaussian pyramids with $R$ different resolution levels. In the pyramid, the image at level $r+1$ is generated from the image at level $r$ by means of down-sampling by a factor of 2. Therefore, a set of edge maps $\{e^{(1)}(x, y), \ldots, e^{(R)}(x, y)\}$ is generated during the creation of the pyramids and associated with the set of multi-scaled images.

    By examining the luminosity and edge strength information of the image at different resolution levels, it is possible to formulate a set of rules that enables the pixel classification. In this way, a pixel $(x, y)$ is characterised by a feature vector of length $2R$, containing information about intensity and edge strength at different resolution levels. Such a feature vector $v_{xy}$ can be formalised as:

    $$ v_{xy} = \big( f^{(1)}(x, y), f^{(2)}(x/2, y/2), \ldots, f^{(R)}(x/2^{R-1}, y/2^{R-1}), \; e^{(1)}(x, y), e^{(2)}(x/2, y/2), \ldots, e^{(R)}(x/2^{R-1}, y/2^{R-1}) \big). \tag{23} $$

    In order to derive a set of applicable rules encoding accurate information, we exploited the neuro-fuzzy system introduced in Sect. 3.5, which automatically derives a fuzzy rule base from a training set of manually labelled pixels. In this case, the neuro-fuzzy network has $2R$ inputs (corresponding to the elements of the vector $v_{xy}$), while its three output classes correspond to the recognised pixel categories (text, background, graphic).
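The construction of the multi-scale pixel feature vector of Eq. (23) can be sketched in plain NumPy. Several choices below are assumptions made for illustration only: a separable 5-tap binomial filter stands in for the Gaussian smoothing, image borders are handled by edge replication, the Sobel response is taken as the gradient magnitude, and all function names are hypothetical.

```python
import numpy as np

def smooth(img):
    """Separable 5-tap binomial blur, used here as a stand-in for a Gaussian."""
    k = np.array([1, 4, 6, 4, 1], dtype=float) / 16
    p = np.pad(img, 2, mode="edge")
    rows = sum(k[i] * p[:, i:i + img.shape[1]] for i in range(5))
    return sum(k[i] * rows[i:i + img.shape[0], :] for i in range(5))

def sobel_magnitude(img):
    """Edge strength as the magnitude of the Sobel gradient."""
    p = np.pad(img, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)

def pyramid(img, levels):
    """Level r+1 is level r smoothed and down-sampled by a factor of 2."""
    out = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        out.append(smooth(out[-1])[::2, ::2])
    return out

def pixel_features(img, x, y, levels=3):
    """Eq. (23): R intensities followed by R edge strengths for pixel (x, y)."""
    pyr = pyramid(img, levels)
    edges = [sobel_magnitude(level) for level in pyr]
    coords = [(y >> r, x >> r) for r in range(levels)]   # (row, col) at level r
    f = [pyr[r][c] for r, c in enumerate(coords)]
    e = [edges[r][c] for r, c in enumerate(coords)]
    return np.array(f + e)
```

Calling `pixel_features(img, x, y, levels=3)` returns a vector of length 6, matching the 2R inputs of the network for a three-level pyramid.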

    The obtained fuzzy rule base is applied to perform the pixel classification process, which ultimately produces three binary images: $b_{tex}(x, y)$, $b_{gra}(x, y)$ and $b_{bac}(x, y)$. The images are composed of the pixel candidates for text, graphic and background regions, respectively. In order to obtain more coherent regions, a merging procedure is applied to each of the binary images, on the basis of a set of predefined morphological operations (including well-known image processing techniques such as erosion, dilation and hole filling [95]).

    High-Level Region Analysis

    The high-level region analysis is intended to refine the text information extraction process. In other words, this step aims at detecting and correcting misclassified text regions identified during the previous analysis. To do that, the shape properties of every text region are analysed as follows. By examining the image $b_{tex}$, containing text regions, we first extract a number of connected components $\{E_t\}_{t=1}^{T}$ representing the text regions to be analysed. In particular, we are interested in processing the images composed of the pixels representing the perimeter of each region $E_t$. Each of them is mapped by the Hough transform from the spatial coordinates of $E_t(x, y)$ to the polar coordinates of $H_t(d, \theta)$, where $d$ denotes the distance from the line to the origin, and $\theta \in [0, \pi)$ is the angle between this line and the x axis. The one-dimensional function

    $$ h(\theta) = \max_{d} H_t(d, \theta), \tag{24} $$

    (which is evaluated for each value of $\theta$) contains information about the angles of the most dominant lines in the region $E_t$. In general, for a rectangular region with a skew angle of $\alpha$ degrees, the plot of $h(\theta)$ has two significant maxima located at:

    $$ \begin{cases} \theta_1 = \alpha \text{ degrees} \\ \theta_2 = \alpha + 90 \text{ degrees}, \end{cases} \tag{25} $$

    corresponding to the principal axes of the region. The presence or absence of such maxima is exploited to classify each text region as rectangular or non-rectangular, respectively.

    To obtain a set of linguistic rules suitable for this novel classification task, the neuro-fuzzy model adopted for classifying the image pixels is employed once again. In this case, the input vector $x$ is defined in terms of 20 elements, which synthetically describe the information content of $h(\theta)$. In particular, the normalised values of $h(\theta)$ have been divided into 20 intervals of equal length, and the elements of $x$ represent the mean values of $h(\theta)$ in each interval. The number of intervals has been selected empirically as a compromise between the length of the input vector (and thus the complexity of the neuro-fuzzy network structure) and the amount of information required for the subsequent classification task (the accuracy of the classification). Moreover, $h(\theta)$ has been normalised, as the amplitude of the function carries information about the size of the region, which is irrelevant in this particular case and would hamper the classification process. The region $E_t$ under analysis is ultimately classified into one of two possible output classes: non-rectangular shape (in this case $E_t$ is definitively labelled as a graphic region) and rectangular shape. The latter case opens the way for an analysis of the skew angle value. In particular, the skew angle $\alpha_t$ of a region $E_t$ is chosen as the minimum angle value $\theta_1^t$ (see (25)), while the overall skew angle $\alpha$ of the document is chosen as the most frequently occurring skew angle among all rectangular regions. Subsequently, simple thresholding is applied: if $|\alpha_t - \alpha|$ is greater than some small angle $\beta$, then the rectangular region $E_t$ is re-classified as a graphic region; otherwise, $E_t$ retains its original text classification. Finally, graphic regions are recursively enlarged by the bounding boxes surrounding them, which are aligned according to $\alpha$.
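The region-level features and the skew test described above can be sketched as follows. The sketch assumes that a Hough accumulator is already available as an array `acc` with one row per distance value and one column per angle, that the number of angle columns is divisible by 20, and that the region skew, document skew and tolerance are plain numbers in degrees. All names are hypothetical, and the neuro-fuzzy classifier itself is not shown.

```python
import numpy as np

def angle_profile(acc):
    """h(theta) = max over d of the accumulator H_t(d, theta), as in Eq. (24)."""
    return acc.max(axis=0)

def shape_features(h, n_bins=20):
    """Normalise h and average it over n_bins intervals of equal length."""
    h = h / h.max()
    return h.reshape(n_bins, -1).mean(axis=1)

def document_skew(region_skews):
    """Most frequently occurring skew angle among the rectangular regions."""
    values, counts = np.unique(region_skews, return_counts=True)
    return values[np.argmax(counts)]

def keep_as_text(region_skew, doc_skew, tolerance):
    """A rectangular region stays text only if it is aligned with the document."""
    return abs(region_skew - doc_skew) <= tolerance

# Toy accumulator for an axis-aligned rectangle: peaks at 0 and 90 degrees.
acc = np.zeros((50, 180))
acc[10, 0] = 30
acc[20, 90] = 20
x = shape_features(angle_profile(acc))
```

Normalising `h` before averaging removes the dependence on region size, which is the motivation given in the text.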

    The overall proposed algorithm can be summarised as follows:

    For an input document image $f(x, y)$:

    1. Create a Gaussian pyramid $\{f^{(1)}(x, y), \ldots, f^{(R)}(x, y)\}$.
    2. For each level $f^{(i)}(x, y)$ of the pyramid, apply the Sobel operator to calculate its edge image $e^{(i)}(x, y)$.
    3. Classify each pixel of the image as text, graphics or background according to the values of luminosity and edge strength in the pyramid. Create three binary images, $b_{tex}(x, y)$, $b_{gra}(x, y)$ and $b_{bac}(x, y)$, according to the classification results.
    4. Process $b_{tex}(x, y)$ and $b_{gra}(x, y)$: apply a median filter, apply dilation, remove small holes from the regions, apply erosion.
    5. For each connected component $E_t$ in $b_{tex}$, obtain its perimeter (by removing interior pixels) and calculate its skew angle $\alpha_t$. Additionally, classify $E_t$ as rectangular or non-rectangular.
    6. Calculate a histogram of the skew angles of the connected components classified as rectangular. The most frequently occurring value is chosen as the overall skew angle $\alpha$.
    7. For each connected component $E_t$: if it is non-rectangular or it is not aligned with the overall skew angle, then reclassify it as a graphics region: $b_{tex}(x, y) = b_{tex}(x, y) \setminus E_t(x, y)$, $b_{gra}(x, y) = b_{gra}(x, y) \cup E_t(x, y)$.
    8. Enlarge the graphics regions in $b_{gra}$ with bounding boxes aligned to $\alpha$.
    9. Set the binary image of the background as $b_{bac}(x, y) = \neg(b_{tex}(x, y) \cup b_{gra}(x, y))$.
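The mask bookkeeping in steps 7 and 9 can be written compactly with boolean arrays. This is a minimal sketch under the assumption that the operators lost in those steps are set difference, union and complement; the function names are hypothetical and the region masks are assumed to be boolean arrays of the same shape as the image.

```python
import numpy as np

def reclassify_as_graphics(b_tex, b_gra, region):
    """Move the pixels of one connected component from the text mask to the graphics mask."""
    return b_tex & ~region, b_gra | region

def background_mask(b_tex, b_gra):
    """Background is whatever is neither text nor graphics."""
    return ~(b_tex | b_gra)

b_tex = np.array([[1, 1, 0], [0, 0, 0]], dtype=bool)
b_gra = np.array([[0, 0, 0], [0, 1, 0]], dtype=bool)
region = np.array([[0, 1, 0], [0, 0, 0]], dtype=bool)   # component to re-label

b_tex, b_gra = reclassify_as_graphics(b_tex, b_gra, region)
b_bac = background_mask(b_tex, b_gra)
```

After these operations the three masks are disjoint and together cover every pixel of the image.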

    4.3 Text Localisation by Wavelet Packet Segmentation

    In this section we propose our methodology for document page segmentation into text and non-text regions based on Discrete Wavelet Packet Transforms. This approach represents an extension of the work presented in Sect. 4.2, which is based on Gaussian image pyramids. In fact, two-dimensional Wavelet analysis is a more sophisticated tool for multi-resolution analysis than image pyramids.

    The main concern of the methodology is the automatic selection of the Wavelet packet coefficients describing text or background regions. Wavelet packet decomposition acts as a set of band-pass filters, allowing frequencies in the image to be localised much better than with standard Wavelet decomposition. The goal of the proposed feature extraction process is to obtain a basis of Wavelet sub-bands that exhibit the highest discrimination power between text and non-text regions. This stage is realised by analysing the quadtree obtained by applying the Wavelet packet transform to a given image. In particular, the most discriminative nodes are selected among all the nodes $\{c_i\}_{i=1}^{|\Gamma|}$ in the quadtree $\Gamma$, where $|\Gamma| = \sum_{j=0}^{d-1} 2^{2j}$ is the total number of nodes in a quadtree of depth $d$. This process is based on ground truth segmentation data.
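As a quick check of the node count, the sum can be evaluated in closed form. The symbol $\Gamma$ for the quadtree is a notation chosen here for readability, and reading $d = 4$ as the root plus three decomposition levels is an interpretation, not something the text states explicitly.

```latex
|\Gamma| = \sum_{j=0}^{d-1} 2^{2j} = \frac{4^{d} - 1}{3},
\qquad d = 4:\quad 1 + 4 + 16 + 64 = 85 .
```

Under that reading, the value agrees with the 85 nodes reported for the three-level decomposition in Sect. 5.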

    Coefficient Extraction

    Given an image $f(x, y)$, the initial step consists in decomposing it using the Wavelet packet transform, so that the quadtree of Wavelet coefficients is obtained. An example of the decomposition is depicted in Fig. 8, where the coefficients of the nodes at each decomposition level are displayed as sub-images. By visually analysing the figure, it can be observed that some of the sub-images appear to be more discriminating between text and non-text areas.

    Fig. 8. DWPT decomposition of the image (a) at levels 1–2 (b–c). Each subimage in (b–c) is a different node of the DWPT tree

    To quantitatively evaluate the effectiveness of the node $c_i$ (associated with the matrix of Wavelet coefficients) in discriminating between text and non-text, the following procedure is performed. At first, the Wavelet coefficients $c_i$ are represented in terms of absolute values $|c_i|$, because the discrimination power does not depend on the coefficient signs. Then, the coefficients are divided into the sets $T_i$ (text coefficients) and $N_i$ (non-text coefficients), on the basis of the known ground truth segmentation of the image $f(x, y)$.

    For each set $T_i$ and $N_i$, the mean and variance values are calculated, denoted as $\mu_i^T$ and $\sigma_i^T$ for text and $\mu_i^N$ and $\sigma_i^N$ for non-text, respectively. After that, the discrimination power $F_i$ of the node $c_i$ is evaluated using the following optimality criterion, based on Fisher's criterion [96]:

    $$ F_i = \frac{(\mu_i^T - \mu_i^N)^2}{\sigma_i^T + \sigma_i^N}. \tag{26} $$

    To a certain extent, $F_i$ measures the signal-to-noise ratio in the text and non-text classes. The nodes with maximum inter-class distance and minimum intra-class variance have the highest discrimination power.
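The per-node criterion of Eq. (26) can be computed directly from a coefficient matrix and a ground truth mask. The sketch below assumes that the mask has already been resampled to the size of the node's coefficient matrix and that both classes contain at least one coefficient with non-zero total variance; the name `fisher_score` is hypothetical.

```python
import numpy as np

def fisher_score(coeffs, text_mask):
    """Discrimination power of one node, following Eq. (26).

    coeffs    : 2-D array of Wavelet packet coefficients of the node
    text_mask : boolean array of the same shape, True on ground-truth text
    """
    mag = np.abs(coeffs)                       # signs carry no discrimination power
    t, n = mag[text_mask], mag[~text_mask]
    return (t.mean() - n.mean()) ** 2 / (t.var() + n.var())

coeffs = np.array([[4.0, -6.0],
                   [1.0, -1.0]])
text_mask = np.array([[True, True],
                      [False, False]])
score = fisher_score(coeffs, text_mask)
```

In this toy case the text magnitudes are {4, 6} (mean 5, variance 1) and the non-text magnitudes are {1, 1} (mean 1, variance 0), so the score is 16 / 1 = 16.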

    The simplest approach to obtain the best set of nodes, denoted as $\Gamma^*$, is to select the smallest number of nodes which have the highest discrimination power. Then, a feature image $f^*(x, y)$ can be obtained from the selected nodes $\Gamma^*$. In particular, the Wavelet coefficients of the set $\Gamma^*$ are rescaled to the size of the image $f(x, y)$ and then added together:

    $$ f^*(x, y) = \sum_{c_i \in \Gamma^*} \tilde{c}_i(x, y), \tag{27} $$

    where $\tilde{c}_i(x, y)$ denotes the $|c_i|$ values rescaled to match the size of the original image $f(x, y)$. Even if this approach for obtaining $\Gamma^*$ is fast and simple, it is not an optimal technique to maximise the signal-to-noise ratio between the text and non-text classes. Moreover, the optimal number of nodes to be chosen for $\Gamma^*$ is unknown and must be selected manually.

    The problem of selecting the best nodes from all the nodes available is a combinatorial problem, producing an exponential explosion of possible solutions. We propose to solve this problem by employing a genetic algorithm [97, 98]. In particular, each node $c_i$ is associated with a binary weight $w_i \in \{0, 1\}$, so the tree $\Gamma$ is associated with a vector of weights $W = [w_1, \ldots, w_i, \ldots, w_{|\Gamma|}]$. Consequently, the subset of the best nodes is defined as $\Gamma^* = \{c_i : w_i = 1\}$.

    Given a weight vector $W$ of the nodes, the feature image $f^*$ is calculated as follows:

    $$ f^*(x, y, W) = \sum_{i=1}^{|\Gamma|} w_i \tilde{c}_i(x, y). \tag{28} $$

    The discrimination power $F^*$ of the subset $\Gamma^*$ can be computed by extending (26), evaluating the mean values $\mu^T$, $\mu^N$ and the deviation values $\sigma^T$, $\sigma^N$ of the values in the feature image $f^*$ corresponding to text regions ($T$ superscript) and non-text regions ($N$ superscript):

    $$ F^* = \frac{(\mu^T - \mu^N)^2}{\sigma^T + \sigma^N}. \tag{29} $$

    To find the optimal subset $\Gamma^*$ by means of (28), a genetic algorithm is applied in order to maximise the cost function $F^*$. Initially, a random population of $K$ weight vectors $\{W_i : i = 1, \ldots, K\}$, represented as binary strings, is created. Subsequently, for each weight vector the feature image is calculated and its cost function is evaluated using (29). The best individuals are subject to crossover and mutation operators in order to produce the next generation of weight vectors. Finally, the optimal subset $\Gamma^*$ is found from the best individuals in the evolved vector population.

    The feature image $f^*(x, y)$ is then obtained by merging the set of coefficients in the nodes $\Gamma^*$, as described in (27) or (28).
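A very small genetic algorithm for the node selection can be sketched as follows. It is only an illustration: truncation selection, one-point crossover, single-bit mutation, elitism and the small constant added to the denominator are all assumptions, since the text does not specify these details. The input `nodes` is assumed to hold the rescaled coefficient magnitudes as an array of shape (number of nodes, height, width), with at least two nodes, and the function names are hypothetical.

```python
import numpy as np

def feature_image(weights, nodes):
    """Eq. (28): weighted sum of the rescaled coefficient magnitudes."""
    return np.tensordot(weights, nodes, axes=1)

def fitness(weights, nodes, text_mask):
    """Eq. (29) evaluated on the feature image; 0 for an empty subset."""
    if not weights.any():
        return 0.0
    f = feature_image(weights, nodes)
    t, n = f[text_mask], f[~text_mask]
    return (t.mean() - n.mean()) ** 2 / (t.var() + n.var() + 1e-12)

def evolve(nodes, text_mask, pop_size=20, generations=50,
           p_cross=0.8, p_mut=0.2, seed=0):
    """Return (best weight vector, its fitness) after a fixed number of generations."""
    rng = np.random.default_rng(seed)
    n = len(nodes)
    pop = rng.integers(0, 2, size=(pop_size, n))
    for _ in range(generations):
        scores = np.array([fitness(w, nodes, text_mask) for w in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[:pop_size // 2]]         # truncation selection
        children = [parents[0].copy()]               # elitism: keep the best
        while len(children) != pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = a.copy()
            if p_cross > rng.random():               # one-point crossover
                cut = rng.integers(1, n)
                child[cut:] = b[cut:]
            if p_mut > rng.random():                 # flip one random bit
                i = rng.integers(n)
                child[i] ^= 1
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(w, nodes, text_mask) for w in pop])
    return pop[np.argmax(scores)], scores.max()
```

The default population size, number of generations and operator rates mirror the values reported in the experimental section; the selected subset is then the set of nodes whose weight equals 1 in the returned vector.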

    5 Experimental Results and Discussion

    To test the effectiveness of the presented methodology, we have employed a publicly available document image database [89]. In particular, the preliminary region-based approach we presented first has been tested on 306 graphic regions and 894 text regions, which have been extracted from the database and automatically labelled. The extracted feature vectors were divided into a training set composed of 900 samples and a testing set composed of the remaining 300 observations. The proportions between text and graphics regions were preserved in both datasets.

    A set of 12 fuzzy rules has been extracted from the training set by means of the unsupervised neuro-fuzzy learning procedure previously detailed. Subsequently, the rules have been refined using the gradient-descent technique of back-propagation. Table 1 reports the classification accuracy over the training and testing sets produced by the neuro-fuzzy system, both for the initial and the refined rule base. The classification results are satisfactory in terms of accuracy. However, the most common error is the misclassification of short text regions (one or two lines of text), as can also be observed in Fig. 9. The main reason for this is the insufficient regularity in the projection profiles of such regions. Nevertheless, the strong points of the proposed method are its ability to process skewed documents and its invariance to font shape and font size.

    Table 1. Overall classification accuracy of the document regions

    |                         | Number of rules | % classification (training set) | % classification (test set) |
    |-------------------------|-----------------|---------------------------------|-----------------------------|
    | Initial fuzzy rule base | 12              | 95.71                           | 93.53                       |
    | Refined fuzzy rule base | 12              | 95.80                           | 93.60                       |

    Fig. 9. Classification results obtained for two sample images. Dark regions have been classified as text, while light regions have been classified as graphics

    The second approach proposed has been tested using 40 images related to magazines and newspapers, drawn from the Oulu document image database. For the purpose of pixel classification, a three-level Gaussian pyramid was built from the original image. From the knowledge extraction process performed by the neuro-fuzzy system over a pre-compiled training set, a fuzzy rule base comprising 12 rules has been obtained. Table 2 reports the accuracy of the pixel classification process (considering both a training and a testing set); the classification results for an illustrative image from the database are presented in Fig. 10.

    Table 2. Pixel level classification accuracy

    | Data set | Text (%) | Graphics (%) | Background (%) |
    |----------|----------|--------------|----------------|
    | Training | 91.54    | 85.42        | 93.33          |
    | Testing  | 91.54    | 86.05        | 95.66          |

    The further application of the neuro-fuzzy system, during the high-level analysis, was performed over a pre-compiled training set including the feature vector information related to 150 regions. The obtained rule base comprises 10 fuzzy rules and its classification accuracy is reported in Table 3, considering both training and testing sets. The final segmentation results for the previously considered sample image are presented in Fig. 11.

    The accuracy of the method can be quantitatively measured using ground truth knowledge deriving from the correct segmentation of the 40 images employed. The effectiveness of the overall process is expressed by a measure of segmentation accuracy $P_a$, defined as:

    $$ P_a = \frac{\text{Number of correctly segmented pixels}}{\text{Number of all pixels}} \times 100\%. \tag{30} $$
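The accuracy measure of Eq. (30) amounts to a pixel-wise comparison of two label images. In the sketch below the label encoding (0 for background, 1 for text, 2 for graphics) and the function name are assumptions made for the example.

```python
import numpy as np

def segmentation_accuracy(predicted, ground_truth):
    """P_a of Eq. (30): percentage of pixels whose label matches the ground truth."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    return 100.0 * np.mean(predicted == ground_truth)

predicted = [[0, 1], [2, 2]]
ground_truth = [[0, 1], [2, 0]]
p_a = segmentation_accuracy(predicted, ground_truth)
```

Here three of the four pixels agree, so the measure evaluates to 75%.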

    Table 4 reports the mean values of segmentation accuracy obtained over the entire set of analysed images, distinguishing among the different methodology steps. The apparently poor results obtained at the end of the pixel classification step are due to the improper identification of text regions (only the pixels corresponding to the words are classified as text). The effectiveness of the initial stage of pixel classification is demonstrated by the rapid increase of the accuracy values achieved in the subsequent merging process.

    Fig. 10. Classification of the pixels of an input image (a) into text (b), graphics (c) and background (d) classes

    Table 3. Region level classification accuracy

    | Data set | Rectangular (%) | Non rectangular (%) |
    |----------|-----------------|---------------------|
    | Training | 97.43           | 92.85               |
    | Testing  | 94.11           | 93.93               |

    The quantitative measure of the segmentation accuracy allows for comparison with other existing techniques. As an example, we can compare the results illustrated in Table 4 with those reported in [17], where a polynomial spline Wavelet approach has been proposed and the same kind of measure has been employed to quantify the overall accuracy. In particular, the best results in [17] achieved an accuracy of 98.29%. Although our methodology produced slightly lower accuracy results, it should be observed that we analysed a total of 40 images, instead of the 6 images considered in [17]. Finally, it can be noted that our approach may be extended to colour documents using the HSV system [22]. In this case, the Gaussian pyramid could be evaluated for the H and S components and the edge information for the V component.

    Fig. 11. Final segmentation of a sample image into text (a), graphics (b) and background (c) regions

    Table 4. Overall segmentation accuracy expressed in terms of $P_a$. PC and MO stand for Pixel Classification and Morphological Operation, respectively

    |         | Text (%) | Graphics (%) | Background (%) | Image (%) |
    |---------|----------|--------------|----------------|-----------|
    | PC      | 59.92    | 88.32        | 52.93          | 50.59     |
    | PC + MO | 96.65    | 90.63        | 93.26          | 90.27     |
    | Final   | 98.19    | 96.36        | 97.99          | 97.51     |

    The texture-based approach presented last has been tested on 40 images extracted from the Oulu database: in order to obtain the feature images, each image has been decomposed by Daubechies db2 Wavelet functions [59] into three levels of coefficients. One of these document images has been manually segmented, to create ground truth segmentation data. The best nodes have been selected by means of a basic genetic algorithm [97, 98] with an initial population of 20 weight vectors. New generations of the vector population have been produced by crossover (80%) and mutation (20%) operators. After 50 generations, the best subset of nodes has been obtained, containing 39 out of all 85 nodes. Additionally, it should be noted that more than one image can be combined into one larger image for the purpose of the node selection.

    Using the selected nodes, the feature images $f^*(x, y)$ have been evaluated for each considered image. Then, we applied the Fuzzy C-Means algorithm [74] to each image $f^*(x, y)$, in order to group its pixels into two clusters, corresponding to text and non-text regions. The final segmented image has been obtained by replacing each pixel of $f^*(x, y)$ with its cluster label. As the clustering is not performed in the image space but in the feature space, additional post-processing is necessary to refine the segmentation. In particular, a median filter is applied to remove small noisy regions, while preserving the edges of larger regions. Subsequently, a morphological closing is applied to the filtered image, in order to merge nearby text regions (i.e. letters, words, text lines) into larger ones (i.e. paragraphs, columns). Figure 12 shows an example of a feature image, obtained from a document page, and its final segmentation.
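The clustering step can be illustrated with a plain Fuzzy C-Means on scalar pixel values. This is a minimal sketch rather than the configuration used in the experiments: the fuzzifier m = 2, the fixed number of iterations and the deterministic initialisation of the centres at the minimum and maximum feature values are assumptions, the names are hypothetical, and the median filtering and morphological closing are not included.

```python
import numpy as np

def fuzzy_c_means(values, n_clusters=2, m=2.0, n_iter=50):
    """Plain Fuzzy C-Means on scalar features; returns (centres, memberships)."""
    x = np.asarray(values, dtype=float).ravel()
    centres = np.linspace(x.min(), x.max(), n_clusters)    # deterministic start
    for _ in range(n_iter):
        dist = np.abs(x[None, :] - centres[:, None]) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)              # memberships sum to 1 for each pixel
        centres = ((u ** m) @ x) / (u ** m).sum(axis=1)
    return centres, u

def segment(feature_img):
    """Replace each pixel with the label of its highest-membership cluster."""
    _, u = fuzzy_c_means(feature_img)
    return u.argmax(axis=0).reshape(np.shape(feature_img))

# Toy feature image: low responses on the left, high responses on the right.
feature_img = np.array([[0.10, 0.12, 0.90, 0.88]] * 4)
labels = segment(feature_img)
```

With this initialisation, cluster 0 starts at the lowest feature value and cluster 1 at the highest, so in the toy image the left half is labelled 0 and the right half 1.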

    The segmentation accuracy has been evaluated by the measure $P_a$ previously described. For this purpose, the ground truth segmentation of each image has been obtained automatically, according to the additional information in the database. Moreover, to test the robustness of the method against page skew, some of the images have been randomly rotated. The obtained segmentation accuracy has an average value of 92.63%, with a highest value of 97.18% and a lowest value of 84.37%. Some results are shown in Fig. 13. The results are comparable with other state-of-the-art document image segmentation techniques. Once again, we report as an example that the best result obtained in [17] is 98.29% (over only 6 images considered). In any case, the approach proves to be robust against page skew and provides good results when dealing with images presenting different font sizes and styles.

    Fig. 12. Document image (a), its corresponding feature image (b) and segmentation result (c)

    Fig. 13. Segmentation results. Segmentation of the document image (a), invariance to page skew (b) and invariance to font changes (c)

    6 Conclusions

    Text information represents a very important component among the contents of a digital image. The importance of extracting text information by means of image analysis is straightforward. In fact, text can be variously used to describe the content of a document image, and it can be converted into electronic format (for storage and archiving purposes). In particular, different steps can be isolated, corresponding to the sequential sub-problems which characterise the overall text information extraction task. In this chapter, we addressed the specific problem of text localisation.

    The peculiarity of the present work consists in discussing text localisation methods based on the employment of fuzzy techniques. When dealing with text localisation, we are particularly concerned with the problem of digital image segmentation, and the adoption of the fuzzy paradigm is desirable in such a research field. This is due to the uncertainty and imprecision present in images, deriving from noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows the knowledge about the given problem to be represented in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information.

    After reviewing a number of classical image segmentation methods, we provided a presentation of fuzzy techniques which commonly find application in the context of digital image processing. In particular, we showed the benefits coming from the fruitful integration of fuzzy logic and neural computation and we introduced a particular model for a neuro-fuzzy system. By doing so, we indicated a way to combine Computational Intelligence methods and document image analysis. A number of our research works have been illustrated as examples of applications of fuzzy and neuro-fuzzy techniques for text localisation in images.

    The presentation of the research works is intended to focus the interest of the reader on the possibilities of these innovative methods, which are by no means exhausted with the hints provided in this chapter. In fact, a number of future research lines can be addressed, ranging from the analysis of different image features (such as colour), to the direct application of Computational Intelligence mechanisms to deal with the large amount of web image contents.


    References

    1. Colombo C, Del Bimbo A, Pala P (1999) IEEE Multimedia 6(3):38–53
    2. Long F, Zhang H, Feng D (2003) Fundamentals of content-based image retrieval, in: Feng D, Siu WC, Zhang HJ (eds.) Multimedia information retrieval and management - technological fundamentals and applications. Springer, Berlin Heidelberg New York
    3. Yang M, Kriegman D, Ahuja N (2002) IEEE Trans Pattern Anal Mach Intell 24(1):34–58
    4. Dingli A, Ciravegna F, Wilks Y (2003) Automatic semantic annotation using unsupervised information extraction and integration, in: Proceedings of SemAnnot workshop
    5. Djioua B, Flores JG, Blais A, Descles JP, Guibert G, Jackiewicz A, Priol FL, Nait-Baha L, Sauzay B (2006) EXCOM: An automatic annotation Engine for semantic information, in: Proceedings of FLAIRS conference, pp. 285–290
    6. Orasan C (2005) Automatic annotation of corpora for text summarisation: A comparative study, in: Computational linguistics and intelligent text processing, volume 3406/2005, Springer, Berlin Heidelberg New York
    7. Karatzas D, Antonacopoulos A (2003) Two Approaches for Text Segmentation in Web Images, in: Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR2003), IEEE Computer Society Press, Cambridge, UK pp. 131–136
    8. Jung K, Kim K, Jain A (2004) Pattern Recognit 37:977–997
    9. Chen D, Odobez J, Bourlard H (2002) Text segmentation and recognition in complex background based on Markov random field, in: Proceedings of International Conference on Pattern Recognition, pp. 227–230
    10. Li H, Doermann D, Kia O (2000) IEEE Trans Image Process 9(1):147–156
    11. Li H, Doermann D (2000) Superresolution-based enhancement of text in digital video, in: Proceedings of International Conference of Pattern Recognition, pp. 847–850
    12. Li H, Kia O, Doermann D (1999) Text enhancement in digital video, in: Proceedings of SPIE, Document Recognition IV, pp. 1–8
    13. Sato T, Kanade T, Hughes E, Smith M (1998) Video OCR for digital news archive, in: Proceedings of IEEE Workshop on Content based Access of Image and Video Databases, pp. 52–60
    14. Zhou J, Lopresti D, Lei Z (1997) OCR for world wide web images, in: Proceedings of SPIE on Document Recognition IV, pp. 58–66


    15. Zhou J, Lopresti D, Tasdizen T (1998) Finding text in color images, in: Proceedings of SPIE on Document Recognition V, pp. 130–140
    16. Ching-Yu Y, Tsai WH (2000) Signal Process.: Image Commun. 15(9):781–797
    17. Deng S, Latifi S, Regentova E (2001) Document segmentation using polynomial spline wavelets, Pattern Recognition 34:2533–2545
    18. Lu Y, Shridhar M (1996) Character segmentation in handwritten words, Pattern Recognit 29(1):77–96
    19. Mital D, Leng GW (1995) J Microcomput Appl 18(4):375–392
    20. Rossant F (2002) Pattern Recognit Lett 23(10):1129–1141
    21. Xiao Y, Yan H (2003) Text extraction in document images based on Delaunay triangulation, Pattern Recognition 36(3):799–809
    22. Pratt W (2001) Digital image processing (3rd edition). Wiley, New York, NY
    23. Haralick R (1979) Proc IEEE 67:786–804
    24. Haralick R, Shanmugam K, Dinstein I (1973) Textural features for image classification, IEEE Trans Syst Man Cybern 3:610–621
    25. Baird H, Jones S, Fortune S (1990) Image segmentation by shape-directed covers, in: Proceedings of International Conference on Pattern Recognition, pp. 820–825
    26. Nagy G, Seth S, Viswanathan M (1992) Method of searching and extracting text information from drawings, Computer 25:10–22
    27. O'Gorman L (1993) IEEE Trans Pattern Anal Mach Intell 15:1162–1173
    28. Kose K, Sato A, Iwata M (1998) Comput Vis Image Underst 70:370–382
    29. Wahl F, Wong K, Casey R (1982) Graph Models Image Process 20:375–390
    30. Jain A, Yu B (1998) IEEE Trans Pattern Anal Mach Intell 20:294–308
    31. Pavlidis T, Zhou J (1992) Graph Models Image Process 54:484–496
    32. Hadjar K, Hitz O, Ingold R (2001) Newspaper Page Decomposition Using a Split and Merge Approach, in: Proceedings of Sixth International Conference on Document Analysis and Recognition
    33. Jiming L, Tang Y, Suen C (1997) Pattern Recognit 30(8):1265–1278
    34. Rosenfeld A, la Torre PD (1983) IEEE Trans Syst Man Cybern SMC-13:231–235
    35. Sahasrabudhe S, Gupta K (1992) Comput Vis Image Underst 56:55–65
    36. Sezan M (1985) Graph Models Image Process 29:47–59
    37. Yanni M, Horne E (1994) A new approach to dynamic thresholding, in: Proceedings of EUSIPCO'94: 9th European Conference on Signal Processing 1, pp. 34–44
    38. Sezgin M, Sankur B (2004) J Electron Imaging 13(1):146–165
    39. Kamel M, Zhao A (1993) Graph Models Image Process 55(3):203–217
    40. Solihin Y, Leedham C (1999) Integral ratio: A new class of global thresholding techniques for handwriting images, in: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-21, pp. 761–768
    41. Trier O, Jain A (1995) Goal-directed evaluation of binarization methods, in: IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-17, pp. 1191–1201
    42. Bow ST (2002) Pattern Recognition and Image Preprocessing 2nd edition. Dekker, New York, NY
    43. Jung K, Han J (2004) Pattern Recognit Lett 25(6):679–699
    44. Ohya J, Shio A, Akamatsu S (1994) IEEE Trans Pattern Anal Mach Intell 16(2):214–224
    45. Wu S, Amin A (2003) Proceedings of Seventh international conference on Document Analysis and Recognition, volume 1, pp. 493–497

  • Fuzzy Techniques for Text Localisation in Images 269

    46. Canny J (1986) IEEE Trans Pattern Anal Mach Intell 8(6):67969847. Chen D, Shearer K, Bourlard H (2001) Text enhancement with asymmetric lter

    for video OCR, in: Proceedings of International Conference on Image Analysisand Processing, pp. 192197

    48. Hasan Y, Karam L (2000) IEEE Trans Image Process 9(11):1978198349. Lee SW, Lee DJ, Park HS (1996) IEEE Trans Pattern Recogn Mach Intell

    18(10):1045105050. Grigorescu SE, Petkov N, Kruizinga P (2002) IEEE Trans Image Process

    11(10):1160116751. Livens S, Scheunders P, van de Wouwer G, Van Dyck D (1997) Wavelets for tex-

    ture analysis, an overview, in: Proceedings of the Sixth International Conferenceon Image Processing and Its Applications, pp. 581585

    52. Tuceryan M, Jain AK (1998) Texture analysis, in: Chen CH, Pau LF, WangPSP (eds.) The Handbook of Pattern Recognition and Computer Vision 2ndedition, World Scientic Publishing, River Edge, NJ pp. 207248

    53. Jain A, Bhattacharjee S (1992) Mach Vision Appl 5:16918454. Acharyya M, Kundu M (2002) IEEE Trans Circ Syst video Technol 12(12):

    1117112755. Etemad K, Doermann D, Chellappa R (1997) IEEE Trans Pattern Anal Mach

    Intell 19(1):929656. Mao W, Chung F, Lanm K, Siu W (2002) Hybrid Chinese/English text detection

    in images and video frames, in: Proceedings of International Conference onPattern recognition, volume 3, pp. 10151018

    57. Coifman R, Wickerhauser V (1992) IEEE Trans Inf Theory 38(2):71371858. Coifman RR (1990) Wavelet Analysis and Signal Processing, in: Auslander L,

    Kailath T, Mitter SK (eds.) Signal Processing, Part I: Signal Processing Theory,Springer, Berlin Heidelberg New York, pp. 5968, URL {citeseer.is-t}.psu.edu/coifman92wavelet.html

    59. Daubechies I (1992) Ten Lectures on Wavelets (CBMS - NSF Regional Confer-ence Series in Applied Mathematics), Soc for Industrial & Applied Math

    60. Bruce A, Gao HY (1996) Applied Wavelet Analysis with S-Plus, Springer, BerlinHeidelberg New York

    61. Mallat SG (1989) IEEE Trans Pattern Anal Mach Intell 11(7):67469362. Engelbrecht A (2003) Computational Intelligence: An Introduction, WileyNew

    York, NY63. Sincak P, Vascak J (eds.) (2000) Quo vadis computational intelligence?, Physica-

    Verlag64. Zadeh L (1965) Inform Control 8:33835365. Klir G, Yuan B (eds.) (1996) Fuzzy sets, fuzzy logic, and fuzzy systems: selected

    papers by Lot A. Zadeh, World Scientic Publishing, River Edge, NJ66. Pham T, Chen G (eds.) (2000) Introduction to Fuzzy Sets, Fuzzy Logic, and

    Fuzzy Control Systems, CRC , Boca Raton, FL67. Jawahar C, Ray A (1996) IEEE Signal Process Lett 3(8):22522768. Jin Y (2003) Advanced Fuzzy Systems Design and Applications, Physica/

    Springer, Heidelberg69. Mamdani E, Assilian S (1975) Int J Man-Mach Studies 7(1):11370. Sugeno M, Kang G (1988) Structure identication of fuzzy model, Fuzzy Sets

    Syst 28:153371. Dubois D, Prade H (1996) Fuzzy Sets Syst 84:169185

  • 270 P. Gorecki et al.

    72. Leekwijck W, Kerre E (1999) Fuzzy Sets Syst 108(2):15917873. Dunn J (1974) J Cybern 3:325774. Bezdek J (1981) Pattern Recognition with Fuzzy Objective Function Algorithms

    (Advanced Applications in Pattern Recognition), Springer, Berlin HeidelbergNew York URL http://www.amazon.co.uk/exec/obidos/ASIN/0306406713/citeulike-21

    75. Macqueen J (1967) Some methods of classication and analysis of multivariateobservations, in: Proceedings of the Fifth Berkeley Symposium on MathemticalStatistics and Probability, pp. 281297

    76. Pham D (2001) Comput Vis Image Underst 84:28529777. Bezdek J, Hall L, Clarke L (1993) Med Phys 20:1033104878. Rignot E, Chellappa R, Dubois P (1992) IEEE Trans Geosci Remote Sensing

    30(4):69770579. Jang JS, Sun C (1995) Proc of the IEEE 83:37840680. Kosko B (1991) Neural networks and fuzzy systems: a dynamical systems ap-

    proach to machinhe intelligence, Prentice Hall, Englewood Clis, NJ81. Lin C, Lee C (1996) Neural fuzzy systems: a neural fuzzy synergism to intelligent

    systems, Prentice-Hall, Englewood Clis, NJ82. Mitra S, Hayashi Y (2000) IEEE Trans Neural Netw 11(3):74876883. Nauck D (1997) Neuro-Fuzzy Systems: Review and Prospects, in: Proc. Fifth

    European Congress on Intelligent Techniques and Soft Computing (EUFIT97),pp. 10441053

    84. Fuller R (2000) Introduction to Neuro-Fuzzy Systems, Springer, Berlin Heidel-berg New York

    85. Castellano G, Castiello C, Fanelli A, Mencar C (2005) Fuzzy Sets Syst149(1):187207

    86. Castiello C, Gorecki P, Caponetti L (2005) Neuro-Fuzzy Analysis of Docu-ment Images by the KERNEL System, Lecture Notes in Articial Intelligence3849:369374

    87. Caponetti L, Castiello C, Gorecki P (2007) Document Page Segmentation usingNeuro-Fuzzy Approach, to appear in Applied Soft Computing Journal

    88. Gorecki P, Caponetti L, Castiello C (2006) Multiscale Page Segmentation us-ing Wavelet Packet Analysis, in: Abstracts of VII Congress Italian Society forApplied and Industrial Mathematics (SIMAI 2006), p. 210

    89. of Oulu Finland U, Document Image Database, http://www.ee.oulu./research/imag/document/

    90. Hinds S, Fisher J, DAmato D (1990) A document skew detection method usingrun-length encoding and Hough transform, in: Proc. of the 10th Int. Conferenceon Pattern Recognition (ICPR), pp. 464468

    91. Hough P (1959) Machine Analysis of Bubble Chamber Pictures, in: InternationalConference on High Energy Accelerators and Instrumentation, CERN

    92. Srihari S, Govindaraju V (1989) Mach Vision Appl 2:14115393. Gonzalez R, Woods R (2007) Digital Image Processing 3rd edition, Prentice

    Hall94. Lindeberg T (1994) Scale-space theory in computer vision, Kluwer, Boston95. Watt A, Policarpo F (1998) The Computer Image, ACM, Addison-Wesley96. Sammon J (1970) IEEE Trans Comput C-19:82682997. Holland J (1992) Adaptation in Natural and Articial Systems reprint edition,

    MIT, Cambridge, MA,98. Mitchell M (1996) An Introduction to Genetic Algorithms, MIT, iSBN:0-262-


  • Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine

    Kui Wu and Kim-Hui Yap

    School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798

    Summary. In relevance feedback for content-based image retrieval (CBIR) systems, the number of training samples is usually small, since image labeling is a time-consuming task and users are often unwilling to label too many images during the feedback process. This results in the small sample problem, where the performance of relevance feedback is constrained by the small number of training samples. In view of this, we propose a soft-labeling technique that investigates the use of unlabeled data in order to enlarge the training data set. The contribution of this book chapter is the development of a soft-labeling framework that strives to address the small sample problem in CBIR systems. By studying the characteristics of labeled images, we propose to utilize an unsupervised clustering algorithm to select unlabeled images, which we call soft-labeled images. The relevance of the soft-labeled images is estimated using a fuzzy membership function, and integrated into the fuzzy support vector machine (FSVM) for effective learning. Experimental results based on a database of 10,000 images demonstrate the effectiveness of the proposed method.

    1 Introduction

    1.1 Background

    The recent explosion in the volume of image data has driven the demand for efficient techniques to index and access image collections. Applications include online image libraries, e-commerce, biomedicine, military, and education, among others. Content-based image retrieval (CBIR) has been developed as a scheme for managing, searching, filtering, and retrieving image collections. CBIR is the process of retrieving a set of desired images from a database on the basis of visual contents such as color, texture, shape, and spatial relationships that are present in the images. Traditional text-based image retrieval uses keywords to annotate images. This involves a significant amount of human labor in the manual annotation of large-scale image databases. In view of this, CBIR is proposed as an alternative to text-based image retrieval. Many research and commercial CBIR systems have been developed, such as QBIC [6], MARS [19], Virage [1], Photobook [18], VisualSEEk [23], PicToSeek [7], and PicHunter [5].

    K. Wu and K.-H. Yap: Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine, Studies in Computational Intelligence (SCI) 96, 271–290 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    One of the most challenging problems in building a successful image retrieval system lies in bridging the semantic gap. CBIR systems interpret the user's information needs based on a set of low-level visual features (color, texture, shape) extracted from the images. However, these features may not correspond to the user's interpretation and understanding of image contents. Thus, a semantic gap exists between the high-level concepts and the low-level features in CBIR. In view of this, relevance feedback has been introduced to address these problems [2, 5, 7, 8, 10, 12, 13, 17, 19–21, 25, 27, 29–34]. The main idea is that the user is incorporated into the retrieval system to provide his/her evaluation of the retrieval results. This enables the system to learn from the feedback in order to retrieve a new set of images that better satisfies the user's information requirement. Many relevance feedback algorithms have been adopted in CBIR systems and have demonstrated considerable performance improvement [2, 5, 7, 8, 10, 12, 13, 17, 19–21, 25, 27, 29–34]. Some well-known methods include query refinement [19], feature re-weighting [10, 20], statistical learning [5, 25, 29], neural networks [12, 13, 17, 33, 34], and support vector machine (SVM) [2, 8, 27, 30, 31].

    Query refinement and feature re-weighting are two widely used relevance feedback methods in CBIR. Query refinement tries to reach an optimal query point by moving it towards relevant images and away from the irrelevant ones. This technique has been implemented in many CBIR systems. The best-known implementation is the multimedia analysis and retrieval system (MARS) [19]. The re-weighting technique updates the weights of the feature vectors so as to emphasize the feature components that help to retrieve relevant images, while de-emphasizing those that hinder this process. It uses a heuristic formulation to adjust the weight parameters empirically. Statistical learning has been developed by modeling the probability distribution of images in the database [5, 29]. A Bayesian classifier has been proposed that treats positive and negative feedback samples with different strategies [25]. Positive examples are used to estimate a Gaussian distribution that represents the desired images for a given query, while the negative examples are used to modify the ranking of the retrieved candidates. Neural networks have been adopted in interactive image retrieval in view of their learning capability and generalization power [12, 13, 17, 33, 34]. A fuzzy radial basis function network (FRBFN) has been proposed to learn the user's fuzzy perception of visual contents using fuzzy relevance feedback [33, 34]. It provides a natural way to model the user's interpretation of image similarity. Another popular relevance feedback method in CBIR is centered on SVM [2, 8, 27, 30, 31]. SVM is a powerful learning machine. It finds an optimal separating hyperplane that maximizes the margin between two classes in a kernel-induced feature space. SVM-based active learning has been proposed to carefully select the samples shown to the users for labeling, in order to achieve maximal information gain in decision-making [27]. It chooses the unseen images that are closest to the SVM decision hyperplane as the most informative images for feedback.

    1.2 Related Work

    Despite previous work on relevance feedback for CBIR systems, it is still a challenging task to develop effective and efficient interactive mechanisms that yield satisfactory retrieval performance. One key difficulty associated with relevance feedback is the lack of sufficient labeled images, since users usually do not have the patience to label a large number of images. Therefore, the performance of relevance feedback methods is often constrained by the limited number of training samples. To deal with this problem, some work has been done to incorporate unlabeled data to improve the learning performance. The Discriminant Expectation Maximization (D-EM) algorithm has been introduced to incorporate the unlabeled samples to estimate the underlying probability distribution [32]. The results are promising, but the computational complexity can be significant for large databases. The transductive support vector machine (TSVM) for text classification has been proposed to tackle the problem by incorporating the unlabeled data [11]. It has also been applied to image retrieval [30]. The method proposes to incorporate unlabeled images to train an initial SVM, followed by standard active learning. It is, however, observed that the performance of this method may be unstable in some cases. Incorporating prior knowledge into the SVM has also been introduced to resolve the small sample problem [31]. All these methods show some promising outcomes; however, few can learn from the labeled and unlabeled data effectively.

    To address the small sample problem faced by current relevance feedback methods, we develop a soft-labeling framework in this chapter that integrates the advantages of soft-labeling and the fuzzy support vector machine (FSVM). It exploits inexpensive unlabeled data to augment the small set of labeled data, and hence potentially improves the retrieval performance. This is in contrast to most existing relevance feedback approaches in CBIR systems, which are concerned with the use of labeled data only. The useful unlabeled images are identified by exploiting the characteristics of the labeled images. Different soft-labels of "relevant" or "irrelevant" are then automatically propagated to the selected unlabeled images by a label propagation process. As these images are not labeled explicitly by the users, there is a potential imprecision embedded in their class information. In view of this, a fuzzy membership function is employed to estimate the class membership of the soft-labeled images. The fuzzy information is then integrated into the FSVM for active learning.

    The rest of this chapter is organized as follows. Section 2 presents an overview of the proposed soft-labeling framework. In Sect. 3, we describe FSVM and discuss the soft-label estimation scheme in detail. We further explore the fuzzy membership function, which is developed to determine the implicit class membership of the soft-labeled images. Experimental results using the proposed method are discussed in Sect. 4. Finally, concluding remarks are given in Sect. 5.

    2 Overview of the Proposed Soft-Labeling Framework

    2.1 Overview of the System

    The proposed soft-labeling framework is a unified framework that incorporates soft-labeling into FSVM in the context of CBIR. The general overview of the framework is summarized in Fig. 1.

    The main processing of the system involves offline and online stages. Offline processing includes feature extraction, representation, and organization. Online processing is the interaction between the user and the system. The user first submits his/her query to the system through query-by-example (QBE). The system performs a K-nearest neighbor (K-NN) search using the Euclidean distance for similarity matching. The top $l_0$ most similar images are shown to the user for feedback. The user labels each of the $l_0$ images as either relevant or irrelevant. Based on the $l_0$ labeled images, an initial SVM classifier is trained. SVM active learning is then employed by selecting the $l$ unlabeled images that are closest to the current SVM decision boundary for the user to label. The $l$ labeled images are then added to the previously labeled training set. Next, a two-stage clustering is performed separately on the labeled relevant and irrelevant images. The formed clusters are used for unlabeled image selection and soft-label assignment. A fuzzy membership function is further developed to estimate the class membership of the soft-labeled images. An FSVM is then trained by emphasizing the labeled images over the soft-labeled images during training. A new ranked list of images that better approximates the user's preferences is obtained and presented to the user. If the user is unsatisfied with the retrieval results, SVM active learning is utilized to present another set of $l$ unlabeled images that are the most informative for the user to label. This feedback process repeats until the user is satisfied with the retrieval results.
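    The sample-selection step of SVM active learning described above can be sketched as follows. This is a minimal illustration, assuming the signed decision-function outputs of the unlabeled images are already available from some trained SVM; the function name `select_most_informative` and the toy scores are hypothetical and not part of the original system.

```python
def select_most_informative(decision_values, l):
    """Return indices of the l unlabeled images closest to the SVM boundary.

    decision_values: signed decision-function outputs f(x), one per unlabeled
    image (assumed to come from an already trained SVM).
    """
    # Rank by absolute decision value: |f(x)| near 0 means near the boundary.
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:l]


# Hypothetical decision values for six unlabeled images.
scores = [1.8, -0.1, 0.4, -2.3, 0.05, -0.7]
chosen = select_most_informative(scores, l=3)
print(chosen)
```

    Images with a large |f(x)| are already classified with confidence, so asking the user about them would add little information; this is the intuition behind ranking by distance to the boundary.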

    2.2 Feature Extraction

    Feature extraction and representation is a fundamental process in CBIR systems. Color, texture, and shape are the most frequently used visual features in current CBIR systems. Each feature may have several representations. No single best representation exists for a given feature due to human perceptual subjectivity. Different representations characterize different aspects of the feature. The low-level features selected when designing a CBIR system should satisfy the following criteria: perceptual similarity, efficiency, stability, and scalability. Based on this guideline and a literature survey on different features, color and texture are employed in this work. Color histogram [26], color moments [16], and color auto-correlogram [10] are chosen as the color feature representations, while wavelet moments [24] and Gabor wavelet [15] are selected as the texture feature representations.

    [Flowchart: k-NN search returning the top $l_0$ most similar images; user feedback on the $l_0$ images and training of an initial SVM classifier; SVM active learning to select $l$ unlabeled images; user feedback on the $l$ images, which are added to the labeled training set; two-stage clustering for unlabeled image selection and soft-label assignment; evaluation of the soft relevance membership of the soft-labeled images; FSVM training on the hybrid of labeled and soft-labeled images; retrieval of new relevant images based on the trained FSVM; check of the termination criteria.]

    Fig. 1. General overview of the proposed soft-labeling framework

  • 276 K. Wu and K.-H. Yap

    The color histogram, which represents the first-order color distribution in an image, is one of the most widely used color descriptors. It is easy to compute and invariant to rotation, translation, and viewing axis. We implement the color histogram by first converting the RGB representation of each image into its HSV equivalent. Then, the H, S, and V components are uniformly quantized into 8, 2, and 2 bins, respectively, to give a 32-dimensional feature vector.
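    The quantization described above can be sketched as follows. This is only an illustrative sketch: the function name `hsv_histogram`, the bin-index layout, and the normalization by pixel count are assumptions, and the pixel list is toy data.

```python
import colorsys


def hsv_histogram(pixels, bins=(8, 2, 2)):
    """Normalized HSV histogram of an iterable of (r, g, b) pixels in 0..255."""
    hb, sb, vb = bins
    hist = [0.0] * (hb * sb * vb)
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Uniform quantization; min() keeps the value 1.0 inside the last bin.
        hi = min(int(h * hb), hb - 1)
        si = min(int(s * sb), sb - 1)
        vi = min(int(v * vb), vb - 1)
        hist[(hi * sb + si) * vb + vi] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist


# Toy image: two red pixels, one blue pixel, one mid-grey pixel.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (128, 128, 128)]
hist = hsv_histogram(pixels)
```

    With 8 x 2 x 2 bins the result always has 32 entries, matching the dimensionality stated in the text.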

    Color moments have been proposed to overcome the quantization effects of the color histogram. They characterize the color distribution of an image by its moments (mean, variance, and skewness). In this study, the first two moments (mean and variance) of each of the R, G, and B color channels are extracted as the color feature representation to form a six-dimensional feature vector.
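    A minimal sketch of this six-dimensional descriptor is given below. The function name `color_moments`, the ordering of the output (mean and variance per channel, in R, G, B order), and the use of the population variance are assumptions made for illustration.

```python
def color_moments(pixels):
    """Mean and variance of each of the R, G, B channels -> 6 values."""
    n = len(pixels)
    features = []
    for c in range(3):
        values = [p[c] for p in pixels]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # population variance
        features.extend([mean, var])
    return features


# Toy image with two pixels.
feat = color_moments([(0, 10, 20), (2, 10, 40)])
```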

    The color auto-correlogram is a two-dimensional spatial extension of the color histogram. The color histogram does not provide any spatial information; therefore, images with similar histograms may have different appearances. The color correlogram integrates spatial information with the color histogram by constructing a color co-occurrence matrix indexed by color pairs and distance, with each entry (i, j) representing the probability of finding a pixel of color j at a distance k from a pixel of color i. The storage requirement for a co-occurrence matrix is significant; therefore, only its main diagonal is computed and stored, which is known as the color auto-correlogram. The auto-correlogram of the image $I$ for color $C_i$ is given as:

    $$\gamma^{(k)}_{C_i}(I) = \Pr\left[\, |p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \right], \tag{1}$$

    where $p_1$ is a pixel of color $C_i$ in the image $I$, and $p_2$ is another pixel of the same color $C_i$ at a distance of $k$ from $p_1$. The $D_8$ distance (chessboard distance) is chosen as the distance measure: $D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$, which is the greater of the distances in the x- and y-directions.
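    The auto-correlogram with the chessboard distance can be estimated by counting, as sketched below. This is an illustrative sketch only: the function name `auto_correlogram` is hypothetical, the image is assumed to be already quantized into color indices, and ignoring ring positions that fall outside the image is one possible boundary convention rather than the authors' stated choice.

```python
def auto_correlogram(image, num_colors, k):
    """image: 2D list of quantized color indices. Returns one value per color."""
    rows, cols = len(image), len(image[0])
    same = [0] * num_colors    # ring neighbors sharing the center's color
    total = [0] * num_colors   # ring neighbors that fall inside the image
    for y in range(rows):
        for x in range(cols):
            c = image[y][x]
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    if max(abs(dx), abs(dy)) != k:   # keep D8 distance == k
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols:
                        total[c] += 1
                        same[c] += image[ny][nx] == c
    return [s / t if t else 0.0 for s, t in zip(same, total)]


# Toy 3x3 image with two quantized colors.
img = [[0, 0, 1],
       [0, 0, 1],
       [1, 1, 1]]
ac = auto_correlogram(img, num_colors=2, k=1)
```

    Each returned entry is the fraction of pixel pairs at chessboard distance k that share the color of the first pixel, which is an empirical estimate of the conditional probability in the definition.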

    Wavelet moments describe the global texture properties of images using the energy of the discrete wavelet transform subbands. They form a simple wavelet-transform feature of an image, constructed using the mean and standard deviation of the energy distribution at each decomposition level. This in turn corresponds to the distribution of edges in the horizontal, vertical, and diagonal directions at different resolutions. In this study, we employ the Daubechies wavelet transform with a three-level decomposition. The mean and standard deviation of the transform coefficients are used to compose a 20-dimensional feature vector.

    The Gabor wavelet is widely adopted to extract texture features, and has been shown to be very efficient. Basically, Gabor filters are a group of wavelets, with each wavelet capturing energy at a specific frequency and in a specific direction. Expanding a signal using this basis provides a localized frequency description, therefore capturing local features/energy of the signal. A 2D Gabor function $g(x, y)$ is defined as:

    $$g(x, y) = \left(\frac{1}{2\pi\sigma_x\sigma_y}\right) \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j W x\right]. \tag{2}$$

    The self-similar functions are obtained by appropriate dilations and rotations of $g(x, y)$ through the generating function:

    $$g_{mn}(x, y) = a^{-m} g(x', y'), \quad x' = a^{-m}(x\cos\theta_n + y\sin\theta_n), \quad y' = a^{-m}(-x\sin\theta_n + y\cos\theta_n), \tag{3}$$

    where $a > 1$, $m$ and $n$ specify the scale and orientation of the wavelet, respectively, and $W$ is the modulation frequency. The half-peak radial bandwidth is chosen to be one octave, which determines $\sigma_x$ and $\sigma_y$. In this study, Gabor wavelet filters spanning four scales (0.05, 0.1, 0.2, and 0.4) and six orientations ($\theta_0 = 0$, $\theta_{n+1} = \theta_n + \pi/6$) are used. For a given image $I(x, y)$, its Gabor wavelet transform is defined by:

    $$W_{mn}(x, y) = \int I(x_1, y_1)\, g_{mn}^{*}(x - x_1, y - y_1)\, dx_1\, dy_1, \tag{4}$$

    where $*$ denotes complex conjugation. The mean and standard deviation of the transform coefficient magnitudes are used to form a 48-dimensional feature vector.
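    The Gabor function and its dilated, rotated versions can be evaluated pointwise as sketched below. This sketch assumes the standard Gaussian-modulated complex-sinusoid form of the 2D Gabor function; the function names `gabor` and `gabor_mn` and all numeric parameter values are hypothetical and chosen only for illustration.

```python
import cmath
import math


def gabor(x, y, sigma_x, sigma_y, W):
    """Mother Gabor function: Gaussian envelope times a complex sinusoid."""
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y)
    envelope = -0.5 * (x * x / sigma_x ** 2 + y * y / sigma_y ** 2)
    return norm * cmath.exp(envelope + 2j * math.pi * W * x)


def gabor_mn(x, y, m, theta, a, sigma_x, sigma_y, W):
    """Wavelet at scale index m and orientation theta (dilate, then rotate)."""
    scale = a ** (-m)
    xr = scale * (x * math.cos(theta) + y * math.sin(theta))
    yr = scale * (-x * math.sin(theta) + y * math.cos(theta))
    return scale * gabor(xr, yr, sigma_x, sigma_y, W)


# Six orientations spaced pi/6 apart, starting at 0.
thetas = [n * math.pi / 6 for n in range(6)]

# Illustrative parameter values (assumptions, not taken from the chapter).
g0 = gabor(0.0, 0.0, sigma_x=2.0, sigma_y=3.0, W=0.1)
g_same = gabor_mn(1.0, 2.0, m=0, theta=0.0, a=2.0,
                  sigma_x=2.0, sigma_y=3.0, W=0.1)
```

    With four scales and six orientations there are 24 filters; taking the mean and standard deviation of each filter response magnitude gives the 48 features mentioned in the text.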

    After all the color and texture features have been extracted offline, we concatenate the feature elements from all the individual features into an overall feature vector with a dimension of 170. Since different components within a feature vector may represent different physical quantities, their magnitudes can be inconsistent, thereby biasing the similarity measure. We apply a Gaussian normalization to all the feature vectors to ensure that equal emphasis is put on each component within a feature vector [20].
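    A sketch of one common form of Gaussian normalization follows. It assumes the 3-sigma variant, in which each component is centered on its database mean and divided by three standard deviations so that most values fall in [-1, 1]; whether the chapter uses exactly this variant is an assumption, and the function name `gaussian_normalize` is hypothetical.

```python
import math


def gaussian_normalize(vectors):
    """Normalize each component to (x - mean) / (3 * std) across the database."""
    n, dim = len(vectors), len(vectors[0])
    out = [list(v) for v in vectors]
    for d in range(dim):
        column = [v[d] for v in vectors]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        for i in range(n):
            # A constant component carries no information; map it to 0.
            out[i][d] = (column[i] - mean) / (3.0 * std) if std > 0 else 0.0
    return out


# Two components with very different magnitudes (toy data).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
normed = gaussian_normalize(data)
```

    After normalization the two components in the toy data occupy the same numeric range, so neither dominates a Euclidean distance.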

    3 Soft-Labeling Fuzzy Support Vector Machine

    3.1 Proposed Concept of Soft Labeling

    Conventional relevance feedback in interactive CBIR systems uses only the labeled images for learning. However, the labeled images are available only in small quantities, since it is not user-friendly to make the users label too many images for feedback. This results in the small sample problem, where learning from such a small number of training samples may not produce good retrieval results, even for a powerful learning machine such as SVM. Therefore, it is imperative to solve the small sample problem faced by relevance feedback.

    Considering that obtaining a large number of labeled images is labor intensive while unlabeled images are readily available and abundant, we propose to augment the available labeled images by making use of unlabeled images. It is worth noting that unlabeled images can degrade the performance if used improperly. Consequently, they should be carefully chosen so that they are beneficial to the retrieval performance. Each selected unlabeled image is assigned a soft-label of either "relevant" or "irrelevant" based on an algorithm to be explained in Sect. 3.2. These soft-labeled images are fuzzy in nature since they are not explicitly labeled by the users. Therefore, the potential imprecision embedded in their class information should be taken into consideration. We employ a fuzzy membership function to determine the degree of uncertainty for each soft-labeled image, hence putting into context the relative importance of these images. These soft-labeled samples are then combined with the labeled images to train the FSVM.

    3.2 Selection of Unlabeled Images and Label Propagation

    In this work, we present a method to select the unlabeled images by studying the characteristics of the labeled images. The selection criterion is to determine certain informative samples among the unlabeled ones that are similar to the labeled images in terms of the visual features, for soft-labeling and fuzzy membership estimation. The enlarged hybrid data set consisting of both soft-labeled and explicitly labeled samples is then utilized to train the FSVM.

    It is observed that the labeled images usually exhibit local characteristics of image similarity. To exploit this property, it is desirable to adopt a multi-cluster local modeling strategy. Taking into account the local multi-cluster nature of image similarity, we employ a two-stage clustering process to determine the local clusters. The labeled samples are clustered according to their types: relevant or irrelevant. K-means clustering is one of the most widely used clustering algorithms. It groups the samples into K clusters by using an iterative algorithm that minimizes the sum of distances from each sample to its respective cluster centroid for all the clusters. Notwithstanding its attractive features, K-means clustering requires a specified number of clusters in advance and is sensitive to the initial estimates of the clusters. To rectify this difficulty, we adopt a two-stage clustering strategy in this work. First, subtractive clustering is employed as a preprocessing step to estimate the number and structure of clusters, as it is fast, efficient, and does not require the number of clusters to be specified a priori [3]. These estimates are then employed by K-means to perform clustering based on iterative optimization in the second stage.
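    The second stage, K-means started from externally supplied center estimates, can be sketched as below. This is a generic Lloyd-style iteration using squared Euclidean distance, written for illustration; the function name `kmeans` and the toy points are assumptions, and the initial centers would in practice come from the first clustering stage.

```python
def kmeans(samples, centers, max_iter=100):
    """Lloyd's iterations starting from the given initial centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy data: two well-separated groups and one initial center in each.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers, labels = kmeans(pts, centers=[(0.0, 0.0), (5.0, 5.0)])
```

    Because the number of clusters and their starting positions are passed in, the routine avoids the two weaknesses of plain K-means noted in the text.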

    Subtractive clustering treats each sample as a potential cluster center. It computes a potential field which determines the likelihood of a sample being a cluster center. Let $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^R$ be a set of R-dimensional samples to be clustered. The initial potential function $P_{k=1}(i)$ of the ith sample $x_i$, expressed in terms of the Euclidean distance to the other samples $x_j$, is defined as:

    $$P_{k=1}(i) = \sum_{j=1}^{n} \exp\left(-\frac{\|x_i - x_j\|^2}{(r_a/2)^2}\right), \quad i = 1, \ldots, n, \tag{5}$$

    where $r_a$ is a positive coefficient defining the range of the field. The potential function has large values at densely populated neighborhoods, suggesting a strong likelihood that clusters may exist in these regions. The subtractive clustering algorithm can be summarized as follows:

    1. Compute $P_{k=1}(i)$ for $i = 1, \ldots, n$ and select the sample with the highest potential as the first cluster center. Let $x_1^*$ and $P_1^*$ denote the first cluster center and its potential, respectively.

    2. For $k = 2, \ldots, K$, update the potential of each sample according to:

    $$P_k(i) = P_{k-1}(i) - P_{k-1}^* \exp\left(-\frac{\|x_i - x_{k-1}^*\|^2}{(r_b/2)^2}\right), \quad i = 1, \ldots, n, \tag{6}$$

    where $x_{k-1}^*$ and $P_{k-1}^*$ are the (k−1)th cluster center and its potential value, $r_b$ is a positive coefficient defining the neighborhood radius for potential reduction, and $K$ is the maximum number of potential clusters. Equation (6) serves to remove the residual potential of the (k−1)th cluster center from the current kth iteration field. The samples that are close to the (k−1)th cluster center will experience a greater reduction in potential, hence reducing their likelihood of being chosen as the next center. Let $x_k^*$ be the sample with the maximum potential $P_k^*$ in the current kth iteration. The following criteria are used to determine whether it should be selected as the current cluster center:

        if $P_k^*/P_1^* > A$, accept $x_k^*$ as the kth cluster center
        else if $P_k^*/P_1^* < R$, reject $x_k^*$ and terminate the clustering process
        else if $P_k^*/P_1^* + d_{min}/r_a \geq 1$, accept $x_k^*$ as a cluster center
        else reject $x_k^*$, set its potential to zero ($P_k^* \leftarrow 0$), and repeat the process with the sample with the next highest potential
        endif

    3. Repeat step 2 until the termination criterion is satisfied or the maximum number of iterations is reached.

    In step 2, $A$ is the acceptance ratio above which a sample will be accepted as a cluster center, $R$ is the rejection ratio below which a sample will be rejected, and $d_{min}$ is the shortest distance between $x_k^*$ and all previously found cluster centers. If the potential of the sample falls between the acceptance and rejection ratios, we accept it only if it achieves a good compromise between having a reasonable potential and being sufficiently far from all existing cluster centers.
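    The three steps above can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the exponent scaling by (r/2)^2 follows the commonly used form of subtractive clustering, the default acceptance and rejection ratios (0.5 and 0.15) are illustrative values, and the function name `subtractive_clustering` and the toy points are hypothetical.

```python
import math


def subtractive_clustering(samples, ra, rb, accept=0.5, reject=0.15, max_k=10):
    """Return indices of the samples chosen as cluster centers."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    n = len(samples)
    # Step 1: initial potential of every sample; the best one is center 1.
    pot = [sum(math.exp(-sqdist(samples[i], samples[j]) / (ra / 2) ** 2)
               for j in range(n)) for i in range(n)]
    first = max(range(n), key=lambda i: pot[i])
    p1 = pot[first]
    centers = [first]
    last, p_last = first, p1
    while len(centers) < max_k:
        # Step 2: subtract the influence of the most recent center.
        pot = [pot[i] - p_last * math.exp(-sqdist(samples[i], samples[last])
                                          / (rb / 2) ** 2) for i in range(n)]
        while True:
            cand = max(range(n), key=lambda i: pot[i])
            ratio = pot[cand] / p1
            if ratio > accept:
                break                      # clearly strong enough: accept
            if ratio < reject:
                return centers             # too weak: terminate clustering
            dmin = min(math.sqrt(sqdist(samples[cand], samples[c]))
                       for c in centers)
            if dmin / ra + ratio >= 1:
                break                      # far enough from old centers
            pot[cand] = 0.0                # reject, try next-highest sample
        centers.append(cand)
        last, p_last = cand, pot[cand]
    return centers


# Toy data: a group of four points and a group of three points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
found = subtractive_clustering(points, ra=1.0, rb=1.5)
```

    On the toy data the procedure stops after two centers, one per group, without the number of clusters being given in advance; the returned indices can then seed K-means.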

    After subtractive clustering, we obtain a set of cluster centers, which is used as the initial center estimates for K-means clustering. Two separate sets of clusters, relevant and irrelevant, are then obtained after the two-stage clustering. Unlabeled image selection and soft-label assignment are then based on a similarity measure analogous to the K-NN technique. That is, samples close in distance will potentially have similar class labels. For each cluster formed by the labeled images using the two-stage clustering scheme, the K nearest unlabeled neighbors are chosen based on their Euclidean distances to the center of the respective labeled cluster. The label (relevant or irrelevant) of each labeled cluster is then propagated to the unlabeled neighbors. This is referred to as the soft-labeling process. As the computational cost increases with the number of soft-labeled images, only the most similar neighbor for each cluster is selected in this work.
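    The label propagation step with a single nearest neighbor per cluster can be sketched as below. The function name `propagate_soft_labels`, the +1/-1 label encoding, and the tie-handling rule when two clusters select the same image are assumptions introduced for this illustration.

```python
def propagate_soft_labels(cluster_centers, cluster_labels, unlabeled):
    """Give each cluster's label to its nearest unlabeled sample.

    Returns a dict {unlabeled_index: label}; labels are +1 (relevant) or
    -1 (irrelevant).
    """
    soft = {}
    for center, label in zip(cluster_centers, cluster_labels):
        nearest = min(
            range(len(unlabeled)),
            key=lambda i: sum((a - b) ** 2
                              for a, b in zip(unlabeled[i], center)))
        # Assumption of this sketch: if two clusters pick the same image,
        # the first assignment is kept.
        soft.setdefault(nearest, label)
    return soft


# One relevant cluster near the origin, one irrelevant cluster near (10, 10).
soft = propagate_soft_labels(
    cluster_centers=[(0.0, 0.0), (10.0, 10.0)],
    cluster_labels=[+1, -1],
    unlabeled=[(9.0, 9.0), (1.0, 0.0), (5.0, 5.0)],
)
```

    The unlabeled image at (5, 5) is not the nearest neighbor of either cluster center, so it receives no soft-label and stays out of the training set.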

    3.3 Soft Relevance Membership Estimation for Soft-Labeled Images

    In consideration of the potential fuzziness associated with the soft-labeled im-ages, our objective here is to determine a soft relevance membership functiong(xk) : RR [0, 1] that assesses each soft-labeled image xk and assigns ita proper relevance value between zero and one. The estimated relevance ofthe soft-labeled images is then used in FSVM training. In this study, g(xk) isdetermined by two measures, fC(xk) and fA(xk). First, since clustering hasbeen performed on each positive (relevant) and negative (irrelevant) class sep-arately to get multiple clusters per class, the obtained clusters in each classcan be employed to generate the membership value of xk, namely fC(xk). Fur-ther, the agreement between the predicted label obtained in Sect. 3.2, and thepredicted label obtained from the trained FSVM can also be utilized to assessthe degree of relevance of the soft-labeled samples, namely fA(xk). These twomeasures aecting the fuzzy membership are combined together to producethe nal soft relevance estimate, namely:

    $$g(x_k) = f_C(x_k)\, f_A(x_k). \tag{7}$$

    Let $v_{Si}$ denote the center of the $i$th cluster with the same class label as the soft-labeled image $x_k$, and let $v_{Oj}$ denote the center of the $j$th cluster with the opposite class label to $x_k$. Then $\min_i (x_k - v_{Si})^T (x_k - v_{Si})$ and $\min_j (x_k - v_{Oj})^T (x_k - v_{Oj})$ represent the distances between $x_k$ and the nearest cluster centers with the same and opposite class labels, respectively. We then define the following expression:

    $$Q(x_k) = \frac{\min_i (x_k - v_{Si})^T (x_k - v_{Si})}{\min_j (x_k - v_{Oj})^T (x_k - v_{Oj})}. \tag{8}$$

    Intuitively, the closer a soft-labeled image is to the nearest cluster of the same class label, the higher its degree of relevance. In contrast, the closer a soft-labeled image is to the nearest cluster of the opposite class label, the lower its degree of relevance. Based on this argument, an exponentially based fuzzy function is selected:

    $$f_C(x_k) = \begin{cases} \exp(-a_1 Q(x_k)) & \text{if } Q(x_k) < 1 \\ 0 & \text{otherwise} \end{cases}, \tag{9}$$

  • Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine 281

    where $a_1 > 0$ is a scaling factor. This membership function covers two scenarios. If the distance ratio is smaller than 1, suggesting that the soft-labeled image is closer to the nearest cluster with the same class label, then we estimate its soft relevance. Otherwise, if the soft-labeled image is closer to the nearest cluster with the opposite class label, a zero value is assigned.
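    A small numerical sketch of (8) and (9) may help fix ideas. The function names and the toy cluster centers are assumptions introduced here for illustration; the decaying exponential and the default a1 = 1 follow the description in the text and the parameter values reported in Sect. 4.1. Note that (8) uses squared distances, which the sketch reproduces.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance (x - v)^T (x - v)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def distance_ratio(x, same_centers, opposite_centers):
    """Q(x_k): nearest same-class distance over nearest opposite-class distance."""
    nearest_same = min(sq_dist(x, v) for v in same_centers)
    nearest_opp = min(sq_dist(x, v) for v in opposite_centers)
    if nearest_opp == 0.0:
        # x sits exactly on an opposite-class center: treat the ratio as infinite.
        return math.inf
    return nearest_same / nearest_opp

def f_c(x, same_centers, opposite_centers, a1=1.0):
    """Cluster-based membership: exp(-a1 * Q) when Q < 1, otherwise 0."""
    q = distance_ratio(x, same_centers, opposite_centers)
    return math.exp(-a1 * q) if q < 1.0 else 0.0

# Hypothetical centers on a line: same-class at the origin, opposite-class at x = 3.
same = [[0.0, 0.0]]
opposite = [[3.0, 0.0]]
near_same = f_c([1.0, 0.0], same, opposite)  # Q = 1/4, membership exp(-0.25)
near_opp = f_c([2.0, 0.0], same, opposite)   # Q = 4/1, membership 0
```

    The hard cutoff at Q = 1 means an image lying closer to an opposite-class center contributes nothing to training, regardless of the value of a1.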

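    The second factor of the fuzzy function is chosen as a sigmoid function as follows:

    $$f_A(x_k) = \begin{cases} \dfrac{1}{1 + \exp(-a_2 y)} & \text{if the soft-label is positive} \\[2ex] \dfrac{1}{1 + \exp(a_2 y)} & \text{otherwise} \end{cases}, \tag{10}$$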

    where $a_2 > 0$ is a scaling factor and $y$ is the directed distance of the soft-labeled image $x_k$ to the FSVM boundary (the decision function output of FSVM for the soft-labeled image $x_k$). We will explain the rationale of the fuzzy expression in (10) by first considering that the soft-label of the selected image has been determined as positive in Sect. 3.2. In this case, the upper equation in (10) will be used. If $y$ has a large positive value, this will suggest that it is most likely to be a relevant image. Since there is a strong agreement between the predicted soft-label from Sect. 3.2 and the predicted class label using the trained FSVM, its fuzzy membership value should be set to a large value close to unity. If $y$ has a large negative value, this will suggest that it is most likely to be an irrelevant image. Since there is a strong disagreement between the predicted soft-label from Sect. 3.2 and the predicted class label using the trained FSVM, its fuzzy membership value should be set to a small value close to zero. The same arguments apply when the soft-label of the selected image has been determined to be negative in Sect. 3.2.
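    The agreement measure and the combined estimate in (7) can be sketched in a few lines. The function names are hypothetical, and encoding the soft-label as +1 or -1 so that both branches of (10) collapse into one expression is a convenience chosen here, not something stated in the chapter.

```python
import math

def f_a(y, soft_label, a2=3.0):
    """Agreement-based membership of (10).

    y:          FSVM decision value for the soft-labeled image
    soft_label: +1 (relevant) or -1 (irrelevant), as assigned in Sect. 3.2
    """
    # For a positive label this is 1 / (1 + exp(-a2*y)); for a negative label
    # the sign of y flips, giving 1 / (1 + exp(a2*y)).
    z = -a2 * soft_label * y
    z = max(-700.0, min(700.0, z))  # keep math.exp within floating-point range
    return 1.0 / (1.0 + math.exp(z))

def soft_relevance(fc_value, y, soft_label, a2=3.0):
    """g(x_k) = f_C(x_k) * f_A(x_k), with f_C computed separately."""
    return fc_value * f_a(y, soft_label, a2)

agree = f_a(2.0, +1)      # positive label, confidently positive FSVM output
disagree = f_a(2.0, -1)   # negative label, but FSVM says positive
g_value = soft_relevance(0.8, 2.0, +1)  # 0.8 is a hypothetical f_C value
```

    Because the two branches are mirror images, the memberships for the two possible labels of the same image always sum to one; an image on the decision boundary (y = 0) receives 0.5 under either label.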

    3.4 Support Vector Machine (SVM) and Active Learning

    SVM is an implementation of the method of structural risk minimization (SRM) [28]. This induction principle is based on the fact that the error rate of a learning machine on test data (i.e., the generalization error rate) is bounded by the sum of the training error rate and a term that depends on the Vapnik–Chervonenkis (VC) dimension. The basic idea of SVM involves first transforming data in the original input space to a higher dimensional feature space by utilizing the technique known as the kernel trick. In doing so, nonlinearly separable data can be transformed into a linearly separable feature space. An optimal decision hyperplane can then be constructed in this high dimensional feature space by maximizing the margin of separation between positive and negative samples. A linear decision boundary constructed in the feature space corresponds to a nonlinear decision boundary in the input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the mapping in the feature space. The optimal hyperplane is determined by solving a quadratic programming (QP) problem, which can be converted to its dual problem by introducing Lagrangian multipliers. The training data points that are nearest to the separating hyperplane are called support vectors. The optimal hyperplane is specified only by the support vectors.

    Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of $n$ training samples, where $x_i \in \mathbb{R}^R$ is an $R$-dimensional sample in the input space, and $y_i \in \{-1, 1\}$ is the class label of $x_i$. SVM first transforms data in the original input space into a higher dimensional feature space through a mapping function $z = \varphi(x)$. It then finds the optimal separating hyperplane with minimal classification errors. The hyperplane can be represented as:

    $$w \cdot z + b = 0, \tag{11}$$

    where $w$ is the normal vector of the hyperplane, and $b$ is the bias, which is a scalar. In particular, the set $S$ is said to be linearly separable if the following inequalities hold for all training data in $S$:

    $$\begin{cases} w \cdot z_i + b \ge 1 & \text{if } y_i = 1 \\ w \cdot z_i + b \le -1 & \text{if } y_i = -1 \end{cases}, \quad i = 1, \ldots, n. \tag{12}$$

    For the linearly separable case, the optimal hyperplane can be obtained by maximizing the margin of separation between the two classes. Maximizing the margin leads to solving the following constrained optimization problem:

    $$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|w\|^2 \\ \text{subject to} \quad & y_i(w \cdot z_i + b) \ge 1, \quad i = 1, \ldots, n \end{aligned} \tag{13}$$

    This optimization problem can be solved by QP. However, for the linearly nonseparable case where the inequalities in (12) do not hold for some data points in $S$, a modification to the original SVM formulation can be made by introducing nonnegative variables $\{\xi_i\}_{i=1}^{n}$. In this case, the margin of separation is said to be soft. The constraint in (12) is modified to:

    $$y_i(w \cdot z_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n. \tag{14}$$

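    The $\{\xi_i\}_{i=1}^{n}$ are called slack variables. They measure the deviation of a data point from the ideal condition of pattern separability. Misclassifications occur when $\xi_i > 1$. The optimal separating hyperplane is then found by solving the following constrained optimization problem:

    $$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad & y_i(w \cdot z_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \end{aligned} \tag{15}$$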

    where $C$ is the regularization parameter controlling the tradeoff between margin maximization and classification error. A larger value of $C$ produces a narrower-margin hyperplane with fewer misclassifications. The optimization problem can be transformed into the following equivalent dual problem using Lagrange multipliers:

    $$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, z_i \cdot z_j \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n. \end{aligned} \tag{16}$$

    where $\alpha_i$ is the Lagrange multiplier associated with the constraints in (14). The data points that correspond to $\alpha_i > 0$ are called support vectors. The optimal solution for the weight vector $w$ is a linear combination of the training samples, which is given by:

    $$w = \sum_{i=1}^{n} \alpha_i y_i z_i. \tag{17}$$

    The decision function of the SVM can then be obtained as:

    $$f(x) = w \cdot z + b = \sum_{i=1}^{n} \alpha_i y_i \, z_i \cdot z + b = \sum_{i=1}^{n} \alpha_i y_i \, \varphi(x_i) \cdot \varphi(x) + b. \tag{18}$$

    It is noted that both the construction of the optimal hyperplane in (16) and the evaluation of the decision function in (18) only require the evaluation of dot products $\varphi(x_i) \cdot \varphi(x_j)$ or $\varphi(x_i) \cdot \varphi(x)$. This implies that we do not necessarily need to know $\varphi$ in explicit form. Instead, a function $K(\cdot, \cdot)$, called the kernel function, is introduced that can compute the inner product of two data points in the feature space, i.e., $K(x_i, x) = \varphi(x_i) \cdot \varphi(x)$. Three common types of kernels are used in SVM: the polynomial kernel, the radial basis function kernel, and the sigmoid kernel. Using this kernel trick, the dual optimization problem in (16) becomes:

    $$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n \end{aligned} \tag{19}$$

    We can then construct the optimal hyperplane in the feature space without having to know the mapping $\varphi$:

    $$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \tag{20}$$
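    To make (20) concrete, the sketch below evaluates the kernel decision function for a given set of support vectors. It assumes the common Gaussian form of the RBF kernel, exp(-||u - v||^2 / (2 sigma^2)); the multipliers, bias, and support vectors are made-up values chosen only so that the equality constraint of (19) holds, not the result of actually solving the QP.

```python
import math

def rbf_kernel(u, v, sigma=3.0):
    """Gaussian RBF kernel (assumed form): exp(-||u - v||^2 / (2 * sigma^2))."""
    sq = sum((p - q) ** 2 for p, q in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, as in (20)."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

# Hypothetical trained model: one support vector per class.
svs = [[0.0, 0.0], [4.0, 0.0]]
ys = [+1, -1]
alphas = [1.0, 1.0]   # satisfies sum_i y_i * alpha_i = 0
bias = 0.0

f_pos = decision([0.0, 0.0], svs, ys, alphas, bias)   # on the positive side
f_neg = decision([4.0, 0.0], svs, ys, alphas, bias)   # on the negative side
f_mid = decision([2.0, 0.0], svs, ys, alphas, bias)   # equidistant: on the boundary
```

    Only the support vectors enter the sum, which is why evaluation cost depends on their number rather than on the size of the training set.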

    Active learning is designed to achieve maximal information gain or minimize uncertainty in decision making. It selects the most informative samples to query the users for labeling. Among the various active learning techniques, SVM-based active learning is one of the most promising methods currently available [27]. It selects samples that are closest to the current SVM decision boundary as the most informative points. Samples that are farthest away from the boundary and on the positive side are considered as the most relevant images. The same selection strategy is adopted in this work.
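    The two selection rules just described differ only in how the decision values are sorted. A minimal sketch follows, assuming the decision values f(x) have already been computed for every database image; the function names and the score list are hypothetical.

```python
def select_for_labeling(scores, n):
    """Most informative samples: the n images closest to the boundary (smallest |f(x)|)."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:n]

def rank_by_relevance(scores, n):
    """Most relevant samples: the n images farthest from the boundary on the positive side."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Hypothetical decision values for six database images.
scores = [2.1, -0.05, 0.4, -1.7, 0.02, 1.3]
to_query = select_for_labeling(scores, 2)
to_show = rank_by_relevance(scores, 3)
```

    Images with scores near zero are the ones the current classifier is least certain about, so labeling them is expected to move the boundary the most.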


    3.5 Fuzzy Support Vector Machine (FSVM)

    Because of the nice properties of SVM, it has been successfully utilized in many real-world applications. However, SVM is still limited to crisp classification where each training example belongs to either one or the other class with equal importance. There exist situations where the training samples do not fall neatly into discrete classes. They may belong to different classes with different degrees of membership. To solve this problem, FSVM has been developed [14]. FSVM is an extended version of SVM that takes into consideration the different importance of training data. It exhibits the following properties that motivate us to adopt it in our framework: integration of fuzzy data, strong theoretical foundation, and excellent generalization power.

    In FSVM, each training sample is associated with a fuzzy membership value $\{\mu_i\}_{i=1}^{n} \in [0, 1]$. The membership value $\mu_i$ reflects the fidelity of the data, or in other words, how confident we are about the actual class information of the data. The higher its value, the more confident we are about its class label. The optimization problem of the FSVM is formulated as follows [14]:

    $$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \mu_i \xi_i \\ \text{subject to} \quad & y_i(w \cdot z_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \end{aligned} \tag{21}$$

    It is noted that the error term $\xi_i$ is scaled by the fuzzy membership value $\mu_i$. The fuzzy membership values are used to weigh the soft penalty term in the cost function of SVM. The weighted soft penalty term reflects the relative fidelity of the training samples during training. Important samples with larger membership values will have more impact in the FSVM training than those with smaller values. The detailed determination of the membership value $\{\mu_i\}_{i=1}^{n}$ has been described in Sect. 3.3, that is, $\mu_k = g(x_k)$.

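    Similar to the conventional SVM, the optimization problem of FSVM can be transformed into its dual problem as follows:

    $$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le \mu_i C, \quad i = 1, \ldots, n \end{aligned} \tag{22}$$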

    Solving (22) will lead to a decision function similar to (20), but with different support vectors and corresponding weights $\alpha_i$.
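    The only difference between the SVM and FSVM duals is the per-sample upper bound on the multipliers. A short derivation sketch shows where it comes from. Here $\mu_i$ denotes the fuzzy membership value of sample $i$, $\alpha_i$ and $\xi_i$ are the Lagrange multipliers and slack variables as before, and the multipliers $\beta_i$ for the constraints $\xi_i \ge 0$ are notation introduced only for this illustration.

```latex
% Lagrangian of the FSVM primal, with alpha_i >= 0 and beta_i >= 0
L(w, b, \xi, \alpha, \beta)
  = \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \mu_i \xi_i
    - \sum_{i=1}^{n} \alpha_i \bigl[ y_i (w \cdot z_i + b) - 1 + \xi_i \bigr]
    - \sum_{i=1}^{n} \beta_i \xi_i

% Stationarity conditions
\frac{\partial L}{\partial w} = 0
  \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i z_i ,
\qquad
\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0 ,
\qquad
\frac{\partial L}{\partial \xi_i} = 0
  \;\Rightarrow\; \mu_i C - \alpha_i - \beta_i = 0

% Since beta_i >= 0, the last condition bounds each multiplier individually
0 \le \alpha_i \le \mu_i C , \qquad i = 1, \ldots, n
```

    A sample with a small membership value therefore has a tight cap on its multiplier and can contribute only weakly to the weight vector, while a sample with membership one is treated exactly as in the conventional SVM.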

    4 Experimental Results and Discussion

    4.1 Image Database and User Interface

    The framework is developed on a PC with the following specifications: Pentium 4 2.4-GHz processor, 512-MB RAM, Windows XP, and Matlab 6.5. The performance of the framework is evaluated on an image database containing 10,000 natural images [4]. It contains 100 different semantic categories, which are predefined by the Corel Photo Gallery based on their semantic concepts as shown in Fig. 2.

    A general overview of the operation of the user interface in our retrieval system is shown in Fig. 3. Initially, the user can select a query image by browsing through the image database. The selected query is displayed at the top left corner. Next, the user can search the image database by pressing the "Search" button, and the ten most relevant images are ranked and displayed in descending order of relevance from left to right, and top to bottom. It is noted that under each displayed image, a pull-down menu is available which enables the user to select between two possible choices of feedback, relevant and irrelevant, as illustrated in the figure. The user will simply be asked to mark each displayed image as either relevant or irrelevant according to his/her information need. The user can then submit his/her feedback by pressing the "Feedback" button. The system then learns from the feedback images, and presents a new ranked list of images to the user for further feedback. The process continues until the user is satisfied with the retrieved results.

    Fig. 2. Selected sample images from the database

    Fig. 3. Illustration of user interface

    The proposed soft-labeling framework can be implemented in practical applications such as image retrieval through bandwidth-limited, display-constrained devices, e.g., mobile phones with a camera, where only a small number of images is displayed to the user. For instance, a girl in the zoo sees a fox squirrel that is of interest to her and would like to find more similar squirrel images. Therefore, she takes a picture of the fox squirrel using her mobile phone and sends it as a query to the server. The server then performs a similarity comparison with the images in the database and retrieves a set of images. If the girl is unsatisfied with the retrieval results, she may provide feedback on the retrieved images displayed on the screen of her mobile phone. Conventional relevance feedback methods are not able to achieve improved performance with such a small number of feedback samples. In contrast, the proposed framework strives to utilize the unlabeled images to augment the available labeled images. In doing so, the girl can get satisfactory retrieval results within the first few iterations. Further, if the girl is cooperative and willing to provide more than one screen of feedback images before seeing the results, the proposed framework with active learning is of great value. After getting feedback for one or more screens of training images, the system can select the most informative samples to query the girl for labeling to achieve maximal information gain or minimized uncertainty in decision-making.

    The proposed method is applied in our retrieval system. Subtractive clustering is utilized to determine the cluster centers of the relevant and irrelevant images. It uses the following parameters: $r_a$ is set to 0.075 and 0.25 for relevant and irrelevant samples, respectively, with $r_b = 1.2 r_a$, A = 0.5, and R = 0.2. The RBF kernel, $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$, is used for SVM, where $\sigma = 3$ and the regularization parameter $C = 100$. The following parameters are used for soft relevance membership estimation of soft-labeled images: $a_1 = 1$, $a_2 = 3$.

    4.2 Performance Evaluation

    In our experiment, we use an objective measure to evaluate the performance of the proposed soft-labeling method using FSVM. The objective measure is based on Corel's predefined ground truth. That is, the retrieved images are judged to be relevant if they come from the same category as the query. One hundred queries, one from each category, are selected for evaluation.


    Fig. 4. The average precision-vs.-recall graphs after the first iteration of active learning

    Retrieval performance is evaluated by ranking the database images according to their directed distances to the SVM boundary after each active learning iteration. Five iterations of feedback are recorded. The precision-vs.-recall curve is a standard performance measure in information retrieval, and is adopted in our experiment [22]. Precision is the number of retrieved relevant images over the total number of retrieved images. Recall is defined as the number of retrieved relevant images over the total number of relevant images in the collection. The precision and recall rates are averaged over all the queries. The average precision-vs.-recall (APR) graph after the first iteration of active learning for five initial labeled images ($l_0 = 5$) is shown in Fig. 4. It is observed that the precision rate decreases as recall increases. This means that when more relevant images are retrieved, a higher percentage of irrelevant images will probably be retrieved as well.
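    The two definitions translate directly into code. The sketch below computes a precision and recall pair at every cutoff of a ranked list for a single query; the function name and the example ranking are hypothetical, and averaging over queries is left out for brevity.

```python
def precision_recall(ranked_relevance, total_relevant):
    """Precision and recall after each retrieved image.

    ranked_relevance: booleans in ranked order, True if the image is relevant
    total_relevant:   number of relevant images in the whole collection
    Returns a list of (precision, recall) pairs, one per cutoff.
    """
    points = []
    hits = 0
    for n, is_relevant in enumerate(ranked_relevance, start=1):
        hits += 1 if is_relevant else 0
        points.append((hits / n, hits / total_relevant))
    return points

# Hypothetical ranking: 3 of the first 5 retrieved images are relevant,
# out of 4 relevant images in the collection.
ranking = [True, True, False, True, False]
curve = precision_recall(ranking, total_relevant=4)
```

    In this toy ranking, precision falls from 1.0 to 0.6 while recall rises from 0.25 to 0.75, which is the trade-off the precision-vs.-recall curve visualizes.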

    In addition, we have adopted another measure called retrieval accuracy to evaluate the retrieval system [9,25]. The performance of the proposed method is given in Fig. 5 for the case of $l_0 = 10$. The retrieval accuracy is averaged over the 100 queries. We observe that the retrieval accuracy of the proposed method increases quickly in the initial stage. This is a desirable feature since the user can obtain satisfactory results quickly. It is worth emphasizing that the initial retrieval performance is very important since users often expect quick results and are unwilling to provide much feedback. Hence, reducing the amount of user feedback while providing good retrieval results is of great interest for many CBIR systems. Further, the method reaches a high steady-state retrieval accuracy of 95% in about five feedback iterations, which is an improvement of 35% over its initial retrieval accuracy.


    Fig. 5. Retrieval accuracy in top ten results

    5 Conclusions

    This chapter presents a soft-labeling framework that addresses the small sample problem in interactive CBIR systems. The technique incorporates soft-labeled images into FSVM along with labeled images for effective retrieval. By exploiting the characteristics of the labeled images, soft-labeled images are selected through an unsupervised clustering algorithm. Further, the relevance of the soft-labeled images is estimated using the fuzzy membership function. FSVM-based active learning is then performed based on the hybrid of soft-labeled and explicitly labeled images. Experimental results confirm the effectiveness of our proposed method.


    References

    1. Amarnath G, Ramesh J (1997) Visual information retrieval. Communications of ACM 40(5):70–79

    2. Chen Y, Zhou XS, Huang TS (2001) One-class SVM for learning in image retrieval. Proceedings of the IEEE International Conference on Image Processing, pp. 815–818

    3. Chiu S (1994) Fuzzy model identification based on cluster estimation. Journal of Intelligent & Fuzzy Systems 2(3):267–278

    4. Corel Gallery Magic 65000 (1999) http://www.corel.com

    5. Cox IJ, Miller ML, Minka TP, Papathomas TV, Yianilos PN (2000) The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing 9(1):20–37

    6. Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: The QBIC system. IEEE Computer 28(9):23–32

    7. Gevers T, Smeulders AWM (2000) PicToSeek: Combining color and shape invariant features for image retrieval. IEEE Transactions on Image Processing 9:102–119

    8. Guo GD, Jain AK, Ma WY, Zhang HJ (2002) Learning similarity measure for natural image retrieval with relevance feedback. IEEE Transactions on Neural Networks 13(4):811–820

    9. He XF, King O, Ma WY, Li MJ, Zhang HJ (2003) Learning a semantic space from user's relevance feedback for image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 13:39–48

    10. Huang J, Kumar SR, Mitra M (1997) Combining supervised learning with color correlograms for content-based image retrieval. Proceedings of ACM Multimedia, pp. 325–334

    11. Joachims T (1999) Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, pp. 200–209

    12. Laaksonen J, Koskela M, Oja E (2002) PicSom – self-organizing image retrieval with MPEG-7 content descriptions. IEEE Transactions on Neural Networks 13(4):841–853

    13. Lee HK, Yoo SI (2001) A neural network-based image retrieval using nonlinear combination of heterogeneous features. International Journal of Computational Intelligence and Applications 1(2):137–149

    14. Lin CF, Wang SD (2002) Fuzzy support vector machines. IEEE Transactions on Neural Networks 13(2):464–471

    15. Manjunath BS, Ma WY (1996) Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:837–842

    16. Markus S, Markus O (1995) Similarity of color images. Proceedings of SPIE Storage and Retrieval for Image and Video Databases

    17. Muneesawang P, Guan L (2002) Automatic machine interactions for content-based image retrieval using a self-organizing tree map architecture. IEEE Transactions on Neural Networks 13(4):821–834

    18. Pentland A, Picard R, Sclaroff S (1994) Photobook: tools for content-based manipulation of image databases. Proceedings of SPIE 2185:34–47

    19. Rui Y, Huang TS, Mehrotra S (1997) Content-based image retrieval with relevance feedback in MARS. IEEE International Conference on Image Processing, Washington DC, USA, pp. 815–818

    20. Rui Y, Huang TS, Ortega M, Mehrotra S (1998) Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8(5):644–655

    21. Rui Y, Huang TS (2000) Optimizing learning in image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition 1:236–243

    22. Salton G, McGill MJ (1982) Introduction to Modern Information Retrieval. New York: McGraw-Hill

    23. Smith JR, Chang SF (1996) VisualSEEk: a fully automated content based image query system. Proceedings ACM Multimedia

    24. Smith JR, Chang SF (1996) Automated binary texture feature sets for image retrieval. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Atlanta, GA

    25. Su Z, Zhang HJ, Li S, Ma SP (2003) Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning. IEEE Transactions on Image Processing 12:924–937

    26. Swain M, Ballard D (1991) Color indexing. International Journal of Computer Vision 7(1):11–32

    27. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. Proceedings of the Ninth ACM Conference on Multimedia

    28. Vapnik VN (1995) The Nature of Statistical Learning Theory. New York: Springer-Verlag

    29. Vasconcelos N, Lippman A (1999) Learning from user feedback in image retrieval systems. Proceedings of Neural Information Processing Systems, Denver, Colorado

    30. Wang L, Chan KL (2003) Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 629–634

    31. Wang L, Chan KL (2004) Incorporating prior knowledge into SVM for image retrieval. Proceedings of the IEEE International Conference on Pattern Recognition, pp. 981–984

    32. Wu Y, Tian Q, Huang TS (2000) Discriminant-EM algorithm with application to image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, South Carolina

    33. Yap KH, Wu K (2005) Fuzzy relevance feedback in content-based image retrieval systems using radial basis function network. Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, pp. 177–180

    34. Yap KH, Wu K (2005) A soft relevance framework in content-based image retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 15(12):1557–1568

  • Temporal Texture Characterization: A Review

    Ashfaqur Rahman1 and Manzur Murshed2

    1 Department of Computer Science, American International University Bangladesh, Dhaka, Bangladesh, ashfaqur@aiub.edu

    2 Gippsland School of Information Technology, Monash University, Churchill, VIC, Australia, manzur.murshed@infotech.monash.edu.au

    Summary. A large class of objects commonly experienced in real-world scenarios exhibits characteristic motion with certain forms of regularity. Contemporary literature coined the term "temporal texture"¹ to identify image sequences of such motion patterns that exhibit spatiotemporal regularity. The study of temporal textures dates back to the early nineties. Many researchers in the computer vision community have formulated techniques to analyse temporal textures. This chapter aims to provide a comprehensive literature survey of the existing temporal texture characterization techniques.

    1 Introduction

    Temporal textures are textures with motion, like real-world image sequences of sea waves, smoke, fire, etc., that possess some stationary properties over space and time. The motion assembly by a flock of flying birds, water streams, fluttering leaves, and waving flags also serves to illustrate such motion. Temporal texture characterization is of vital importance to computer vision, electronic entertainment, and content-based video coding research, with a number of potential applications in areas including recognition (automated surveillance and industrial monitoring), synthesis (animation and computer games), and segmentation (robot navigation and MPEG-4).

    The phenomena commonly observed in temporal textures have prompted many researchers to formulate techniques to analyse these distinctive motion patterns. As observed in the current literature, research is mostly devoted to developing features and models for characterizing temporal texture motion patterns. Our main focus here is on briefly describing the working principles of temporal texture characterization techniques. Besides characterization, there are some recent research works on synthesis, coding, segmentation, and retrieval of temporal texture image sequences.

    ¹ Some authors used the term "dynamic texture" [6] to identify similar motion patterns.

    A. Rahman and M. Murshed: Temporal Texture Characterization: A Review, Studies in Computational Intelligence (SCI) 96, 291–316 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    The purpose of characterization is to choose a set of characteristic features or define a mathematical model from the underlying texture so that image sequences with similar textures are classified in one class (group). Characterization of temporal textures is performed from the spatiotemporal distribution of dynamics over the image sequences by extracting characteristic spatiotemporal features. Extraction of features plays an important role in the accuracy of classification. As temporal textures exhibit spatiotemporal regularity of dynamics with indeterminate spatial and temporal extent, both the spatial and the temporal domain need to be explored exhaustively. Moreover, from the real-time application point of view, the characterization process has to be quick enough for time-sensitive applications. In this chapter we elaborate on the diverse features used by different characterization techniques and analyse their effectiveness in utilizing the time–space dynamics.

    This chapter is organized as follows. In Sect. 2 we explain some background concepts and algorithms that are frequently used when discussing different temporal texture analysis techniques. The review is presented in Sect. 3, and Sect. 4 concludes this chapter.

    2 Background

    In this section we elaborate on some basic concepts essential to comprehend the detailed working principles of different temporal texture characterization techniques. In Sect. 2.1, we define an image sequence. Some motion estimation approaches most commonly used by contemporary characterization techniques are explained in Sect. 2.2. Many temporal texture characterization techniques operate on computed motion, and the resulting motion frame sequence is illustrated in Sect. 2.3. In Sect. 2.4, one of the most commonly used motion distribution statistics, namely the motion co-occurrence matrix, is defined. Section 2.5 describes two standard temporal texture datasets commonly used by researchers in their experiments.

    2.1 Image Sequence

    A digital image is a collection of picture elements (pixel or pel) that are usually arranged in a rectangular grid. The number of pixels in each column and row of the grid constitutes the resolution (width × height) of the image, and a pixel is identified by its Cartesian coordinate in the grid. Various colour models are used to distinguish pixels numerically. Of these, the most commonly used RGB model uses three primary colour (red, green, blue) components, while the HSB model, the most intuitive to human perception, uses hue (colour), saturation (concentration of colour), and brightness (intensity) components to represent each pixel. The grayscale model uses just the intensity component, and it is widely favoured by signal processing researchers to avoid unnecessary complications due to retaining colour information, especially for cases where the intensity information is sufficient, such as temporal texture classification.

    Fig. 1. An 8-bit grayscale image of an eye captured at a resolution of 50 × 42 pixels and printed (a) dot-for-dot and (b) enlarged without altering the resolution

    Resolution plays a significant role in the perceived quality of the image, especially in the context of its physical size, as evident in Fig. 1, where the same 50 × 42 pixel 8-bit grayscale image (the intensity value of each pixel is drawn from the range $[0, 2^8 - 1]$) is printed in two different sizes. Note that the resolution of an image can be altered using subsampling or supersampling with interpolation to match the physical size (not applied in Fig. 1b). But this requires extra processing, and the quality would not be as good as if the image had been captured in that (altered) resolution.

    Now consider a sensor located at a specific position in the three-dimensional (3D) world space, capturing images (frames) of the scene, one after another, at a specified frame rate. As time goes by, the images form a sequence, which can be expressed with a brightness function $I_t(x, y)$ representing the intensity of the pixel at coordinate $(x, y)$ in the image $I$ captured at time $t$. A digital video is a fitting example of an image sequence, where images are normally captured at a high enough frame rate (e.g., 25 frames per second in PAL) so that the persistence of vision (0.1 s for most human beings) can be exploited to create the illusion of motion.

    2.2 Motion Estimation

    In the field of signal processing, motion analysis is mainly concerned with the 2D motion in the image plane. The translational model is most frequently used in the field, assuming that the change between two successive frames is due to the motion of moving objects during the time interval between the frames. In many cases, as long as the frame rate is high enough, the assumption is valid. By motion analysis, we thus mean the estimation of this translational motion in the form of displacement or velocity vectors.

    There are two kinds of techniques in 2D motion analysis: correlation and differential techniques. The first one belongs to the group of region matching, whereas differential techniques are used to compute pixel motion, widely known as the optical flow. With region matching, the current frame is divided into non-overlapping regions of interest, and for each region the best match is searched in the reference frame. Both optical flow and region matching techniques are now discussed in detail in the following sections.
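    As a rough illustration of region matching, the sketch below performs an exhaustive search for one block using the sum of absolute differences (SAD) as the matching cost. The cost function, the search window, the function names, and the toy frames are all assumptions made here for illustration; the text does not prescribe a particular matching criterion.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block of `cur` and a displaced block of `ref`."""
    return sum(abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
               for r in range(size) for c in range(size))

def match_block(cur, ref, top, left, size, search):
    """Displacement (dy, dx) of the best match in `ref` for the block at (top, left) of `cur`."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= top + dy and top + dy + size <= height
                    and 0 <= left + dx and left + dx + size <= width):
                continue
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]

def frame_with_patch(top, left, size=6):
    """Toy 6 x 6 frame: black background with a bright 2 x 2 patch."""
    frame = [[0] * size for _ in range(size)]
    for r in range(2):
        for c in range(2):
            frame[top + r][left + c] = 9
    return frame

ref = frame_with_patch(1, 1)   # reference frame
cur = frame_with_patch(2, 3)   # current frame: patch moved 1 down, 2 right
motion = match_block(cur, ref, top=2, left=3, size=2, search=2)
```

    The returned displacement points from the block in the current frame back to its match in the reference frame, so the motion of the patch itself is the negated vector.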

    Optical Flow

    Optical flow refers to the 2D distribution of apparent velocities of the movement of intensity patterns in an image plane. In other words, an optical flow field consists of a dense velocity field with one velocity vector per pixel in the image plane. If the time interval between two successive frames is known, then velocity vectors and displacement vectors can be computed from one set to the other. In this sense, optical flow is a technique used for displacement estimation.

    As optical flow is caused by the movement of intensity patterns rather than the objects' motion, 2D motion and optical flow are generally different. Imagine a uniform sphere rotating with constant speed in the scene. Assume the luminance and all other conditions do not change at all when frames are captured. As there is no change in the brightness patterns, the optical flow is zero, whereas the 2D motion field is obviously not zero. Thus optical flow cannot be estimated based on image intensities alone unless an additional constraint, e.g., smoothness of the contour [46], is imposed. Such constraints are either difficult to implement in practice or are not true over the entire image.

    Apart from the above-mentioned difficulty, the estimation of motion using optical flow usually involves iterations that require a long processing time. This may generate a large amount of overhead, rendering a recognition task inefficient. Although there are some near real time optical flow estimation algorithms [2, 3], the quality of the estimated motion is not adequate to classify temporal textures accurately [20].

    One obvious alternative for real time motion estimation is to estimate the approximated normal flow, which is orthogonal to the contour and is thus the gradient-parallel component of the optical flow. It takes only three partial derivatives of the spatiotemporal image intensity function $I$ to estimate the normal flow. Although the full displacement is not recoverable, this partial flow provides sufficient information for the purpose of motion-based recognition.

    Computation of normal flow from an image sequence can be explained by deriving a brightness invariance equation. If we assume that image intensity at a pixel $(x, y)$ in the image plane remains unchanged between times $t$ and $t + \delta t$, we may write [22, 46]

    $$I_t(x, y) = I_{t+\delta t}(x + \delta x,\; y + \delta y), \qquad (1)$$

    where $\delta t$, $\delta x$, and $\delta y$ are minuscule time, horizontal displacement and vertical displacement, respectively. By expanding this equation and ignoring the higher order terms, we get

  • Temporal Texture Characterization: A Review 295


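    $$\frac{\partial I}{\partial x}\,\delta x + \frac{\partial I}{\partial y}\,\delta y + \frac{\partial I}{\partial t}\,\delta t = 0, \qquad (2)$$

    where $\partial I/\partial x$, $\partial I/\partial y$, and $\partial I/\partial t$ are partial derivatives of the intensity function with respect to variables $x$, $y$ and $t$. Dividing the equation by $\delta t$, we obtain

    $$\frac{\partial I}{\partial x}\,\frac{\delta x}{\delta t} + \frac{\partial I}{\partial y}\,\frac{\delta y}{\delta t} + \frac{\partial I}{\partial t} = 0 \qquad (3)$$

    or

    $$\mathbf{v} \cdot \operatorname{grad}(I) + \frac{\partial I}{\partial t} = 0, \qquad (4)$$

    where $\mathbf{v} = \left(\frac{\delta x}{\delta t}, \frac{\delta y}{\delta t}\right)$ is the optical flow velocity and $\operatorname{grad}(I) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)$ is the gradient of $I$. Without any additional constraint, it is impossible to calculate $\mathbf{v}$ from (4), as this linear equation has two unknowns: the x- and y-components of $\mathbf{v}$. This is formally known as the aperture problem. The gradient-parallel component of $\mathbf{v}$, i.e., the normal flow $\mathbf{v}_N$, can however be computed from (4) as

    $$\mathbf{v}_N = \frac{-\,\partial I/\partial t}{\sqrt{\left(\partial I/\partial x\right)^2 + \left(\partial I/\partial y\right)^2}}\;\mathbf{u}, \qquad (5)$$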

    where $\mathbf{u}$ is the unit vector along the direction of the gradient $\operatorname{grad}(I)$. The normal flow field is fast to compute [46] and can be directly estimated without any of the iterative schemes used by complete optical flow (complete flow) estimation methods [22]. Moreover, it contains both temporal and structural information on temporal textures; temporal information is related to moving edges, while spatial information is linked to the edge gradient vectors. Researchers are thus motivated to use normal flow to characterize temporal textures, as evidenced in the literature.
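The normal flow computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the cited studies: the function name `normal_flow`, the use of central differences for the spatial derivatives, and the plain frame difference for the temporal derivative are all assumptions made here.

```python
import numpy as np

def normal_flow(prev, curr, eps=1e-6):
    """Return (magnitude, ux, uy) of the normal flow between two grayscale frames.

    magnitude is the signed length of the flow along the gradient direction,
    and (ux, uy) is the unit gradient vector u at each pixel.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    # Spatial derivatives by finite differences (assumed discretisation).
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    Iy, Ix = np.gradient(curr)
    # Temporal derivative approximated by the frame difference.
    It = curr - prev
    grad_norm = np.sqrt(Ix ** 2 + Iy ** 2)
    # Normal flow is undefined where the spatial gradient vanishes.
    valid = grad_norm > eps
    safe = np.where(valid, grad_norm, 1.0)
    magnitude = np.where(valid, -It / safe, 0.0)
    ux = np.where(valid, Ix / safe, 0.0)
    uy = np.where(valid, Iy / safe, 0.0)
    return magnitude, ux, uy
```

For a horizontal intensity ramp that shifts one pixel to the right between frames, this sketch returns a magnitude of one pixel per frame along the positive x direction. Pixels with a vanishing gradient are assigned zero flow, which is a convention chosen here rather than something the text prescribes.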

    Block Matching

    The block matching motion estimation approach, where a motion vector is associated with a block of connected pixels rather than with an individual pixel, is prevalent in video coding standards such as H.26X and MPEG-1/2/4 [43, 46, 47] due to increased coding efficiency, as fewer motion vectors are coded. With this approach, a frame is partitioned into non-overlapping blocks (termed macroblocks in video coding, which are usually rectangular and of fixed size). Each block thus generated is assumed to move as one, i.e., all pixels in a block share the same displacement vector. For each block, its best match, i.e., the one with maximum correlation, is found within a search window in the previous frame, and the motion vector is computed from the relative displacement. Although block based motion vectors are computed with a view to improving coding efficiency, they still represent some degree of true motion that is successfully exploited in motion indexing of block-based videos [43],


    [Figure 2 labels: search window; image frame $I_t$; image frame $I_{t-1}$; block centred at $I_t(x, y)$; block centred at $I_{t-1}(x, y)$; $2d + 1$; best match block; motion vector]

    Fig. 2. Block motion estimation process. The motion vector of a block of size $a \times b$ pixels centred at $I_t(x, y)$ is estimated by using a search window in frame $I_{t-1}$ centred at $I_{t-1}(x, y)$ and finding the closest block within the search window with the maximum correlation. The displacement vector from the search centre to the centre of this block gives the motion vector. In the search window a total of $(2d + 1) \times (2d + 1)$ candidate pixels need to be examined for the full search motion estimation process

    motion-based video indexing and retrieval [25], and neighbouring motion vector prediction [53]. An empirical study has also observed that the block motion's representation of true motion is significant [49]. This, along with its computational efficiency, has motivated a few researchers [37–42] to use block motion vectors in temporal texture classification.

    Figure 2 illustrates a block motion estimation process where an image frame $I_t$ is segmented into non-overlapping rectangular blocks of $a \times b$ pixels each. In practice, square blocks of $a = b = 16$ are widely used. Now consider a current block centred at $I_t(x, y)$. It is assumed that the block is translated as a whole. Consequently, only one displacement vector needs to be estimated for this block. In order to estimate the displacement vector, a rectangular search window of $(a + 2d) \times (b + 2d)$ pixels is opened in frame $I_{t-1}$ centred at pixel $I_{t-1}(x, y)$. Every distinct $a \times b$ pixel block within the search window is examined exhaustively by the full search algorithm [45] to find the best matching block, i.e., the one having the maximum correlation with the current block in frame $I_t$. If multiple blocks have the maximum correlation, the one closest to the search centre is preferred, mainly for coding efficiency, as it results in a shorter motion vector. The inverse of correlation is usually measured using the Mean Squared Error (MSE) or Mean Absolute Error (MAE) of the block pair, where the error for each pixel position is calculated as the difference in intensity values at the co-located position. Once the best matching block is found,


    the displacement of its centre from the search centre constitutes the motion vector $(\delta x, \delta y)$ of the current block, where $\delta x$ and $\delta y$ are drawn from the range $[-d, d]$. Unless an exact match is found earlier, the full search algorithm exhaustively checks all possible $(2d + 1)^2$ blocks within the search window.
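The full search procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `full_search` is hypothetical, blocks are addressed by their top-left corner rather than their centre (the resulting displacement is the same), `a` is taken as the block width and `b` as its height, MAE is used as the matching error, and candidates falling outside the reference frame are simply skipped.

```python
import numpy as np

def full_search(curr, ref, cx, cy, a=16, b=16, d=7):
    """Full search block matching with the Mean Absolute Error criterion.

    curr, ref : current and reference (previous) grayscale frames
    (cx, cy)  : top-left corner of the current block (column, row)
    Returns ((dx, dy), error) for the best matching block in `ref`.
    """
    block = curr[cy:cy + b, cx:cx + a].astype(np.float64)
    best, best_err = (0, 0), np.inf
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            x0, y0 = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + a > ref.shape[1] or y0 + b > ref.shape[0]:
                continue
            cand = ref[y0:y0 + b, x0:x0 + a].astype(np.float64)
            err = np.mean(np.abs(block - cand))
            shorter = dx * dx + dy * dy < best[0] ** 2 + best[1] ** 2
            # On ties, prefer the candidate closest to the search centre.
            if err < best_err or (err == best_err and shorter):
                best, best_err = (dx, dy), err
    return best, best_err
```

The double loop visits at most $(2d + 1)^2$ candidates, which is the cost that the fast search algorithms aim to reduce. The early exit on an exact match mentioned in the text is omitted here to keep the tie-breaking rule simple.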

    In order to reduce the search time, some alternative approaches involving logarithmic directional search, such as Triple Step Search (TSS) [46], New TSS (NTSS) [23], and Hexagon-shape Based Search (HEXBS) [58], are used; these normally check between 15 and 30 blocks. These algorithms avoid an exhaustive search by following the direction of the gradient of the error surface, which is assumed unimodal. As this underlying assumption is not necessarily always true, these fast algorithms are often trapped in local minima, which degrades the quality of motion estimation. Interestingly, nowadays there are hardware devices like Videolink/4 [55] and software solutions like Video Insight [54] that can render block based MPEG videos in real time while keeping optimal motion quality, thus making motion vectors readily available in real time, as explained in the following section.

    2.3 Motion Frame Sequence

    The term motion frame sequence is used quite frequently in this chapter. We define here what we mean by motion frames. A motion frame, computed from two successive image frames (Fig. 3) using any motion estimation algorithm, is a 2D grid. Let $M_t$ denote the $t$-th motion frame. Each entry in the frame $M_t$ denotes a motion measure that is either the motion vector or its magnitude or direction quantized to an integer value. As an example, consider the quantization process of motion magnitude using the block matching motion estimation algorithm with maximum displacement of $\pm d$ pixels. Motion magnitude $k$ is quantized to motion measure $i$ if

    $$\max(0,\; i - 0.5) \le k < \min(i + 0.5,\; d\sqrt{2}), \qquad (6)$$

    where $0 \le k < d\sqrt{2}$, $d\sqrt{2}$ is the maximum possible vector length with $\pm d$ maximum displacement, and $i \in [0, Q - 1]$, where $Q$ represents the number of possible motion measures.
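As a small illustration of this quantization rule, the sketch below rounds a motion vector length to the nearest integer measure. The function names are hypothetical, and two points are assumptions made here rather than statements from the text: the number of measures is taken as the largest admissible $i$ plus one, and a vector of exactly the maximum length $d\sqrt{2}$ is assigned to the last measure.

```python
import math

def num_measures(d):
    """Number of motion measures Q for maximum displacement d (assumed convention)."""
    return int(math.floor(d * math.sqrt(2) + 0.5)) + 1

def quantize_magnitude(dx, dy, d):
    """Quantize the length of motion vector (dx, dy) to an integer motion measure."""
    if max(abs(dx), abs(dy)) > d:
        raise ValueError("displacement exceeds the search range")
    k = math.hypot(dx, dy)          # motion magnitude
    # i - 0.5 <= k < i + 0.5 is equivalent to i = floor(k + 0.5).
    return int(math.floor(k + 0.5))
```

With $d = 3$ the maximum length is $3\sqrt{2} \approx 4.24$, so the measures are 0 to 4, giving five possible values.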

    Fig. 3. The motion frame of the Flag shown with the motion vectors superimposed on the current image frame, where motion was estimated using (a) complete flow; (b) normal flow; and (c) block matching algorithm, respectively, using (d) two successive image frames


    2.4 Motion Co-occurrence Matrix

    A commonly used motion distribution statistic in the existing temporal texture characterization techniques, and also in our proposed technique, is the Motion Co-occurrence Matrix (MCM). Let $M_t(x, y)$ denote the motion measure at coordinate $(x, y)$ in the motion frame $M_t$. With pixel level motion estimation, $(x, y)$ refers to the coordinate of the corresponding pixel, whereas with block level motion estimation, the pair refers to the 2D indices of the corresponding block. An MCM is a 2D histogram of motion measure pairs in the motion frames observed along a clique, defined by a 3D neighbour vector $\chi = (\chi_x, \chi_y, \chi_t)$ where $\chi_x, \chi_y, \chi_t \in \{\ldots, -1, 0, 1, \ldots\}$. Let $\Gamma_\chi$ denote the MCM along clique $\chi$. If $Q$ motion measures are used then $\Gamma_\chi$ can be formally defined as

    $$\Gamma_\chi(i, j) = \left|\{(x, y, t) \mid M_t(x, y) = i \,\wedge\, M_{t+\chi_t}(x + \chi_x,\; y + \chi_y) = j\}\right|, \qquad (7)$$

    where $i, j \in [0, Q - 1]$. A neighbourhood is identified by a set of cliques $\{(\chi_x, \chi_y, \chi_t)\}$. Cliques with $\chi_t = 0$ constitute the spatial neighbourhood and the cliques with $\chi_t = -1$ constitute the temporal neighbourhood, as illustrated in Fig. 4.

    Let us now consider a step by step process of computing the MCMs for an example image sequence with just five image frames. For the sake of simplicity, the resolution of these frames is assumed to be low, such that each of the resulting four motion frames has $3 \times 3$ motion measures, as shown in Fig. 5, estimated using block matching with maximum displacement of $\pm 3$ pixels. The length of the motion vector is quantized to motion measure $i$ that covers the range $\max(0, i - 0.5) \le \text{vector length} < \min(i + 0.5, 3\sqrt{2})$, where $3\sqrt{2}$ is the maximum possible vector length with $\pm 3$ maximum displacement and $i = 0, 1, \ldots, 4$. The size of the MCM is then $5 \times 5$. Figure 6a–c presents the MCMs $\Gamma_{(0,0,-1)}$, $\Gamma_{(1,0,0)}$, and $\Gamma_{(1,1,-1)}$, respectively. Note that while $\Gamma_{(0,0,-1)}$ and $\Gamma_{(1,0,0)}$ are computed from 27 and 24 possible pairs in the motion frame

    Fig. 4. Neighbourhood of a motion measure location, marked in red, in motion frame $M_t$. Spatial and temporal neighbours are marked in green and blue, respectively


    Fig. 5. An example motion frame sequence where each motion measure is the length of the corresponding motion vector rounded to the nearest integer

    Fig. 6. MCMs computed from the motion frame sequence in Fig. 5: (a) $\Gamma_{(0,0,-1)}$; (b) $\Gamma_{(1,0,0)}$; (c) $\Gamma_{(1,1,-1)}$

    sequence along the respective clique, $\Gamma_{(1,1,-1)}$ is computed from only 12 possible pairs, as some of the motion measures in a motion frame have no neighbour along the clique, as illustrated in Fig. 7.
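The MCM computation and the pair counts quoted in this example can be checked with a short sketch. The function name `motion_cooccurrence` and the array layout `frames[t, y, x]` are assumptions made here for illustration; pairs whose neighbour falls outside the motion frame sequence are skipped, which is what reduces the number of pairs for some cliques.

```python
import numpy as np

def motion_cooccurrence(frames, clique, Q):
    """Motion co-occurrence matrix of an integer array frames[t, y, x] along a clique.

    clique is the 3D neighbour vector (cx, cy, ct); Q is the number of
    motion measures, so every entry of `frames` must lie in [0, Q - 1].
    """
    cx, cy, ct = clique
    T, H, W = frames.shape
    mcm = np.zeros((Q, Q), dtype=np.int64)
    for t in range(T):
        for y in range(H):
            for x in range(W):
                t2, y2, x2 = t + ct, y + cy, x + cx
                # Count the pair only if the neighbour exists in the sequence.
                if 0 <= t2 < T and 0 <= y2 < H and 0 <= x2 < W:
                    mcm[frames[t, y, x], frames[t2, y2, x2]] += 1
    return mcm
```

For four motion frames of $3 \times 3$ measures, the entries of the matrix sum to 27 for a purely temporal clique, 24 for a horizontal spatial clique, and 12 for a diagonal spatiotemporal clique. These totals do not depend on whether the temporal offset is taken as $+1$ or $-1$.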

    2.5 Temporal Texture Database

    There are two temporal texture datasets in the literature. The most commonly used, the Szummer dataset [51], has been available since 1996 and was recently moved into R. Paget's database of temporal textures [30]. The Szummer dataset consists of a diverse set of temporal textures including boiling water, waving flags, and wind swept grass. More recently, the European FP6 Network of Excellence MUSCLE has launched a set of temporal textures known as DynTex [34]. The quality (resolution and image quality) of DynTex sequences is better than that of Szummer sequences.










    Fig. 7. All the possible 12 neighbouring pairs along clique $(1, 1, -1)$ on the motion frame sequence in Fig. 5

    3 Temporal Texture Characterization Techniques

    The existing approaches to temporal texture characterization can be classified into one of the following groups: techniques based on motion distribution statistics, techniques computing geometric properties in the spatiotemporal domain, techniques based on spatiotemporal filtering and transforms, and model-based methods that use estimated model parameters as features. The following sections of this chapter elaborate on all these categories of characterization techniques. A brief survey of temporal texture analysis techniques is available in [5].

    3.1 Motion Based Techniques

    In this section we focus on elaborating the existing motion based temporal texture characterization techniques. Any motion based temporal texture characterization process can, in general, be divided into three cascaded stages: motion estimation, feature extraction, and classification. All of the existing characterization techniques compute either normal flow or block motion at the motion estimation stage. We thus concentrate on detailing the features computed at their feature extraction stage.

    Spatial Feature-Based Technique

    Direct use of the normal flow vector field for temporal texture recognition was first realized by Nelson and Polana in their study of Spatial Feature-based Texture Recognition (SFTR) [26–28, 33]. Several statistical features are examined, based on the distribution of magnitudes and directions of normal flows,


    as shown in Fig. 8 for the Fire sequence. Figure 8a depicts the normal flow field of the Fire sequence, with its magnitude shown in Fig. 8b and its direction in Fig. 8c.

    The feature set of the SFTR technique is presented in Table 1. Non-uniformity in direction of motion is computed from a directional histogram of eight bins by adding the differences between the histogram and the uniform distribution. The inverse coefficient of variation is computed as the ratio of the mean to the standard deviation of motion magnitudes. Statistics of some flow features, namely estimates of the positive and negative divergence and the positive and negative curl of the motion field, are obtained from the normal flows.
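The first two features can be illustrated with a short sketch. It is not the SFTR implementation: the function names are hypothetical, and the use of a normalized histogram with absolute differences from the uniform distribution is an assumption made here, since the text only says that the differences are added.

```python
import numpy as np

def direction_nonuniformity(angles, bins=8):
    """Sum of absolute differences between the direction histogram and a uniform one.

    angles: flow directions in radians; the histogram is normalized to sum to one.
    """
    wrapped = np.mod(np.asarray(angles, dtype=np.float64), 2 * np.pi)
    hist, _ = np.histogram(wrapped, bins=bins, range=(0.0, 2 * np.pi))
    p = hist / hist.sum()
    return float(np.sum(np.abs(p - 1.0 / bins)))

def inverse_coefficient_of_variation(magnitudes):
    """Ratio of the mean to the standard deviation of the flow magnitudes."""
    m = np.asarray(magnitudes, dtype=np.float64)
    return float(np.mean(m) / np.std(m))
```

Under these assumptions the non-uniformity is zero when the directions are spread evenly over the eight bins and reaches 1.75 when all flow vectors share one direction.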

    Normal flow distribution features are also derived from the difference statistics. These first order difference statistics are represented by four pixel level MCMs in the spatial domain: $\Gamma_{(1,0,0)}$, $\Gamma_{(1,1,0)}$, $\Gamma_{(0,1,0)}$ and $\Gamma_{(-1,1,0)}$. For each clique used, the ratio of the number of neighbouring pixel pairs differing in direction by at most one to the number of pixel pairs differing by more than one is computed. Second order features,² namely the spatial homogeneity of the flow, are obtained from the logarithms of the resulting ratios.

    Fig. 8. (a) Normal flow field of the Fire sequence; (b) magnitude plot; and (c) direction plot. Magnitude and direction plots are drawn by mapping the magnitude and direction values into 8-bit grayscale values

    Table 1. Feature set of the SFTR technique

    Feature ID   Feature measure
    1            Non-uniformity of flow direction
    2            Inverse coefficient of variation
    3            Positive divergence
    4            Negative divergence
    5            Positive curl
    6            Negative curl
    7            Spatial homogeneity obtained from $\Gamma_{(1,0,0)}$
    8            Spatial homogeneity obtained from $\Gamma_{(1,1,0)}$
    9            Spatial homogeneity obtained from $\Gamma_{(0,1,0)}$
    10           Spatial homogeneity obtained from $\Gamma_{(-1,1,0)}$

    ² First order features are computed directly from the motion frames and $k$-th order features are computed from $(k-1)$-th order features.


    This study highlighted the computational possibility of using low level spatial motion features for temporal texture recognition. However, this work lacks any mechanism to handle temporal evolution, since the studied interactions are purely spatial [32].

    Spatiotemporal Clique Neighbourhood Techniques

    Fablet and Bouthemy published a series of studies [1, 15–19] devoted to the recognition of temporal textures and other motion patterns. They first introduced the concept of the temporal co-occurrence matrix of normal flows. Motivated by the fact that in SFTR there is no mechanism to handle temporal evolution, in their early paper [1] they used standard co-occurrence features (Table 2), namely average, variance, dirac, Angular Second Moment (ASM) and contrast, obtained from the temporal MCM $\Gamma_{(0,0,-1)}$ to discriminate between temporal textures. Note that the computed features are second order features.

    The temporal MCM in [1], however, fails to encode any spatial information, and later on the authors developed the Spatiotemporal Clique Neighbourhood (STCN) technique [16], where the interaction between a pixel and a set of spatially adjacent temporal neighbours (Fig. 4) is encoded by computing co-occurrence matrices for each clique in either the entire temporal neighbourhood of nine cliques or a temporal neighbourhood of five cliques $\{(0, 0, -1), (1, 0, -1), (0, 1, -1), (-1, 0, -1), (0, -1, -1)\}$ to incorporate some degree of spatial information. A causal spatiotemporal free energy model is used to combine these motion co-occurrence matrices, and the underlying model is optimized by maximizing the free energy using the conjugate gradient method.

    Incorporating spatial information through a set of temporal neighbours is in fact still biased towards the time domain and fails to incorporate any significant spatial motion distribution information. Moreover, the underlying model optimization is focussed on optimizing the free energy rather than the feature weights, and thus ultimately fails to maintain an appropriate feature weight distribution between the time and space domains, leaving room for improvement in classification accuracy.

    Table 2. The feature set in [1] obtained from a temporal co-occurrence matrix of normal flows

    Features                Mathematical formula
    Average                 $\mathrm{avg} = \sum_{(i,j)} i\, P_{(0,0,-1)}(i, j)$
    Variance                $\sigma^2 = \sum_{(i,j)} (i - \mathrm{avg})^2\, P_{(0,0,-1)}(i, j)$
    Dirac                   $\mathrm{dirac} = \mathrm{avg}^2 / \sigma^2$
    Angular second moment   $\mathrm{ASM} = \sum_{(i,j)} [P_{(0,0,-1)}(i, j)]^2$
    Contrast                $\mathrm{Cont} = \sum_{(i,j)} (i - j)^2\, P_{(0,0,-1)}(i, j)$

    Here $P_{(0,0,-1)}$ represents the normalized MCM $\Gamma_{(0,0,-1)}$
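The features of Table 2 follow directly from a normalized co-occurrence matrix, as the sketch below shows. The function name `cooccurrence_features` and the dictionary keys are hypothetical, and returning infinity for the dirac feature when the variance is zero is a convention chosen here, not one taken from [1].

```python
import numpy as np

def cooccurrence_features(mcm):
    """Compute the Table 2 features from an unnormalized co-occurrence matrix."""
    P = np.asarray(mcm, dtype=np.float64)
    P = P / P.sum()                      # normalized MCM
    i, j = np.indices(P.shape)           # row index i, column index j
    avg = np.sum(i * P)
    var = np.sum((i - avg) ** 2 * P)
    dirac = avg ** 2 / var if var > 0 else float("inf")
    asm = np.sum(P ** 2)
    cont = np.sum((i - j) ** 2 * P)
    return {"avg": avg, "var": var, "dirac": dirac, "asm": asm, "cont": cont}
```

For a $2 \times 2$ matrix with equal counts in every cell, the sketch gives an average of 0.5, a variance of 0.25, a dirac value of 1, an ASM of 0.25 and a contrast of 0.5.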


    Spatiotemporal Synergistic Approach

    With a view to combining the spatial and temporal aspects of temporal textures in a synergistic way, Peh and Cheong developed the Synergizing Spatial and Temporal Features (SSTF) technique [31, 32]. Aimed at providing a spatiotemporal analysis of the motion of objects, the magnitudes and directions of normal flows are mapped into grayscale intensity levels for subsequent analysis. Textures generated in this way are referred to as magnitude plots and directional plots for magnitudes and directions of the normal flow, respectively. In order to trace the motion history, the magnitude and directional plots of successive motion frames are further superimposed independently. Spatiotemporal textures (Fig. 9) extended in this way are referred to as the Extended Magnitude Plot (EMP) and Extended Directional Plot (EDP), for magnitudes and directions of the normal flows, respectively.

    The feature set of the SSTF technique is presented in Table 3. A subset of the features is computed from the extended plots by the Gray Level Co-occurrence Matrix (GLCM) and Fourier spectrum analysis. A GLCM is similar to any pixel level spatial-domain MCM, except that the former uses grayscale intensity values instead of motion measures and involves only one frame. Conventional co-occurrence features, namely inertia, shade, correlation, and mean, are computed from the average of the co-occurrence matrices corresponding to cliques $(1, 0, 0)$, $(1, 1, 0)$, $(0, 1, 0)$ and $(-1, 1, 0)$. Energy centred at 45° and 135° is computed from the Fourier spectrum. Note that the orders of the computed features are high.

    Such a representation has the advantage of improving computational efficiency, as features need to be computed from one frame only. Merging a long

    Fig. 9. Some examples of images with their extended magnitude and directional plots: (a) texture images; (b) extended magnitude plots; and (c) extended direction plots


    Table 3. List of features obtained from the extended plots, EMP and EDP, in the SSTF technique

    Analytical technique   Feature                 Mathematical formula
    GLCM                   Inertia                 $\sum_{i,j} (i - j)^2\, P_G(i, j)$
    GLCM                   Shade                   $\sum_{i,j} (i + j - m_x - m_y)^3\, P_G(i, j)$
    GLCM                   Correlation             $\frac{1}{\sigma_x \sigma_y}\left(\sum_{i,j} ij\, P_G(i, j) - m_x m_y\right)$
    Fourier                Energy centred at 45°




    Fig. 10. Grid sequences along different cliques: (a) grid sequences along clique (0,0,1); (b) grid sequences along clique (1,0,0); (c) grid sequences along clique (0,1,0)

    the double jeopardy concern and thus making the feature extraction stage of the OTSR technique realisable in real time while aiming to improve the classification accuracy at block level or even matching the same at pixel level.

    Other Motion Based Techniques

    Péteri and Chetverikov [35, 36] proposed a technique that combines normal flow features with periodicity features, in an attempt to explicitly characterize motion magnitude, directionality, and periodicity. The normal flow features used are similar to those of SFTR (e.g., divergence, curl, etc.); however, a novel feature of orientation homogeneity (Fig. 11) of the normal flow field was also introduced. In addition, two spatiotemporal periodicity features were proposed based on the maximal regularity, which is a measure of the spatial periodicity of an image texture. When applied to a temporal texture, the method evaluates the temporal variation of spatial periodicity. For each motion frame $M_t$ of a temporal texture sequence, maximal regularity is computed in a sliding window. Then the largest value is selected, corresponding to the most periodic patch within the frame. This provides a largest periodicity value for each $M_t$. The mean and the variance of the largest periodicity value are used as


    Fig. 11. Orientation homogeneity for (a) the Escalator and (b) the Smoke sequences computed from normal flow vectors. The main orientation is indicated by the triangle and its homogeneity is proportional to the base of the triangle

    features. This approach is rotation-invariant, and its periodicity-related part is affine-invariant. Although promising, the temporal regularity method still needs to be improved. Like the SFTR technique, it fails to address the problem of integrating temporal information with sufficient importance.

    The last motion based characterization technique we elaborate on was proposed by Lu et al. [24]. This study is unique in that it uses complete flow vectors. In addition, acceleration vectors are also computed. The 3D structure tensor technique (with spatiotemporal gradient) is used to obtain the complete flow vector by minimising an energy function in a neighbourhood of a pixel. To reduce the effect of the aperture problem, the eigenvectors of the tensor are calculated and combined into a measure of spatial cornerity of the pixel. This measure is used as the weight in the histograms of velocity and acceleration: the higher the confidence of velocity estimation, the larger the weight. To account for scale, a spatiotemporal Gaussian filter is applied to decompose a sequence into two spatial and two temporal resolution levels. The technique is rotation-invariant, and it provides local directionality information. However, no higher-level structural analysis (e.g., periodicity evaluation) is possible. Moreover, computing complete optical flow is highly time-consuming and the technique is unsuitable for real time recognition.

    3.2 Geometric Property Based Techniques

    Geometric property based techniques focus on properties of surfaces of motion trajectories in spatiotemporal space derived from multiple frames of an image sequence. The motion trajectory (also called the trajectory surface) is a set of surfaces swept out by the moving contour in the spatiotemporal space of temporal textures. The spatiotemporal properties of the trajectory surface are analysed by the studies in this group to characterize temporal textures. Techniques for computing such geometric properties in the spatiotemporal domain were proposed by Otsuka et al. [29] and, as a modification, by Zhong and Sclaroff [57].


    [Figure 12 labels: tangent line; tangent planes; moving direction; intersection line; intersection; trajectory surface; axis y]

    Fig. 12. (a) Tangent line on moving contour and intersection point in 2D space; (b) tangent plane and intersection line in 3D space

    In the framework in [29], trajectory surfaces are represented as a set of tangent planes of the surfaces. Tangent planes can be explained using Fig. 12. Consider two points on an object's contour in the 2D image plane (Fig. 12a). Two non-parallel tangent lines at the two points have only one point of intersection. From the spatiotemporal point of view, the contours of objects sweep out 2D surfaces, i.e., trajectory surfaces, and a tangent line sweeps out a tangent plane of the trajectory surface (Fig. 12b). The tangent plane thus characterizes the trajectory surface, and a set of spatial features regarding the shape and spatial arrangement of the contour is obtained from the tangent plane distribution of the constraint surface (Table 4).

    For temporal features, the authors focused on the distribution of image velocity obtained from the intersection lines of the tangent planes. The point of intersection of the tangent lines becomes an intersection line of the tangent planes (Fig. 12b). These intersection lines have an orientation that equals that of the motion trajectory of the object. Image velocity is approximated by detecting the orientation of the intersection lines formed by tangent planes on the trajectory surface. Temporal features are obtained from the tangent plane distribution in the image velocity direction (Table 4).

    This technique based on the tangent plane distribution is robust against noise and occlusion. It allows discontinuities in image brightness, while many gradient based methods require smoothness. The main drawback of the technique is that it requires large computational time and storage. Moreover, it is highly unlikely for temporal textures to have a dominant trajectory surface with indeterminate spatial and temporal extent. Even if such a surface exists, extraction of the surface is not an easy task, and the accuracy of the technique is questionable [32].

    Trajectory surfaces are obtained in [29] by image differencing followed by binary quantization and then identifying points in the image volume that belong to the trajectory surface. The quantization in fact introduces noise in such a process, which motivated the authors of [57] to deal with such limitations and also to consider some second order surface features, while only first order


    Table 4. List of features obtained from trajectory surfaces in [29]

    Domain     Features                                 Properties
    Spatial    Directionality of contour arrangement    Sharpness measure of direction of tangent planes.
    Spatial    Scattering of contour placement          Interval of the tangent lines of contours on image plane.
    Temporal   Uniformity of velocity components        A measure of diversity of motion vectors.
    Temporal   Flash motion ratio                       Amount of very rapid motion and the abrupt appearance and disappearance of objects.
    Temporal   Run length of trajectory                 Characterizes sparkle and patterns wherein objects repeatedly appear and disappear quite rapidly.

    [Figure 13 labels, appearing in both flowcharts: input temporal texture; 3D edge extraction]

    Fig. 13. Flowchart of the characterization technique in [57]: (a) training, and (b) classification

    surface features are used in [29]. Second order surface features are extracted from the curvatures on the trajectory isosurface generated by the temporal texture. An overview of the technique is presented in Fig. 13.

    A 3D gradient filter is applied first to extract the edges at every image point in the sequence. Based on the gradient vector, likely trajectory points are obtained in the spatiotemporal space. The original spatiotemporal image sequence volume is separated into a number of small cubic voxels. The feature vectors are extracted only on the edge voxels composing the 3D edge surfaces. These features include the tangent plane direction, edge strength, and the principal curvatures. The mean of the feature vector is computed within each small cube. A Gaussian Mixture Model (GMM) is then used to model the distribution of the features in the feature space. This enables the system to capture the variations of the feature vectors in the different spatiotemporal locations for one single temporal texture.

    The features in [57] are extracted from intensity volume data instead of binary data, thus avoiding loss of information due to quantization noise. However, application of a gradient filter for edge detection at every pixel is a time consuming process; thus the technique is unsuitable for real time applications.


    3.3 Spatiotemporal Filtering and Transform Based Techniques

    The study by Wildes and Bergen [56] presents the only method based on local spatiotemporal filtering. Qualitative classification of local motion structure into categories such as stationary, coherent, incoherent, flickering and scintillating motion is performed by the analysis of the local spatiotemporal pattern, its orientation, and energy. The correlation between the qualitative features and the character of motion is established by experimental results in [56], assuming that small and short temporal textures are considered. However, motion in different parts of a temporal texture can be different, and collectively they represent the dynamics of the texture. No method so far, including the study in [56], provides any guideline to combine the local qualitative values into a global description, or to characterize the fundamental structural properties of the entire temporal texture.

    There have been attempts [48] at video texture indexing using global spatiotemporal transforms. The emerging use of global spatiotemporal transforms indicates the necessity to characterize motion at different spatiotemporal scales. Spatiotemporal wavelets can decompose motion into local and global, according to the desired degree of detail. For example, a tree waving in the wind shows a coarse motion of the trunk, a finer motion of the branches and a still finer motion of the leaves. The periodicities of these motions are also different, resulting in energy maxima at different scales. These effects are captured by spatiotemporal wavelets.

    The Discrete Wavelet Transform (DWT) is a linear transformation tool that separates data into different frequency components and then studies each component with a resolution matched to its scale. The 3D wavelet transform (Fig. 14) applied in [48] can be viewed as an extension of spatial image texture analysis using 2D wavelets. DWT is performed firstly for all image rows and then for all columns. With 3D DWT, each temporal texture image sequence is decomposed using three iterations of a 3D wavelet filter bank. Each iteration of the wavelet filter bank results from one iteration on each spatial dimension

    Fig. 14. Classical 3D wavelet transform scheme


    and two iterations on the temporal dimension. Each iteration produces a reduction by a factor of four in the spatial and temporal dimensions. The wavelet decomposition is repeated by iterating on the low frequency sub-band of the wavelet filter bank until the desired degree of detail is produced.

    From the wavelet filter bank, two methods are used in [48] for extracting video texture descriptors. The first method computes energy across all of the frames in order to capture the degree of temporal dynamicity. This method is suitable for describing video content such as continuous flapping of wings in a flock of birds. The second method treats each frame of the sub-bands as a different dimension in the video texture descriptor in order to capture temporally evolving texture. This method is suitable for describing video content such as the rippling in a puddle that dies out over time after a stone is dropped.

    Use of spatiotemporal wavelets is highly motivated by the fact that temporal textures possess typical textural properties such as randomness, periodicity, and directionality that can be captured by wavelet transforms. Another argument in favour of wavelets is the fact that the MPEG-7 multimedia standard proposes the use of Gabor wavelet features for image texture browsing and retrieval. However, a strong argument against global spatiotemporal transforms is the difficulty of providing rotation invariance due to their direct reliance on pixel intensity values.

    3.4 Model Based Techniques

    Model-based temporal texture recognition techniques use a framework based on system identification theory, which estimates the parameters of a stable dynamic model. These techniques are suitable for both recognition and synthesis. Recognition is performed using estimated model parameters as features. Synthesis is performed by applying model parameters on a seed image frame to predict future image frames. Different model based recognition techniques are discussed in the following four sections.

    Spatiotemporal Auto-Regressive Model

    Modelling image sequences of temporal textures using the linear Spatiotemporal Auto-Regressive (STAR) model was first proposed by Szummer and Picard [50–52]. It is a 3D extension of the popular Auto-Regressive (AR) models, which are among the best models for recognition and synthesis of image textures. In this technique, every pixel is modelled as a linear combination of neighbouring pixels lagged in time and space plus a noise term as follows:

    $$I_t(x, y) = \sum_i A_i \, I_{t+\Delta t_i}(x + \Delta x_i, y + \Delta y_i) + e_t(x, y), \tag{8}$$

    where $e_t(x, y)$ is a Gaussian white noise process and the lags $\Delta x_i$, $\Delta y_i$, and $\Delta t_i$ specify the neighbourhood structure of the model. Parameters of the STAR


    model, $A_i$, are learned by minimizing the mean square prediction error. Only linear models are examined in this study, and they model the Steam, Boiling water, and River sequences convincingly. However, the model fails to capture rotational motion because it relies directly on pixel intensities, and it is also highly time consuming due to the large number of model parameters.
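
As an illustration of learning the STAR coefficients by minimizing the mean square prediction error, the following Python sketch solves the least-squares normal equations directly. It is a toy under stated assumptions, not the procedure of [50–52]: the helper names `fit_star` and `solve` are hypothetical, the neighbourhood is passed in as a list of causal lags, and pixels whose neighbours fall outside the sequence are simply skipped.

```python
def fit_star(frames, lags):
    """Least-squares estimate of the STAR coefficients A_i.

    frames[t][y][x] holds pixel intensities. lags is a list of (dt, dy, dx)
    offsets with dt < 0, so each pixel is predicted from earlier frames only.
    """
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    n = len(lags)
    xtx = [[0.0] * n for _ in range(n)]  # normal-equation matrix X^T X
    xty = [0.0] * n                      # right-hand side X^T y
    for t in range(T):
        for y in range(H):
            for x in range(W):
                nb = []
                for dt, dy, dx in lags:
                    tt, yy, xx = t + dt, y + dy, x + dx
                    if not (0 <= tt < T and 0 <= yy < H and 0 <= xx < W):
                        break  # neighbour outside the sequence: skip pixel
                    nb.append(frames[tt][yy][xx])
                else:
                    for i in range(n):
                        xty[i] += nb[i] * frames[t][y][x]
                        for j in range(n):
                            xtx[i][j] += nb[i] * nb[j]
    return solve(xtx, xty)

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]
```

The number of unknowns equals the number of lags, which is why a large spatiotemporal neighbourhood makes the estimation expensive.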

    Auto-Regressive Moving Average Process

    Criticizing the STAR model's failure to capture rotational and non-translational motions, Doretto et al. [6] proposed applying the model to a lower dimensional representation [11, 14, 44] of the image, as such representations are rotation invariant. Let $\{I_t\}_{t=1,\ldots,f}$ be a sequence of images and $c_t = I_t + \omega_t$ be a noisy version of the image, with $\omega_t$ being an Independent and Identically Distributed (IID) sequence drawn from a distribution $p_\omega(\cdot)$. A dynamic texture is then associated with an Auto-Regressive Moving Average process with Unknown input (ARMAUX)

    $$\begin{cases} a_t = A_1 a_{t-1} + \cdots + A_k a_{t-k} + B e_t, \\ c_t = \phi(a_t) + \omega_t, \end{cases}$$

    where $a_t$ is a lower dimensional representation of image $I_t$ obtained using a filter $\phi$ such that $I_t = \phi(a_t)$, and $e_t$ is a realization from a stationary distribution $p_s(\cdot)$, for some choices of matrices $A_1, \ldots, A_k$, $B$ and a given initial condition $a_0$. Assuming $\phi(a_t) = P_c a_t$, with $P_c$ being a set of principal components or a wavelet filter bank, the parameters of the model are inferred using Maximum Likelihood Estimation (MLE) as follows: given $a_0, c_1, \ldots, c_f$, find

    $$\hat{A}, \hat{B}, \hat{p}_s(\cdot) = \arg\max_{A, B, p_s(\cdot)} \log p(c_1, \ldots, c_f)$$

    such that

    $$\begin{cases} a_{t+1} = A a_t + B e_t, \\ c_t = P_c a_t + \omega_t, \end{cases} \qquad e_t \overset{\text{IID}}{\sim} p_s(\cdot).$$


    Although the ARMAUX model achieves impressive results [6–14] for recognizing and synthesizing temporal textures, the applicability of the technique to real videos is doubtful for several reasons [20]: this technique is highly time consuming due to the large number of model parameters; moreover, it is difficult to define a similarity metric in the space of dynamic models.

    Modelling Using Impulse Responses of State Variables

    In order to eliminate the difficulty of defining the similarity metric, Fujita and Nayar [21] modified the approach in [44] by using impulse responses of state variables to identify temporal textures. From the viewpoint of system identification, the impulse responses capture the inherent dynamical properties,


    and at the same time they are very efficient to compute and compare. The recognition scheme is divided into two stages.

    In the learning stage, the original image sequences are first divided into local block sequences. These blocks are labelled according to the types of textures they contain. Next, the ARMAUX model is applied to each block sequence to obtain the model parameters $A$, $B$, and $p_s(\cdot)$. Then, from the $A$ matrix for each block sequence, n-dimensional impulse responses are computed as

    $$\xi_{k+1} = A \xi_k, \tag{11}$$

    where $\xi_k \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$, and $\xi_0 = [1, 1, \ldots, 1]^T$. The impulse responses of all the blocks that belong to the same texture (same label) are used to compute a linear space, and then they are mapped to this space to obtain trajectories.
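
The recursion in (11) is cheap to evaluate, which a short Python sketch makes concrete. The function name `impulse_responses` and the choice to return the whole trajectory as a list are assumptions made for illustration; the projection onto the learned linear space is not shown.

```python
def impulse_responses(A, steps):
    """Iterate xi_{k+1} = A xi_k from xi_0 = [1, ..., 1]^T.

    A is an n x n matrix given as a list of rows. Returns the list
    [xi_0, xi_1, ..., xi_steps], each an n-dimensional list.
    """
    n = len(A)
    xi = [1.0] * n
    trajectory = [xi]
    for _ in range(steps):
        # One matrix-vector product per step.
        xi = [sum(A[i][j] * xi[j] for j in range(n)) for i in range(n)]
        trajectory.append(xi)
    return trajectory
```

Each step costs one n x n matrix-vector product, so comparing two block sequences reduces to comparing two short trajectories rather than two full dynamic models.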

    In the recognition stage, the model parameter matrix $A$ for a given novel block sequence is used to compute n-dimensional impulse responses. These impulse responses are mapped to trajectories in each of the linear spaces (corresponding to different textures) that were computed in the learning stage. Finally, to recognize the dynamic texture in a novel block sequence, a nearest point search is conducted. Despite its superiority over the basic ARMAUX model, this approach still suffers from the problem of heavy computational load due to the large number of model parameters.

    Other Model Based Techniques and Summary

    Chan and Vasconcelos [4] introduced a framework for the classification of temporal textures modelled with ARMAUX models using probabilistic kernels. The new framework combines the modelling power of ARMAUX models for temporal textures and the generalization guarantees, for classification, of the support vector machine classifier. This combination is achieved by the derivation of a new probabilistic kernel based on the KL divergence between Gauss–Markov processes. The kernels cover a large variety of video classification problems, including the cases where classes can differ in both appearance and motion and the cases where appearance is similar for all classes and only motion is discriminant. However, because it uses ARMAUX models, this approach is also computationally expensive.

    A careful scrutiny of the above-mentioned model based temporal texture characterization techniques reveals that, although impressive results are achieved using model based approaches, their application for recognition in a real world environment is limited due to heavy computational load.

    4 Conclusion

    In this chapter we have presented a set of studies on temporal texture analysis.More precisely we focused on characterization techniques of temporal textures


    and have provided a detailed categorization of the different characterization processes. In conclusion, temporal texture analysis is a novel, exciting, and developing research area. We hope that research focussed on temporal texture characterization will make a significant contribution to expanding the applicability of temporal texture analysis in many unexplored areas.


    References

    1. Bouthemy P., Fablet R.: Motion characterization from temporal cooccurrences of local motion-based measures for video indexing. International Conference on Pattern Recognition (ICPR). Volume 2, Brisbane, Australia (1998) 905–908.

    2. Brox T., Bruhn A., Papenberg N., Weickert J.: High accuracy optical flow estimation based on a theory for warping. European Conference on Computer Vision (ECCV). Volume 4, Prague, Czech Republic (2004) 25–36.

    3. Bruhn A., Weickert J., Feddern C., Kohlberger T., Schnörr C.: Real-time optic flow computation with variational methods. CAIP. Groningen, The Netherlands (2003) 222–229.

    4. Chan A. B., Vasconcelos N.: Probabilistic Kernels for the Classification of Auto-regressive Visual Processes. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). San Diego (2005).

    5. Chetverikov D., Péteri R.: A brief survey of dynamic texture description and recognition. International Conference on Computer Recognition Systems (CORES). Rydzyna, Poland (2005).

    6. Doretto G., Chiuso A., Soatto S., Wu Y. N.: Dynamic textures. International Journal of Computer Vision (IJCV). Volume 51 (2003) 91–109.

    7. Doretto G., Jones E., Soatto S.: Spatially homogeneous dynamic textures. European Conference on Computer Vision (ECCV). Volume 2, Prague, Czech Republic (2004) 591–602.

    8. Doretto G., Soatto S.: Towards plenoptic dynamic textures. International Workshop on Texture Analysis and Synthesis. Nice, France (2003) 25–30.

    9. Doretto G., Soatto S.: Editable dynamic textures. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2, Madison, Wisconsin (2003) 137–142.

    10. Doretto G., Soatto S.: Editable dynamic textures. Conference Abstracts and Applications of SIGGRAPH. San Antonio, Texas (2002) 177.

    11. Doretto G.: Dynamic texture modeling. M.S. Thesis. Computer Science Department, University of California, Los Angeles, California (2002).

    12. Doretto G., Soatto S.: Towards plenoptic dynamic textures. UCLA Computer Science Department Technical Report (#020043). Los Angeles, California (2002).

    13. Doretto G., Soatto S.: Editable dynamic textures. UCLA Computer Science Department Technical Report (#020001). Los Angeles, California (2002).

    14. Doretto G., Pundir P., Wu Y. N., Soatto S.: Dynamic textures. UCLA Computer Science Department Technical Report (#200032). Los Angeles, California (2000).

    15. Fablet R., Bouthemy P.: Motion based feature extraction and ascendant hierarchical classification for video indexing and retrieval. International Conference on Visual Information Systems (1999) 221–228.


    16. Fablet R., Bouthemy P., Pérez P.: Nonparametric motion characterization using causal probabilistic models for video indexing and retrieval. IEEE Transactions on Image Processing. Volume 11 (2002) 393–407.

    17. Fablet R., Bouthemy P.: Motion recognition using nonparametric image motion models estimated from temporal and multiscale co-occurrence statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence. Volume 25 (2003) 1619–1624.

    18. Fablet R., Bouthemy P.: Non parametric motion recognition using temporal multiscale Gibbs models. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Volume 1 (2001) 501–508.

    19. Fablet R., Bouthemy P.: Motion recognition using spatio-temporal random walks in sequence of 2D motion-related measurements. IEEE International Conference on Image Processing (ICIP). Thessalonique, Greece (2001) 652–655.

    20. Fazekas S., Chetverikov D.: Normal versus complete flow in dynamic texture recognition: a comparative study. International Workshop on Texture Analysis and Synthesis at International Conference on Computer Vision (ICCV). Beijing, China (2005).

    21. Fujita K., Nayar S. K.: Recognition of dynamic textures using impulse responses of state variables. International Workshop on Texture Analysis and Synthesis (2003) 31–36.

    22. Horn B., Schunck B.: Determining optical flow. Artificial Intelligence. Volume 17 (1981) 185–203.

    23. Li R., Zeng B., Liou M. L.: A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology. Volume 4 (1994) 438–442.

    24. Lu Z., Xie W., Pei J., Huang J. J.: Dynamic Texture Recognition by Spatio-Temporal Multiresolution Histograms. IEEE Workshop on Motion and Video Computing (WACV/MOTION). Volume 2 (2005) 241–246.

    25. Ma Y. F., Zhang H. J.: Motion texture: a new motion based video representation. International Conference on Pattern Recognition (ICPR). Volume 2 (2002) 548–551.

    26. Nelson R. C., Polana R.: Recognition of motion using temporal texture. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1992) 129–134.

    27. Nelson R. C., Polana R.: Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding. Volume 56 (1992) 78–89.

    28. Nelson R. C., Polana R.: Temporal Texture Analysis. DARPA Image Understanding Workshop (1992) 555–559.

    29. Otsuka K., Horikoshi T., Suzuki S., Fujii M.: Feature extraction of temporal texture based on spatiotemporal motion trajectory. International Conference on Pattern Recognition (ICPR). Volume 2 (1998) 1047–1051.

    30. Paget R.: Texture synthesis and analysis. http://www.vision.ee.ethz.ch/rpaget/links.htm (Last accessed in January, 2006).

    31. Peh C. H., Cheong L. F.: Exploring video content in extended spatiotemporal textures. European Workshop on Content-Based Multimedia Indexing. Toulouse, France (1999) 147–153.

    32. Peh C. H., Cheong L. F.: Synergizing spatial and temporal texture. IEEE Transactions on Image Processing (2002) 1179–1191.

    33. Polana R., Nelson R. C.: Temporal texture and activity recognition. M. Shah and R. Jain, editors. Motion-Based Recognition. Kluwer, Dordrecht (1997) 87–115.


    34. Péteri R., Huskies M.: DynTex: A comprehensive database of Dynamic Textures. http://www.cwi.nl/projects/dyntex/ (Last accessed in June 2007).

    35. Péteri R., Chetverikov D.: Dynamic texture recognition using normal flow and texture regularity. LNCS. Iberian Conference on Pattern Recognition and Image Analysis (2005).

    36. Péteri R., Chetverikov D.: Qualitative characterization of dynamic textures for video retrieval. International Conference on Computer Vision and Graphics (ICCVG). Warsaw, Poland (2004).

    37. Rahman A., Murshed M.: A temporal texture characterization technique using block-based approximated motion measure. IEEE Transactions on Circuits and Systems for Video Technology. Volume 17 (2007) 1370–1382.

    38. Rahman A., Murshed M.: Real-time temporal texture characterization using block-based motion co-occurrence statistics. IEEE International Conference on Image Processing (ICIP). Singapore (2004) 1593–1596.

    39. Rahman A., Murshed M.: A robust optical flow estimation algorithm for temporal textures. IEEE International Conference on Information Technology: Coding and Computing (ITCC). Las Vegas, USA (2005) 72–76.

    40. Rahman A., Murshed M.: A motion-based approach for temporal texture synthesis. IEEE Region 10 Conference on Convergent Technologies (TENCON). Melbourne, Australia (2005).

    41. Rahman A., Murshed M.: Feature weighting methods for abstract features applicable to motion based video indexing. IEEE International Conference on Information Technology: Coding and Computing (ITCC). Las Vegas, USA (2004).

    42. Rahman A., Murshed M.: Multi center retrieval (MCR) technique applicable to motion based video retrieval. International Conference of Computer and Information Technology (ICCIT). Dhaka, Bangladesh (2004) 347–350.

    43. Richardson I. E. G.: H.264 and MPEG-4 Video Compression. Wiley, Chichester (2003).

    44. Saisan P., Doretto G., Wu Y. N., Soatto S.: Dynamic texture recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2, Kauai, Hawaii (2001) 58–63.

    45. Schödl A., Szeliski R., Salesin D., Essa I.: Video textures. ACM SIGGRAPH (2000).

    46. Shi Y. Q., Sun H.: Image and Video Compression for Multimedia Engineering. CRC, Boca Raton (2000).

    47. Sikora T.: MPEG digital video-coding standards. IEEE Signal Processing Magazine. Volume 14 (1997) 82–100.

    48. Smith J. R., Lin C. Y., Naphade M.: Video texture indexing using spatiotemporal wavelets. IEEE International Conference on Image Processing (ICIP). Volume 2 (2002) 437–440.

    49. Sorwar G., Murshed M., Dooley L. S.: Filtering of block motion vectors for use in motion-based video indexing and retrieval. IEICE Transactions (2005).

    50. Szummer M., Picard R. W.: Temporal texture modeling. IEEE International Conference on Image Processing (ICIP). Lausanne, Switzerland (1996) 823–826.

    51. Szummer M.: Temporal texture modeling. M. Engg. Thesis. MIT (1996).

    52. Szummer M.: Temporal texture modeling. Technical report (#346). MIT (1995).

    53. Turaga D. S., Tsuhan C.: Estimation and mode decision for spatially correlated motion sequences. IEEE Transactions on Circuits and Systems for Video Technology. Volume 11 (2001).


    54. Videoinsight, http://www.video-insight.com/dvr005.htm (Last accessed in 2005).

    55. Videolink/4, http://www.compumodules.com/security/mpeg-4-encoder.shtml (Last accessed in July 2005).

    56. Wildes R. P., Bergen J. R.: Qualitative spatiotemporal analysis using an oriented energy representation. European Conference on Computer Vision (2000) 768–784.

    57. Zhong J., Sclaroff S.: Temporal texture recognition model using 3D features. Technical report. MIT Media Lab Perceptual Computing (2002).

    58. Zhu C., Chau L. P., Lin X.: Hexagon-based search pattern for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology. Volume 12 (2002) 349–355.

  • Part IV

    Computational Intelligence in Multimedia Networks and Task Scheduling

  • Real Time Tasks Scheduling Using Hybrid Genetic Algorithm

    Mitsuo Gen1 and Myungryun Yoo2

    1 Graduate School of Information, Production and Systems, Waseda University,Japan, gen@waseda.jp

    2 Department of Computer Science and Media Engineering, Musashi Institute of Technology, Japan, yoo@cs.musashi-tech.ac.jp

    Summary. The objective of scheduling soft real-time tasks is to minimize total tardiness, and scheduling these tasks on a multiprocessor system is an NP-hard problem. In this chapter, scheduling algorithms for soft real-time tasks using the genetic algorithm (GA) are introduced. GA is known to offer significant advantages over conventional heuristics by simultaneously using several search principles and heuristics.

    The objective of this study is to propose reasonable solutions for the NP-hard scheduling problem with much less difficulty than traditional mathematical methods.

    Continuous task scheduling, real-time task scheduling on a homogeneous system, and real-time task scheduling on a heterogeneous system are covered in this chapter.

    1 Introduction

    Real-time tasks can be classified into many kinds. Some real-time tasks are invoked repetitively. For example, one may wish to monitor the speed, altitude, and attitude of an aircraft every 100 ms. This sensor information will be used by periodic tasks that control the control surfaces of the aircraft, in order to maintain stability and other desired characteristics. In contrast, there are many other tasks that are aperiodic, that is, they occur only occasionally. Aperiodic tasks with a bounded interarrival time are called sporadic tasks. Real-time tasks can also be classified according to the consequences of their not being executed on time. Critical (or hard real-time) tasks are those whose timely execution is critical. If deadlines are missed, catastrophes occur. Noncritical (or soft real-time) tasks are, as the name implies, not critical to the application [1].

    The purpose of general task scheduling is fairness, which means that the computer's resources must be shared out equitably among users. However, the purpose of hard real-time task scheduling is to execute its critical control tasks by the appropriate deadlines, and the objective of scheduling soft real-time tasks is to minimize total tardiness [2].

    M. Gen and M. Yoo: Real Time Tasks Scheduling Using Hybrid Genetic Algorithm, Studies in Computational Intelligence (SCI) 96, 319–350 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008


    There are some traditional scheduling algorithms for hard real-time tasks on a uniprocessor, such as the rate monotonic (RM) and earliest deadline first (EDF) [3] scheduling algorithms. They guarantee optimality in somewhat restricted environments. Several algorithms derived from RM and EDF are used for soft real-time tasks. However, these algorithms have some drawbacks in resource utilization and in the pattern of degradation under overload. With the growth of soft real-time applications, the need for scheduling algorithms for soft real-time tasks is increasing. The rate regulating proportional share (rrPS) [4] scheduling algorithm and the modified proportional share (mPS) [5] scheduling algorithm are designed for soft real-time tasks. However, these algorithms also cannot show graceful degradation of performance under overload and are restricted to a uniprocessor system.

    Furthermore, scheduling on a multiprocessor system is an NP-hard problem. According to Yalaoui and Chu [6], the problem of scheduling tasks on identical parallel processors to minimize the total tardiness is at least NP-hard, since Du and Leung showed that the problem is NP-hard for the single processor case [7]. Lenstra et al. also showed that the problem with two processors is NP-hard [8]. Nevertheless, the exact complexity of this problem remains open for more than two processors. Consequently, various modern heuristic-based algorithms have been proposed for practical reasons.

    In this chapter, scheduling algorithms for soft real-time tasks using the genetic algorithm (GA) are introduced. GA is known to offer significant advantages over conventional heuristics by simultaneously using several search principles and heuristics. GA has already been used for scheduling problems in manufacturing systems, such as job shop scheduling, flow shop scheduling and machine scheduling.

    The objective of this study is to propose reasonable solutions for the NP-hard scheduling problem with much less difficulty than traditional mathematical methods.

    This chapter is organized into seven sections. Section 2 describes real-time task scheduling, and Sect. 3 presents the basic definition and implementation procedure of the genetic algorithm (GA). Scheduling algorithms for continuous tasks are introduced in Sect. 4. Section 5 introduces real-time task scheduling algorithms on a homogeneous multiprocessor system, and Sect. 6 introduces real-time task scheduling algorithms on a heterogeneous multiprocessor system. Finally, Sect. 7 concludes the chapter.

    2 Real-Time Task Scheduling Problem

    Real-time tasks are characterized by computational activities with timing constraints and are classified into two categories: hard real-time tasks and soft real-time tasks. In this section, the scheduling problem for real-time tasks is explained.


    2.1 Hard Real-Time Task Scheduling

    For hard real-time tasks, violation of the timing constraints of a task is not acceptable [9]. Failing to execute a task before its deadline may lead to catastrophic consequences in certain environments, e.g., patient monitoring systems, nuclear plant control, etc.

    For hard real-time tasks, the performance of a scheduling algorithm is measured by its ability to generate a feasible schedule for a set of real-time tasks. If all the tasks start after their release times and complete before their deadlines, the schedule is feasible. Typically, rate monotonic (RM) and earliest deadline first (EDF) derived scheduling algorithms are used for hard real-time tasks [10, 11]. They guarantee optimality in somewhat restricted environments.

    Rate monotonic (RM) scheduling algorithm. RM scheduling was proposed by Liu and Layland [3]. This scheduling algorithm is based on a uniprocessor static-priority preemptive scheme and is optimal among all fixed priority scheduling algorithms. It assigns static priorities to the tasks such that tasks with shorter periods get higher priorities. If the total utilization of the tasks is no greater than $n(2^{1/n} - 1)$, where n is the number of tasks to be scheduled, then the RM scheduling algorithm will schedule all the tasks to meet their respective deadlines.

    Earliest deadline first (EDF) scheduling algorithm. EDF scheduling was proposed by Liu and Layland [3]. This scheduling algorithm is based on a uniprocessor dynamic-priority preemptive scheme and is optimal among all dynamic priority scheduling algorithms. It schedules the task with the earliest deadline first. If the total utilization of the task set is no greater than 1, the task set can be feasibly scheduled on a single processor by the EDF scheduling algorithm.
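
The two utilization tests above are simple enough to check numerically. The sketch below is illustrative only: it assumes independent periodic tasks given as (computation time, period) pairs with deadlines equal to periods, and the function names are hypothetical. Note that the RM test is a sufficient condition, so a task set that fails it is not necessarily unschedulable.

```python
def utilization(tasks):
    """Total utilization of a task set given as (computation_time, period) pairs."""
    return sum(c / p for c, p in tasks)

def rm_bound(n):
    """Liu and Layland utilization bound n(2^(1/n) - 1) for n tasks."""
    return n * (2 ** (1 / n) - 1)

def rm_schedulable(tasks):
    """Sufficient (not necessary) RM test: utilization within the bound."""
    return utilization(tasks) <= rm_bound(len(tasks))

def edf_schedulable(tasks):
    """EDF test on one processor: total utilization no greater than 1."""
    return utilization(tasks) <= 1.0

light = [(1, 4), (2, 6), (1, 8)]   # utilization about 0.71
heavy = [(2, 4), (2, 6), (1, 8)]   # utilization about 0.96
print(rm_schedulable(light), edf_schedulable(light))
print(rm_schedulable(heavy), edf_schedulable(heavy))
```

The bound shrinks as n grows (it equals 1 for a single task and about 0.78 for three), which is one reason strict admission control under RM can leave the processor underused.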

    There are several RM and EDF derived algorithms for soft real-time tasks. However, these algorithms have some drawbacks in coping with continuous tasks among soft real-time tasks, related to resource utilization and the pattern of degradation under overload. Firstly, for continuous tasks it is not necessary for every instance of a repetitive task to meet its deadline; for soft real-time tasks, a slight violation of time limits is not so critical. Secondly, RM and EDF scheduling algorithms require strict admission control to prevent unpredictable behaviour when overload occurs. Strict admission control may cause low utilization of resources [4].

    2.2 Soft Real-Time Task Scheduling

    In a soft real-time system (e.g., telephone switching system, image processing, etc.), the usefulness of results produced by a task decreases over time after the deadline expires, without causing any damage to the controlled environment [1].
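
Since the objective for soft real-time tasks is total tardiness, a small Python sketch may help fix the quantity being minimized. It assumes the usual definition of tardiness, max(0, completion time minus deadline), and a single processor that runs the tasks back to back from time 0 without preemption; the function name `total_tardiness` is hypothetical.

```python
def total_tardiness(sequence):
    """Total tardiness of tasks run in the given order on one processor.

    sequence is a list of (computation_time, deadline) pairs. Each task
    starts as soon as the previous one completes.
    """
    clock, total = 0, 0
    for computation_time, deadline in sequence:
        clock += computation_time           # completion time of this task
        total += max(0, clock - deadline)   # late tasks add their lateness
    return total

print(total_tardiness([(2, 2), (3, 4), (1, 5)]))
print(total_tardiness([(3, 4), (2, 2), (1, 5)]))
```

The two calls use the same tasks in different orders and give different totals, which is exactly the degree of freedom a scheduling algorithm searches over.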


    Recently, the rate regulating proportional share (rrPS) and modified proportional share (mPS) scheduling algorithms have been designed for continuous tasks among soft real-time tasks.

    Rate regulating proportional share (rrPS) scheduling algorithm. The rrPS scheduling algorithm was proposed by Kim et al. [4]. It is based on the stride scheduler and is designed to schedule continuous tasks.

    The rate regulator, the key concept of the scheduling algorithm, prevents a task from taking more resources than its share for a given period.

    This algorithm considers the time dependency of continuous media and keeps resource allocation fair under normal scheduling conditions. Even though the rrPS scheduling algorithm has several advantages, it has some difficulties in adapting to continuous media, as follows.

    First, this algorithm does not show graceful degradation of performance under the overloaded condition.

    Second, this algorithm also has the possibility of avoidable context switching overhead.

    Modified proportional share (mPS) scheduling algorithm. The mPS scheduling algorithm was proposed by Yoo et al. [2]. This scheduling algorithm considers the ratio of resource allocation in both the normal condition and the overloaded condition.

    This scheduling algorithm performs better than the rrPS scheduling algorithm, with more graceful degradation of performance under the overloaded condition and less context switching. However, the computational burden and solution accuracy of mPS could be improved by a new algorithm based on the genetic algorithm (GA).
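
For readers unfamiliar with the stride scheduler on which rrPS is based, the following Python sketch shows plain stride scheduling only. It does not implement the rate regulator of rrPS or the allocation ratios of mPS, and the names `STRIDE1` and `stride_schedule` are illustrative assumptions.

```python
import heapq

STRIDE1 = 1 << 20  # large constant so that integer strides stay precise

def stride_schedule(tickets, quanta):
    """Allocate time quanta in proportion to each task's tickets.

    tickets maps a task name to its share. Each task has a stride inversely
    proportional to its tickets and a pass value; the task with the smallest
    pass runs next and its pass advances by its stride.
    """
    heap = []
    for name, share in sorted(tickets.items()):
        stride = STRIDE1 // share
        heap.append((stride, name, stride))  # (pass value, name, stride)
    heapq.heapify(heap)
    order = []
    for _ in range(quanta):
        pass_value, name, stride = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (pass_value + stride, name, stride))
    return order

print(stride_schedule({"A": 3, "B": 2, "C": 1}, 6))
```

Over any six consecutive quanta in this example, the tasks receive three, two and one quantum respectively, which is the proportional-share behaviour that rrPS and mPS then adapt for continuous media.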

    3 Hybrid Genetic Algorithm

    The genetic algorithm (GA) has already been used for scheduling problems in manufacturing systems, such as job shop scheduling, flow shop scheduling and machine scheduling. GA has been theoretically and empirically shown to provide a robust search in complex search spaces. Having been established as a valid approach to complex problems requiring effective search, GA is now finding more widespread application in business, scientific, and engineering circles. The reasons behind the growing number of applications are clear: the algorithm is computationally simple yet powerful in its search for improved solutions. In this section, the basic concept of GA and its extension to the multiobjective optimization problem are explained.

    3.1 Basic of Genetic Algorithm

    The general form of GA was described by Goldberg [12]. GAs are stochastic search algorithms based on the mechanism of natural selection and natural


    Fig. 1. General structure of GA

    genetics. GAs, differing from conventional search techniques, start with an initial set of random solutions called a population. Each individual in the population is called a chromosome, encoding a solution to the problem at hand. The chromosomes evolve through successive iterations, called generations. During each generation, the chromosomes are evaluated using some measure of fitness [13]. To create the next generation, new chromosomes, called offspring, are formed by either merging two chromosomes from the current generation using a crossover operator or modifying a chromosome using a mutation operator. A new generation is formed by the selection of good individuals according to their fitness values. After several generations, the algorithm converges to the best individual, which hopefully represents the optimal or a near-optimal solution to the problem. Figure 1 shows the general structure of GA [14].

    The general implementation structure of GA is described as follows. In this procedure, P(t) and C(t) are the parents and offspring in the current generation t.

    procedure 3.1: genetic algorithm
    input: problem data and GA parameters
    output: a best solution
    begin
        t ← 0;
        initialize P(t) by encoding routine;
        evaluate P(t) by decoding routine;
        while (not termination condition) do
            crossover P(t) to yield C(t);
            mutation P(t) to yield C(t);
            evaluate C(t) by decoding routine;
            select P(t+1) from P(t) and C(t);
            t ← t+1;
        end
        output a best solution;
    end


    During iteration t, the GA maintains a population P(t) of solutions. Each individual represents a solution to the problem at hand. Each solution is evaluated by computing a measure of its fitness. Some individuals undergo stochastic transformations by means of genetic operators, such as crossover and mutation, in order to form new individuals, and these new individuals, called offspring C(t), are then evaluated. A new population is formed by selecting good individuals from the parent population and the offspring population.
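
The loop of procedure 3.1 can be sketched in Python as follows. This is a minimal illustration under assumptions that the text does not fix: bit-string chromosomes, one-point crossover, bit-flip mutation, selection of the fittest individuals from P(t) and C(t), and a fixed number of generations as the termination condition. The function name and all default parameter values are hypothetical.

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=20, generations=40,
                      p_cross=0.7, p_mut=0.05, seed=0):
    """Maximize fitness over bit strings of length n_genes (n_genes >= 2)."""
    rng = random.Random(seed)
    # initialize P(t): random bit-string chromosomes
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]
    for _ in range(generations):
        offspring = []
        # crossover P(t) to yield C(t): one-point crossover on random pairs
        for _ in range(pop_size // 2):
            a, b = rng.sample(pop, 2)
            if rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)
                offspring += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # mutation P(t) to yield C(t): flip each bit with probability p_mut
        for parent in pop:
            child = [g ^ 1 if rng.random() < p_mut else g for g in parent]
            if child != parent:
                offspring.append(child)
        # select P(t+1) from P(t) and C(t): keep the fittest individuals
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(pop[0]))
    return pop[0], history

best, history = genetic_algorithm(sum, n_genes=16)
print(best, history[0], history[-1])
```

Because parents compete with their offspring for survival, the best fitness recorded in `history` never decreases from one generation to the next.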

    3.2 Multiobjective Optimization Problems

    During the last two decades, the genetic algorithms have received considerableattention regarding their potential as a novel approach to the multiobjectiveoptimization problems, known as evolutionary multiobjective optimization orgenetic multiobjective optimization.

    A multiobjective optimization problem with q objectives and m constraints is formulated as follows:

    $$\max \{z_1 = f_1(x), z_2 = f_2(x), \ldots, z_q = f_q(x)\} \tag{1}$$
    $$\text{s.t.} \quad g_i(x) \le 0, \quad i = 1, 2, \ldots, m. \tag{2}$$

    The Concept of Pareto Solution

    In most existing methods, the Pareto solutions are identified at each generation and used only to calculate fitness values or ranks for each chromosome. The Pareto solution is based on the concept of nondominance. A given point $z^0 \in Z$ is nondominated if and only if there does not exist another point $z \in Z$ such that, for the maximization case,

    $$z_k \ge z_k^0 \quad \text{for all } k, \tag{3}$$
    $$z_k > z_k^0 \quad \text{for at least one } k, \tag{4}$$

    where $k = 1, 2, \ldots, q$ indexes the objectives. If such a point z exists, $z^0$ is a dominated point in the criterion space Z.

    No mechanism is provided to guarantee that the Pareto solutions generated during the evolutionary process enter the next generation. A special pool for preserving the Pareto solutions is therefore added onto the basic structure of genetic algorithms. At each generation, the set of Pareto solutions E(t) is updated by deleting all dominated solutions and adding all newly generated Pareto solutions [15].
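
The dominance test and the update of the Pareto pool E(t) can be sketched in a few lines of Python. This is an illustration for the maximization case with objective vectors stored as tuples; the function names `dominates` and `update_pareto` are hypothetical.

```python
def dominates(z, z0):
    """True if z dominates z0: no worse in every objective, better in one."""
    return (all(a >= b for a, b in zip(z, z0))
            and any(a > b for a, b in zip(z, z0)))

def update_pareto(archive, candidates):
    """Merge new points into the pool and drop every dominated point."""
    pool = []
    for z in archive + candidates:
        if z not in pool:  # ignore exact duplicates
            pool.append(z)
    return [z for z in pool if not any(dominates(other, z) for other in pool)]

E = update_pareto([], [(1, 5), (3, 3), (2, 2), (5, 1)])
print(E)
E = update_pareto(E, [(4, 4)])
print(E)
```

In the first call (2, 2) is removed because (3, 3) dominates it; in the second call the new point (4, 4) enters the pool and displaces (3, 3).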

    The overall structure of the approach is given as follows:

    procedure 3.2: Pareto genetic algorithms
    input: problem data and GA parameters
    output: a compromised solution
    begin
        t ← 0;
        initialize P(t) by encoding routine;
        objective P(t) by decoding routine;
        create Pareto E(t);
        fitness eval(P) by decoding routine;
        while (not termination condition) do
            crossover P(t) to yield C(t);
            mutation P(t) to yield C(t);
            objective C(t) by decoding routine;
            update Pareto E(P, C);
            fitness eval(P, C) by decoding routine;
            selection P(t+1) from P(t) and C(t);
            t ← t+1;
        end
        output a compromised solution
    end

    Adaptive Weight Approach

    Gen and Cheng proposed an adaptive weights approach which utilizes someuseful information from current population to readjust weights in order toobtain a search pressure towards to positive ideal point [16].

    For the examined solutions at each generation, two extreme points are defined in the criterion space, the maximum extreme point $z^+$ and the minimum extreme point $z^-$:

    $$z^+ = \{ z_1^{\max}, z_2^{\max}, \ldots, z_q^{\max} \}, \tag{5}$$
    $$z^- = \{ z_1^{\min}, z_2^{\min}, \ldots, z_q^{\min} \}, \tag{6}$$

    where $z_k^{\max}$ and $z_k^{\min}$ are the maximal value and the minimal value of objective k in the current population. Let P denote the set of the current population. The maximal value and minimal value of each objective are defined as follows:

    $$z_k^{\max} = \max \{ f_k(x) \mid x \in P \}, \quad k = 1, 2, \ldots, q, \tag{7}$$
    $$z_k^{\min} = \min \{ f_k(x) \mid x \in P \}, \quad k = 1, 2, \ldots, q. \tag{8}$$

    The hyper parallelogram defined by the two extreme points is a minimal hyper parallelogram containing all current solutions. The two extreme points are renewed at each generation. The maximum extreme point will gradually approach the positive ideal point. The adaptive weight for objective k is calculated by the following equation:

    $$w_k = \frac{1}{z_k^{\max} - z_k^{\min}}, \quad k = 1, 2, \ldots, q. \tag{9}$$


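    For a given individual x, the weighted-sum objective function is given by the following equation:

    $$z(x) = \sum_{k=1}^{q} w_k \left( z_k - z_k^{\min} \right) \tag{10}$$
    $$\phantom{z(x)} = \sum_{k=1}^{q} \frac{z_k - z_k^{\min}}{z_k^{\max} - z_k^{\min}} \tag{11}$$
    $$\phantom{z(x)} = \sum_{k=1}^{q} \frac{f_k(x) - z_k^{\min}}{z_k^{\max} - z_k^{\min}}. \tag{12}$$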

    As the extreme points are renewed at each generation, the weights are renewed accordingly. Equation (10) is a hyperplane defined by the following extreme points in the current solutions:

    $$(z_1^{\max}, z_2^{\min}, \ldots, z_k^{\min}, \ldots, z_q^{\min}), \tag{13}$$
    $$(z_1^{\min}, z_2^{\min}, \ldots, z_k^{\max}, \ldots, z_q^{\min}), \tag{14}$$
    $$(z_1^{\min}, z_2^{\min}, \ldots, z_k^{\min}, \ldots, z_q^{\max}). \tag{15}$$
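    The adaptive weight computation of (7) to (12) can be sketched in a few lines. This is only an illustration: the function name `adaptive_weight_fitness` is hypothetical, and skipping an objective whose maximum and minimum coincide is an assumption added here to avoid division by zero, a case the equations above do not address.

```python
from typing import List, Sequence


def adaptive_weight_fitness(objs: Sequence[Sequence[float]]) -> List[float]:
    """Weighted-sum value z(x) of Eq. (12) for every individual in the population.

    objs[n][k] is the value f_k(x) of objective k for individual n.
    """
    q = len(objs[0])
    z_max = [max(row[k] for row in objs) for k in range(q)]  # Eq. (7)
    z_min = [min(row[k] for row in objs) for k in range(q)]  # Eq. (8)
    values = []
    for row in objs:
        total = 0.0
        for k in range(q):
            span = z_max[k] - z_min[k]
            if span > 0:  # assumption: ignore objectives with no spread
                total += (row[k] - z_min[k]) / span  # w_k * (z_k - z_k^min)
        values.append(total)
    return values
```

    Because the extreme points are recomputed from the population passed in, calling the function once per generation renews the weights automatically.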

    3.3 Hybrid Multiobjective Genetic Algorithm

    The convergence speed of the GA to the local optimum can be improved by adopting the acceptance probability of simulated annealing (SA). SA simulates the annealing process of metal. If the temperature is lowered carefully from a high temperature in the annealing process, the melted metal will produce a crystal at 0 K. Kirkpatrick developed an algorithm that finds the optimal solution by substituting the random movement of the solution for the fluctuation of a particle in the system during annealing, and by making the objective function value correspond to the energy of the system, which decreases (with temporary increases allowed by Boltzmann's probability) as the temperature descends [17, 18]. In the early stages of the search, newly produced strings are almost always accepted even when their fitness function value is lower than that of the current strings. In later stages, however, a string with a lower fitness function value is seldom accepted. The procedure for improving the GA by the probability of SA is written as follows:

    procedure 3.3: Improving of GA chromosome by the probability of SA
    input: parent chromosome V, proto-offspring chromosome V′, temperature T, cooling rate of SA ρ
    output: offspring chromosome V″
    begin
        r ← random[0, 1];
        ΔE ← eval(V′) − eval(V);
        if (ΔE > 0 ∨ r < exp(ΔE/T)) then
            V″ ← V′;
        else
            V″ ← V;
        T ← T × ρ;
        output offspring chromosome V″
    end

    In this procedure, V and V′ denote the parent chromosome and the proto-offspring chromosome, respectively, and V″ denotes the offspring chromosome produced by this procedure. T is the temperature and ρ is the cooling rate of SA.
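    A minimal Python sketch of this acceptance rule is given below. The function name `sa_select`, the `evaluate` callback, and the argument order are assumptions made for illustration; the rule itself follows the procedure above, with fitness being maximized.

```python
import math
import random


def sa_select(parent, child, evaluate, temperature, cooling_rate, rng=random):
    """Choose between parent and proto-offspring; return (chosen, cooled temperature)."""
    delta = evaluate(child) - evaluate(parent)  # fitness gain of the proto-offspring
    r = rng.random()                            # uniform random number in [0, 1)
    # A better child is always kept; a worse one survives with probability exp(delta / T).
    if delta > 0 or r < math.exp(delta / temperature):
        chosen = child
    else:
        chosen = parent
    return chosen, temperature * cooling_rate   # T <- T x cooling rate
```

    At a high temperature `exp(delta / temperature)` is close to 1 even for a clearly worse child, which reproduces the near-unconditional acceptance in early generations; as the temperature decays the same child is almost never accepted.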

    The procedure of the hybrid multiobjective GA combined with SA is written as follows:

    procedure 3.4: Hybrid multiobjective GA combined SA
    begin
        t ← 0;
        initialize P(t);
        objective P(t);
        create Pareto E(t);
        fitness eval(P);
        while (not termination condition) do
            crossover P(t) to yield C(t);
            mutation P(t) to yield C(t);
            objective C(t);
            update Pareto E(P, C);
            fitness eval(P, C);
            selection P(t+1) from P(t) and C(t);
            t ← t + 1;
        end
    end

    4 Continuous Task Scheduling

    The availability of inexpensive high-performance processors has made it attractive to use multiprocessor systems for real-time applications. The programming of such multiprocessor systems presents a rather formidable problem. In particular, real-time tasks must be serviced within certain preassigned deadlines dictated by the physical environment in which the multiprocessor system operates [19].

    In this section, a new scheduling algorithm for soft real-time tasks on multiprocessor systems using GA [20] is introduced. In particular, this algorithm focuses on the scheduling of continuous tasks that are periodic and nonpreemptive. The objective of this scheduling algorithm is to minimize the total tardiness.

    Some drawbacks of the RM [3] and EDF [3] derived algorithms for soft real-time tasks (i.e., low resource utilization and avoidable context switching overhead) can be fixed by the introduced algorithm. It keeps not only the advantages of the RM and EDF approaches but also the strengths of GA, such as high speed, parallel searching and high adaptability.

    4.1 Continuous Task Scheduling Problem and Mathematical Model

    The continuous task scheduling problem is defined as determining the execution schedule of continuous media tasks while minimizing the total tardiness under the following conditions:

    - All tasks are periodic.
    - All tasks are nonpreemptive.
    - Only processing requirements are significant; memory, I/O and other resource requirements are negligible.
    - All tasks are independent. This means that there are no precedence constraints.
    - The deadline of a task is equal to its period.
    - Systems are multiprocessor soft real-time systems.

    Figure 2 shows an example of scheduling soft real-time tasks on a multiprocessor system, where i is the task index, $c_i$ is the computation time of the ith task, $p_i$ is the period of the ith task and $\tau_{ij}$ is the jth execution of the ith task.

    Fig. 2. Example of continuous soft real-time tasks scheduling on multiprocessor system

    In Fig. 2, the serviced unit time of $\tau_{31}$ is 2, which is smaller than the computation time of $\tau_{31}$. This means that tardiness has occurred in $\tau_{31}$, and the tardiness is 1. The other tasks, however, meet their deadlines.

    The continuous soft real-time task scheduling problem on multiprocessor systems can be formulated as follows:

    $$\min F(s) = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \max \{ 0,\ (s_{ij} + c_i - d_{ij}) \}, \tag{16}$$
    $$\text{s.t.} \quad r_{ij} \le s_{ij} < d_{ij}, \quad \forall i, j. \tag{17}$$

    In the above equations, the notation is defined as follows:

    Indices
        m : processor index, m = 1, 2, ..., M
        i : task index, i = 1, 2, ..., N
        j : jth executed task, j = 1, 2, ..., $n_i$

    Parameters
        M : total number of processors
        N : total number of tasks
        $\tau_{ij}$ : jth executed task of ith task
        $c_i$ : computation time of ith task
        $p_i$ : period of ith task
        T : scheduled time
        $n_i$ : total number of executed times for ith task

    $$n_i = \left[ \frac{T}{p_i} \right], \quad i = 1, 2, \ldots, N, \tag{18}$$

        $r_{ij}$ : jth release time of ith task

    $$r_{ij} = \begin{cases} 0, & j = 1 \\ d_{i,j-1}, & j = 2, 3, \ldots, n_i \end{cases} \quad \forall i \tag{19, 20}$$

        $d_{ij}$ : jth deadline time of ith task

    $$d_{ij} = r_{ij} + p_i, \quad i = 1, 2, \ldots, N, \quad j = 1, 2, \ldots, n_i \tag{21}$$

    Decision variables
        $s_{ij}$ : jth start time of ith task

    Equation (16) is the objective function and means to minimize the total tardiness, as shown in Fig. 3. Equation (17) is the constraint of this problem and means that every task must start its computation between its release time and its deadline.
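    To make the model concrete, the release times, deadlines and objective value of (16) to (21) can be computed as in the sketch below. The function names are hypothetical, and the bracket in (18) is assumed here to denote rounding down to an integer.

```python
import math
from typing import List, Sequence, Tuple


def release_and_deadlines(period: float, horizon: float) -> Tuple[List[float], List[float]]:
    """Release times r_ij and deadlines d_ij of one periodic task over the scheduled time."""
    n = math.floor(horizon / period)            # Eq. (18), assuming [.] is the floor
    releases = [j * period for j in range(n)]   # Eqs. (19)-(20): r_i1 = 0, r_ij = d_i,j-1
    deadlines = [r + period for r in releases]  # Eq. (21): d_ij = r_ij + p_i
    return releases, deadlines


def total_tardiness(starts: Sequence[Sequence[float]],
                    comp: Sequence[float],
                    deadlines: Sequence[Sequence[float]]) -> float:
    """Objective F(s) of Eq. (16); starts[i][j] is s_ij, comp[i] is c_i."""
    return sum(max(0, s + comp[i] - d)
               for i, row in enumerate(starts)
               for s, d in zip(row, deadlines[i]))
```

    For example, a task with period 4 scheduled over a horizon of 10 time units is executed twice, with release times 0 and 4 and deadlines 4 and 8.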


    Fig. 3. Occurrence of tardiness

    4.2 GA Approach

    The encoding and decoding algorithms and the genetic operations, which take the tasks' periods into account, are introduced in this section.

    Encoding and Decoding

    A chromosome $V_k = \{v_l\}$, k = 1, 2, ..., popSize, represents the relation between tasks and processors, where popSize is the total number of chromosomes in each generation. The locus of the lth gene represents the task and its execution, in order, and the value of gene $v_l$ represents the number of the assigned processor. The length of a chromosome L can be calculated as follows:

    $$L = \sum_{i=1}^{N} n_i. \tag{22}$$

    Figure 4 represents the structure of a chromosome for the proposed genetic algorithm. The tasks $\tau_{11}$, $\tau_{12}$ and $\tau_{N1}$ are assigned to processors 1, 3 and 1, respectively.

    The encoding and decoding procedures are as follows:

    procedure 4.1: Period-based encoding
    step 1: Calculate L and set l = 1. L is the length of a chromosome.
    step 2: Generate a random number r from the range [1..M] for the lth gene.
    step 3: Increase l by 1 and repeat steps 2–3 until l > L.
    step 4: Output the chromosome and stop.

    procedure 4.2: Period-based decoding
    step 1: Create $S_m$ by grouping tasks with the same processor number, m = 1, 2, ..., M. $S_m$ is the scheduling task set on the mth processor.
    step 2: Sort the tasks in $S_m$ in increasing order of the release time $r_{ij}$.
    step 3: Create the schedule and calculate tardiness.
    step 4: Output the schedule set and total tardiness and stop.

    Fitness Function and Selection

    The fitness function is essentially the objective function for the problem. It provides the means of evaluating the search node and it also controls the selection process [21]. The fitness function used for this GA is based on the F(s) of the schedule. Because roulette wheel selection is used, the minimization problem is converted to a maximization problem; that is, the evaluation function used is

    $$eval(V_k) = 1 / F(s), \quad \forall k. \tag{23}$$

    Selection is the main way GA mimics evolution in natural systems. The commonly used strategy called roulette wheel selection [14, 22] has been adopted.

    Fig. 4. Structure of a chromosome

    Fig. 5. Example of the mPUX

    Genetic Operators

    The period unit crossover is proposed in this algorithm. This operator creates two new chromosomes (the offspring) by mating two chromosomes (the parents), which are combined as shown in Fig. 5. For each task, a period is selected by a random number j, and each offspring chromosome is built by exchanging the selected periods between the parents. $V'_1$ and $V'_2$ denote offspring 1 and 2, respectively. The procedure is as follows:

    procedure 4.3: Multiperiod unit crossover (mPUX)
    step 1: Generate a random number j from the range [1..$n_i$], i = 1, 2, ..., N.
    step 2: Produce offspring chromosomes by exchanging the processor number of the task $\tau_{ij}$ between the parents.
    step 3: Output the offspring chromosomes and stop.
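    A short sketch of mPUX follows. It assumes, as an illustration only, that the genes of task i occupy consecutive loci and that `executions[i]` holds $n_i$; the function name `mpux` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def mpux(parent1: Sequence[int], parent2: Sequence[int],
         executions: Sequence[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """For every task, pick one execution at random and swap its processor number."""
    child1, child2 = list(parent1), list(parent2)
    offset = 0                                   # locus of the first gene of the current task
    for n_i in executions:
        locus = offset + rng.randrange(n_i)      # random execution j of this task
        child1[locus], child2[locus] = parent2[locus], parent1[locus]
        offset += n_i
    return child1, child2
```

    Exactly one gene per task is exchanged, so each offspring keeps most of one parent's assignment while sampling the other parent once per task.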

    For another GA operator, mutation, the classical one-bit altering mutation [23] is used.


    4.3 Numerical Results

    To validate the period-based genetic algorithm (pd-GA), several numerical tests are performed. The pd-GA is compared with Oh-Wu's algorithm by Oh and Wu [24] and Monnier's algorithm by Monnier et al. [22]. Both Oh-Wu's algorithm and Monnier's algorithm use GA. However, these algorithms are designed for discrete tasks and use two-dimensional chromosomes.

    For the numerical tests, tasks are generated randomly based on the exponential distribution and the normal distribution as follows. Random tasks have been used by several researchers in the past [22].

    $c_i^E$ = random value based on exponential distribution with mean 5
    $c_i^N$ = random value based on normal distribution with mean 5
    $r^E$ = random value based on exponential distribution with mean $c_i^E$
    $r^N$ = random value based on normal distribution with mean $c_i^N$
    $p_i^E = c_i^E + r^E$
    $p_i^N = c_i^N + r^N$,

    where $c_i^E$ and $c_i^N$ are the computation times of the ith task based on the exponential distribution and the normal distribution, respectively, and $p_i^E$ and $p_i^N$ are the periods of the ith task based on the exponential distribution and the normal distribution, respectively.

    The parameters were set to 0.7 for the crossover probability ($p_C$), 0.3 for the mutation probability ($p_M$), and 30 for the population size (popSize). Crossover probabilities from 0.5 to 0.8 were tested in increments of 0.05, and mutation probabilities from 0.001 to 0.4 in increments of 0.001. Population sizes from 20 to 200 individuals were tested. Each combination of parameters was tested 20 times, and the best combination was selected by the average performance over the 20 runs. Figures 6 and 7 show the best results obtained with the best parameter combination.

    Numerical tests are performed with 100 tasks. Figures 6 and 7 compare the results of the three different scheduling algorithms. In these figures, the total tardiness of the pd-GA is smaller than that of the other algorithms.

    Fig. 6. Comparison of results (exponential)

    Fig. 7. Comparison of results (normal)

    Table 1. Numerical data (total tardiness) of Figs. 6 and 7

    Algorithm               Total number of processors
                            Exponential           Normal
                            8         15          8         17
    Oh-Wu's algorithm       86        7           103       2
    Monnier's algorithm     85        12          117       8
    pd-GA                   81        0           97        0

    Table 2. Comparison of other algorithms in terms of better, worse and equal performance (exponential)

    Algorithm               pd-GA                     Total
                            <       =       >
    Oh-Wu's algorithm       2       9       9         20
    Monnier's algorithm     1       8       11        20

    Table 3. Comparison of other algorithms in terms of better, worse and equal performance (normal)

    Algorithm               pd-GA                     Total
                            <       =       >
    Oh-Wu's algorithm       2       8       10        20
    Monnier's algorithm     0       8       12        20

    Table 1 shows the numerical data of Figs. 6 and 7. Tables 2 and 3 compare the results in terms of better, worse and equal performance. In Table 2, pd-GA performed better than Oh-Wu's algorithm in 9 cases and better than Monnier's algorithm in 11 cases. In Table 3, pd-GA performed better than Oh-Wu's algorithm in 10 cases and better than Monnier's algorithm in 12 cases.


    5 Real-Time Task Scheduling in Homogeneous Multiprocessor

    The optimal assignment of tasks to a multiprocessor is, in almost all practical cases, an NP-hard problem. Monnier et al. presented a GA implementation to solve a real-time nonpreemptive task scheduling problem [22]. The cost of a schedule is the sum of the tardiness of tasks without any successor, and the only objective is to find a zero-tardiness schedule. This approach has a weakness in that the deadline constraints of tasks with successors are not considered. Such algorithms have only one objective, such as minimizing cost, end time, or total tardiness.

    Oh and Wu presented a GA for scheduling nonpreemptive soft real-time tasks on a multiprocessor [24]. They deal with two objectives, minimizing the total tardiness and minimizing the total number of processors used. However, this algorithm does not address the conflict between the objectives (the so-called Pareto optimum), and its simulation leaves some questions open.

    In this section, a new scheduling algorithm for nonpreemptive soft real-time tasks on a multiprocessor without communication time, using a multiobjective genetic algorithm (moGA), is introduced. The objective of this scheduling algorithm is to minimize the total tardiness and the total number of processors used. For these objectives, the algorithm is combined with the adaptive weight approach (AWA), which utilizes useful information from the current population to readjust the weights for obtaining a search pressure toward a positive ideal point [23].

    5.1 Soft Real-Time Task Scheduling Problem (sr-TSP) and Mathematical Model

    The problem of scheduling the tasks of a precedence- and timing-constrained task graph on a set of homogeneous processors is considered in a way that simultaneously minimizes the number of processors used and the total tardiness under the following conditions:

    - All tasks are nonpreemptive.
    - Every processor processes only one task at a time.
    - Every task is processed on one processor at a time.
    - Only processing requirements are significant; memory, I/O, and other resource requirements are negligible.

    The problem is formulated under the following assumptions: the computation time and deadline of each task are known, and the time unit is an artificial time unit. The soft real-time task scheduling problem (sr-TSP) is formulated as follows:

    $$\min f_1 = M, \tag{24}$$
    $$\min f_2 = \sum_{i=1}^{N} \max \{ 0,\ t_i^S + c_i - d_i \}, \tag{25}$$
    $$\text{s.t.} \quad t_i^E \le t_i^S \le d_i, \quad \forall i, \tag{26}$$
    $$t_i^E \ge t_j^E + c_j, \quad \tau_j \in pre(\tau_i), \quad \forall i, \tag{27}$$
    $$1 \le M \le N. \tag{28}$$

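    In the above equations, the notation is defined as follows:

    Indices
        i, j : task index, i, j = 1, 2, ..., N
        m : processor index, m = 1, 2, ..., M

    Parameters
        G = (T, E) : task graph
        $T = \{\tau_1, \tau_2, \ldots, \tau_N\}$ : a set of N tasks
        $E = \{e_{ij}\}$, i, j = 1, 2, ..., N, i ≠ j : a set of directed edges among the tasks representing precedence
        $\tau_i$ : ith task, i = 1, 2, ..., N
        $p_m$ : mth processor, m = 1, 2, ..., M
        $c_i$ : computation time of task $\tau_i$
        $d_i$ : deadline of task $\tau_i$
        $pre^*(\tau_i)$ : set of all predecessors of task $\tau_i$
        $suc^*(\tau_i)$ : set of all successors of task $\tau_i$
        $pre(\tau_i)$ : set of immediate predecessors of task $\tau_i$
        $suc(\tau_i)$ : set of immediate successors of task $\tau_i$
        $t_i^E$ : earliest start time of ith task

    $$t_i^E = \begin{cases} 0, & \text{if } \nexists\, \tau_j : e_{ji} \in E \\ \max\limits_{\tau_j \in pre(\tau_i)} \{ t_j^E + c_j \}, & \text{otherwise} \end{cases} \quad \forall i \tag{29}$$

        $t_i^L$ : latest start time of ith task

    $$t_i^L = \begin{cases} d_i - c_i, & \text{if } \nexists\, \tau_j : e_{ij} \in E \\ \min \left\{ \min\limits_{\tau_j \in suc(\tau_i)} \{ t_j^L - c_j \},\ d_i - c_i \right\}, & \text{otherwise} \end{cases} \quad \forall i \tag{30}$$

    Decision variables
        $t_i^S$ : real start time of ith task
        M : total number of processors used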


    Equations (24) and (25) are the objective functions of this scheduling problem. Equation (24) means to minimize the total number of processors used and (25) means to minimize the total tardiness of the tasks. The constraints are given in (26) to (28). Equation (26) means that a task can be started after its earliest start time and before its deadline. Equation (27) defines the earliest start time of a task based on the precedence constraints. Equation (28) bounds the number of processors, which is at least 1 and at most the number of tasks.

    5.2 GA Approach

    Several new techniques are proposed for the encoding and decoding of the genetic string, and the genetic operations are introduced in this section.

    Encoding and Decoding

    A chromosome $V_k$, k = 1, 2, ..., popSize, represents one of all the possible mappings of all the tasks onto the processors, where popSize is the total number of chromosomes in a generation. A chromosome $V_k$ is partitioned into two parts, u(·) and v(·): u(·) gives the scheduling order and v(·) gives the allocation information. The length of each part is the total number of tasks. The scheduling order part must be a topological order with respect to the given task graph, so that it satisfies the precedence relations. The allocation information part denotes the processor to which each task is allocated.

    The encoding procedure is composed of two strategies: strategy I for u(·) and strategy II for v(·). The procedures are written as follows:

    procedure 5.1: Encoding Strategy I for sr-TSP
    input: task graph data set
    output: u(·)
    begin
        l ← 1, w ← ∅;
        while (T ≠ ∅)
            w ← w ∪ arg{τ_i | pre*(τ_i) = ∅, ∀i};
            T ← T − {τ_i}, ∀τ_i ∈ w;
            while (w ≠ ∅)
                j ← random(w);
                u(l) ← j;
                l ← l + 1;
                w ← w − {j};
                pre*(τ_i) ← pre*(τ_i) − {j}, ∀i;
            end
        end
        output u(·);
    end


    Fig. 8. Example of encoding strategy I procedure

    Figure 8 represents the sample of encoding strategy I procedure.
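    Encoding strategy I amounts to drawing a random topological order of the task graph. The sketch below follows the structure of procedure 5.1; the function name `random_topological_order`, the dictionary-of-sets input, and the cycle check are assumptions added for illustration.

```python
import random
from typing import Dict, Hashable, List, Set


def random_topological_order(preds: Dict[Hashable, Set[Hashable]],
                             rng: random.Random) -> List[Hashable]:
    """Return a random scheduling order u(.) that respects every precedence relation.

    preds maps each task to the set of its predecessors.
    """
    remaining = {task: set(p) for task, p in preds.items()}  # work on a copy
    unscheduled = set(remaining)
    order: List[Hashable] = []
    while unscheduled:
        # w: tasks whose predecessors have all been placed already
        ready = sorted(t for t in unscheduled if not remaining[t])
        if not ready:
            raise ValueError("task graph contains a cycle")
        unscheduled -= set(ready)
        while ready:
            j = ready.pop(rng.randrange(len(ready)))  # j <- random(w)
            order.append(j)                            # u(l) <- j
            for t in unscheduled:
                remaining[t].discard(j)                # remove j from predecessor sets
    return order
```

    Different seeds give different orders for tasks that are independent of each other, which is where the diversity of the u(·) part of the initial population comes from.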

    procedure 5.2: Encoding Strategy II for sr-TSP
    input: task graph data set, u(·), α, β, M
           M = M_(k−1), if 1 < k ≤ popSize;  M = |subgraph|, if k = 1
    output: v(·), M_k
    begin
        l ← 1, t_m ← 0, idle ← 0;
        while (l ≤ N)
            m ← random[1, M];
            i ← u(l);
            if (t_m < t^E_i) then
                t^S_i ← t^E_i;
                idle ← idle + (t^S_i − t_m);
            else
                t^S_i ← t_m;
            if ((d_i is not defined && t^S_i > t^L_i) || t^S_i > d_i) then
                if (idle/c_i < α) then
                    M ← M + 1;
                    m ← M;
                    idle ← idle + t^E_i;
                    t_m ← t^E_i + c_i;
                else
                    idle ← max{0, (idle − c_i)};
            else
                t_m ← t^S_i + c_i;
            v(l) ← m;
            l ← l + 1;
            idle ← idle + (max{t_m} − t_m);
        end
        while (idle / (M × max{t_m}) > β)
            M ← M − 1;
            idle ← idle − idle / M × max{t_m};
        end
        output v(·), M_k;
    end

    In the encoding strategy II procedure, α and β are boundary constants used to decide when to increase and when to decrease the number of processors, respectively.

    Figure 9 represents an example of the encoding strategy II procedure. The decoding procedure is written as follows:

    procedure 5.3: Decoding for sr-TSP
    input: task graph data set, chromosome u(·), v(·)
    output: schedule set S, total number of processors used f1, total tardiness of tasks f2
    begin
        l ← 1, t_m ← 0, ∀m, idle_m ← ∅, ∀m, f1 ← 0, f2 ← 0, S ← ∅;
        while (l ≤ N) do
            i ← u(l);
            m ← v(l);
            if (t_m = 0) then f1 ← f1 + 1;
            I_S, I_F ← find {I_S, I_F | (I_S, I_F) ∈ idle_m, I_S ≤ d_i};
            if (I_S exists && t_m > t^L_i) then insert(i);
            else start(i);
            add_idle();
            f2 ← f2 + max{0, (t^S_i + c_i − d_i)};
            S ← S ∪ {(i, m: t^S_i − t^F_i)};
            l ← l + 1;
        end
        output S, f1, f2;
    end

    Here insert(i) means to insert τ_i into an idle time slot if τ_i is computable within that slot, start(i) means to start τ_i at the maximum finish time of all tasks assigned to p_m, and add_idle() means to add an idle time slot to the idle time list if idle time has occurred. I_S is the start time of an idle duration, I_F is the end time of an idle duration, idle_m is the list of idle times, and t_m is the maximum finish time of all tasks assigned to p_m.

    Fig. 9. Example of encoding strategy II procedure

    Fig. 10. Example of decoding procedure

    Figure 10 represents an example of the decoding procedure with the chromosome in Figs. 8 and 9.

    Evaluation Function and Selection

    Multiobjective optimization problems have been receiving growing interest from researchers with various backgrounds since the early 1960s. Recently, GAs have received considerable attention as a novel approach to multiobjective optimization problems, resulting in a fresh body of research and applications known as genetic multiobjective optimization [25].

    Adaptive weight approach (AWA) [23] that utilizes some useful informa-tion from the current population to readjust weights for obtaining a searchpressure toward a positive ideal point is combined in this scheduling algorithm.

    The evaluation function is designed as follows:

    $$eval(V_k) = 1 / F(V_k), \tag{31}$$
    $$F(V_k) = \sum_{q=1}^{2} \frac{f_q(V_k) - f_q^{\min}}{f_q^{\max} - f_q^{\min}}. \tag{32}$$

    For selection, the commonly used strategy called roulette wheel selection [14, 22] has been adopted.

    GA Operators

    The one-cut crossover is used. This operator creates two new chromosomes (the offspring) by mating two chromosomes (the parents). The one-cut crossover procedure is written as follows:

    procedure 5.4: One-cut Crossover
    input: parent chromosomes u_1(·), v_1(·), u_2(·), v_2(·)
    output: proto-offspring chromosomes u′_1(·), v′_1(·), u′_2(·), v′_2(·)
    begin
        r ← random[1, N];
        u′_1(·) ← u_1(·);
        v′_1(·) ← v_1[1:r] // v_2[r+1:N];
        u′_2(·) ← u_2(·);
        v′_2(·) ← v_2[1:r] // v_1[r+1:N];
        output offspring chromosomes u′_1(·), v′_1(·), u′_2(·), v′_2(·);
    end

    where u′(·) and v′(·) are the proto-offspring chromosomes. Figure 11 represents an example of the one-cut crossover procedure.
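    In Python, the one-cut crossover reduces to two list slices, as the following sketch shows. The function name `one_cut_crossover` and the use of plain lists for u(·) and v(·) are assumptions for illustration; `//` in the procedure above is read as concatenation.

```python
import random
from typing import List, Sequence, Tuple


def one_cut_crossover(u1: Sequence[int], v1: Sequence[int],
                      u2: Sequence[int], v2: Sequence[int],
                      rng: random.Random) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Keep each scheduling order u(.) and swap the tails of the allocation parts v(.)."""
    n = len(v1)
    r = rng.randint(1, n)                      # cut point, 1 <= r <= N
    child_v1 = list(v1[:r]) + list(v2[r:])     # v1[1:r] // v2[r+1:N]
    child_v2 = list(v2[:r]) + list(v1[r:])     # v2[1:r] // v1[r+1:N]
    return list(u1), child_v1, list(u2), child_v2
```

    Because only the allocation part is recombined, the scheduling order of each offspring remains a valid topological order without any repair step.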

    For another GA operator, mutation, the classical one-bit altering mutation[21] is used.

    5.3 Validation

    To validate the proposed moGA, several numerical tests are performed. The introduced moGA is compared with Monnier-GA by Monnier et al. [22] and Oh-Wu's algorithm by Oh and Wu [24]. Numerical tests are performed with randomly generated task graphs.


    Fig. 11. Example of one-cut crossover

    Table 4. Computation results of three algorithms

    Terms                                Monnier-GA    Oh-Wu's algorithm    moGA
    # of processors M                    38            37                   32
    makespan                             149           157                  163
    computing times (msec)               497           511                  518
    average utilization of processors    0.447582      0.453392             0.567352

    The P-Method [26] is used for generating task graphs. The P-Method of generating a random task graph is based on the probabilistic construction of an adjacency matrix of a task graph. Element $a_{ij}$ of the matrix is defined as 1 if there is a precedence relation from $\tau_i$ to $\tau_j$; otherwise, $a_{ij}$ is zero. An adjacency matrix is constructed with all its lower triangular and diagonal elements set to zero. Each of the remaining upper triangular elements of the matrix is examined individually as part of a Bernoulli process with parameter e, which represents the probability of a success. For each element, when the Bernoulli trial is a success, the element is assigned a value of one; for a failure, the element is given a value of zero. The parameter e can be considered to be the sparsity of the task graph. With this method, a probability parameter of e = 1 creates a totally sequential task graph, and e = 0 creates an inherently parallel one. Values of e that lie between these two extremes generally produce task graphs that possess intermediate structures.
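    The construction just described is short enough to sketch directly. The function name `p_method` and the nested-list representation of the adjacency matrix are assumptions made here for illustration.

```python
import random
from typing import List


def p_method(num_tasks: int, e: float, rng: random.Random) -> List[List[int]]:
    """Random task graph as an adjacency matrix with a strictly upper-triangular pattern.

    a[i][j] = 1 means task i precedes task j; each candidate edge is one
    Bernoulli trial with success probability e.
    """
    adj = [[0] * num_tasks for _ in range(num_tasks)]  # lower triangle and diagonal stay 0
    for i in range(num_tasks):
        for j in range(i + 1, num_tasks):              # upper triangular elements only
            if rng.random() < e:                       # Bernoulli trial
                adj[i][j] = 1
    return adj
```

    Restricting edges to i < j guarantees that the generated graph is acyclic, so the task numbering itself is always one valid topological order.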

    The computation times and deadlines of the tasks are generated randomly based on the exponential distribution, and the parameters of the GA are the same as those of Sect. 4.

    Numerical tests are performed with 100 tasks. Table 4 compares the results of the three different scheduling algorithms. No tardiness occurs in any of the reported schedules. The computing time of the proposed moGA is slightly longer than those of the other two algorithms. However, moGA uses fewer processors than the other two algorithms, and the processor utilization achieved by moGA is more desirable than those of the others.


    Fig. 12. Pareto solution

    Figure 12 represents the Pareto solutions of moGA and those of Oh-Wu's algorithm. In this figure, the Pareto solution curve of moGA is closer to the ideal point than that of Oh-Wu's algorithm.

    6 Real-Time Task Scheduling in Heterogeneous Multiprocessor System

    In a heterogeneous multiprocessor system, task scheduling is more difficult than in a homogeneous multiprocessor system. Recently, several GA-based approaches have been proposed. Theys et al. presented a static scheduling algorithm using GA on a heterogeneous system [27]. Page et al. presented a dynamic scheduling algorithm using GA on a heterogeneous system [28]. Dhodhi et al. presented a new encoding method of GA for task scheduling on a heterogeneous system [29]. However, these algorithms are designed for general tasks without time constraints.

    In this section, a new scheduling algorithm for nonpreemptive tasks witha precedence relationship in a soft real-time heterogeneous multiprocessorsystem [30] is introduced.

    6.1 Soft Real-Time Task Scheduling Problem (sr-TSP) and Mathematical Model

    The problem of scheduling the tasks of a precedence- and timing-constrained task graph on a set of heterogeneous processors is considered in a way that minimizes the total tardiness $F(x, t^S)$. The conditions are the same as those of Sect. 5.

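    The soft real-time task scheduling problem on a heterogeneous multiprocessor system to minimize the total tardiness is formulated as follows:

    $$\min F(x, t^S) = \sum_{i=1}^{N} \max \left\{ 0,\ \sum_{m=1}^{M} (t_i^S + c_{im} - d_i) \cdot x_{im} \right\}, \tag{33}$$
    $$\text{s.t.} \quad t_i^E \le t_i^S \le d_i, \quad \forall i, \tag{34}$$
    $$t_i^E \ge t_j^E + \sum_{m=1}^{M} c_{jm} \cdot x_{jm}, \quad \tau_j \in pre(\tau_i), \quad \forall i, \tag{35}$$
    $$\sum_{m=1}^{M} x_{im} = 1, \quad \forall i, \tag{36}$$
    $$x_{im} \in \{0, 1\}, \quad \forall i, m. \tag{37}$$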

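    In the above equations, the notation is defined as follows:

    Indices
        i, j : task index, i, j = 1, 2, ..., N
        m : processor index, m = 1, 2, ..., M

    Parameters
        G = (T, E) : task graph
        $T = \{\tau_1, \tau_2, \ldots, \tau_N\}$ : a set of N tasks
        $E = \{e_{ij}\}$, i, j = 1, 2, ..., N, i ≠ j : a set of directed edges among the tasks representing the precedence relationship
        $\tau_i$ : ith task, i = 1, 2, ..., N
        $e_{ij}$ : precedence relationship between task $\tau_i$ and task $\tau_j$
        $p_m$ : the mth processor, m = 1, 2, ..., M
        $c_{im}$ : computation time of task $\tau_i$ on processor $p_m$
        $d_i$ : deadline of task $\tau_i$
        $pre^*(\tau_i)$ : set of all predecessors of task $\tau_i$
        $suc^*(\tau_i)$ : set of all successors of task $\tau_i$
        $pre(\tau_i)$ : set of immediate predecessors of task $\tau_i$
        $suc(\tau_i)$ : set of immediate successors of task $\tau_i$
        $t_i^E$ : earliest start time of task $\tau_i$

    $$t_i^E = \begin{cases} 0, & \text{if } \nexists\, \tau_j : e_{ji} \in E \\ \max\limits_{\tau_j \in pre(\tau_i)} \left\{ t_j^E + \sum\limits_{m=1}^{M} c_{jm} \cdot x_{jm} \right\}, & \text{otherwise} \end{cases} \quad \forall i \tag{38}$$

        $t_i^F$ : finish time of task $\tau_i$

    $$t_i^F = \min \left\{ t_i^S + \sum_{m=1}^{M} c_{im} \cdot x_{im},\ d_i \right\}, \quad \forall i \tag{39}$$

    Decision variables
        $t_i^S$ : real start time of task $\tau_i$

    $$x_{im} = \begin{cases} 1, & \text{if processor } p_m \text{ is selected for task } \tau_i \\ 0, & \text{otherwise.} \end{cases} \tag{40}$$

    Fig. 13. Time chart of sr-TSP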

    Equation (33) is the objective function of this scheduling problem and means to minimize the total tardiness of the tasks. The constraints are given in (34) to (37). Equation (34) means that a task can be started after its earliest start time and before its deadline. Equation (35) defines the earliest start time of a task based on the precedence constraints. Equation (36) means that every task is processed on one processor at a time. Figure 13 represents the time chart of sr-TSP.

    6.2 GA Approach

    The solution algorithm is based on the genetic algorithm (GA). Several new techniques are proposed for the encoding and decoding of the genetic string, and the genetic operations are introduced in this section.

    Encoding and Decoding

    A chromosome $V_k$, k = 1, 2, ..., popSize, represents one of all the possible mappings of all the tasks onto the processors, where popSize is the total number of chromosomes in a generation. A chromosome $V_k$ is partitioned into two parts, u(·) and v(·): u(·) gives the scheduling order and v(·) gives the allocation information. The length of each part is the total number of tasks. The scheduling order part must be a topological order with respect to the given task graph, so that it satisfies the precedence relationship. The allocation information part denotes the processor to which each task is allocated.

    The encoding procedure for sr-TSP is written as follows:

    procedure 6.1: Encoding for sr-TSP
    input: task graph data set, total number of processors M
    output: u(·), v(·)
    begin
        l ← 1, W ← ∅;
        while (T ≠ ∅)
            W ← W ∪ arg{τ_i | pre*(τ_i) = ∅, ∀i};
            T ← T − {τ_i}, ∀τ_i ∈ W;
            while (W ≠ ∅)
                j ← random(W);
                u(l) ← j;
                W ← W − {j};
                pre*(τ_i) ← pre*(τ_i) − {j}, ∀i;
                m ← random[1:M];
                v(l) ← m;
                l ← l + 1;
            end
        end
        output u(·), v(·);
    end

    Here W is a temporary working data set for tasks without predecessors. In the encoding procedure, feasible solutions are generated by respecting the precedence relationship of the tasks, and the allocated processor is selected randomly.

    The decoding procedure is written as follows:

    procedure 6.2: Decoding for sr-TSP
    input: task graph data set, chromosome u(·), v(·)
    output: schedule set S, total tardiness of tasks F
    begin
        l ← 1, F ← 0, S ← ∅;
        while (l ≤ N)
            i ← u(l);
            m ← v(l);
            if (exist suitable idle time) then
                insert(i);
            start(i);
            update_idle();
            F ← F + max{0, (t^S_i + c_im − d_i)};
            S ← S ∪ {(i, m: t^S_i − t^F_i)};
            l ← l + 1;
        end
        output S, F
    end

    Here insert(i) means to insert τ_i into an idle time slot if τ_i is computable within that slot. In start(i), the real start time $t_i^S$ and the finish time $t_i^F$ of the ith task are calculated. update_idle() means that the list of idle times is updated if a new idle duration has occurred. The objective value $F(x, t^S)$ and the schedule set S are generated through this procedure.
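    A simplified version of this decoding can be sketched as follows. It is only an illustration under explicit assumptions: idle-time insertion is omitted, a task starts once its processor is free and all of its predecessors have actually finished, and finish times are not capped at the deadline as in (39). The function name `decode_heterogeneous` and the dictionary inputs are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def decode_heterogeneous(u: Sequence[Hashable], v: Sequence[int],
                         comp: Dict[Hashable, Dict[int, float]],
                         deadline: Dict[Hashable, float],
                         preds: Dict[Hashable, Set[Hashable]]) -> Tuple[List[tuple], float]:
    """Build a schedule from (u, v) and return it with the total tardiness F.

    comp[i][m] is the computation time c_im of task i on processor m.
    """
    proc_free: Dict[int, float] = {}      # time at which each processor becomes free
    finish: Dict[Hashable, float] = {}    # finish time of each scheduled task
    schedule: List[tuple] = []
    total = 0.0
    for task, m in zip(u, v):             # u is assumed to be a topological order
        ready = max((finish[j] for j in preds.get(task, ())), default=0)
        start = max(ready, proc_free.get(m, 0))
        end = start + comp[task][m]
        finish[task] = end
        proc_free[m] = end
        total += max(0, end - deadline[task])  # tardiness term of the objective
        schedule.append((task, m, start, end))
    return schedule, total
```

    Evaluating the same scheduling order with two different allocation parts shows how strongly the processor choice affects tardiness when computation times differ between processors.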


    Evaluation Function and Selection

    The fitness function is essentially the objective function for the problem. It provides a means of evaluating the search node and it also controls the selection process [23, 25].

    The fitness function is based on the $F(x, t^S)$ of the schedule. The evaluation function used is

    $$eval(V_k) = 1 / F(x, t^S), \quad \forall k. \tag{41}$$

    Selection is the main way GA mimics evolution in natural systems: the fitter an individual is, the higher its probability of being selected. For selection, the commonly used strategy called roulette wheel selection [14, 22] has been adopted.

    GA Operators

    For crossover, the one-cut crossover in Sect. 5 is used. For another GA opera-tor, mutation, the classical one-bit altering mutation [21] is used.

    Improving of Convergence by the Probability of SA

    In this scheduling algorithm, the method for improving convergence by the probability of SA, introduced in Sect. 3.3, is incorporated.

    6.3 Validation

    To validate proposed hybrid Genetic Algorithm combined Simulated Anneal-ing (hGA+SA), several numerical tests are performed. The hGA+SA is com-pared with Monniers GA and proposed simple GA which is not combined withSA. The Monniers GA is concerned to homogeneous multiprocessor systemand the hGA+SA is designed for heterogeneous multiprocessor system. Asthere are no algorithms which are concerned to heterogeneous multiprocessorsystem, the hGA+SA is compared with Monniers GA on heterogeneous mul-tiprocessor system. The Monniers GA is proposed by Monnier, Beauvais andDeplanche [22]. This algorithm based on simple GA use linear tness normal-ization technique for evaluating chromosomes. The linear tness normalizationtechnique is eective to increase competition between similar chromosomes.However this method is limited in special problem with similar chromosomes.And in this algorithm, insertion method is not used. In other words, althoughthere is idle time, task can not be executed in idle time.

    Numerical tests are performed with randomly generated task graphs. The P-method [26] is used to generate the task graphs. The computation time and deadline of each task are generated randomly based on an exponential distribution. The parameters of GA are the same as those in Sect. 4.
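    A test instance of this kind can be sketched as below. This is an illustrative stand-in and not necessarily the exact P-method of [26]: it assumes that each possible precedence edge from a lower-numbered to a higher-numbered task is created with a fixed probability, which keeps the graph acyclic. The parameter names and mean values are hypothetical.

```python
import random


def random_task_graph(n_tasks, edge_prob, mean_time, mean_deadline, seed=None):
    """Generate a random acyclic task graph with exponential task attributes."""
    rng = random.Random(seed)
    # Edges only go from task i to task j with i < j, so no cycle can appear.
    edges = [
        (i, j)
        for i in range(n_tasks)
        for j in range(i + 1, n_tasks)
        if rng.random() < edge_prob
    ]
    # random.expovariate takes the rate, i.e. 1 / mean.
    computation_times = [rng.expovariate(1.0 / mean_time) for _ in range(n_tasks)]
    deadlines = [rng.expovariate(1.0 / mean_deadline) for _ in range(n_tasks)]
    return edges, computation_times, deadlines


if __name__ == "__main__":
    edges, times, deadlines = random_task_graph(100, 0.05, 5.0, 50.0, seed=42)
    print(len(edges), "precedence edges for", len(times), "tasks")
```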

  • Real Time Tasks Scheduling Using Hybrid Genetic Algorithm 347

    [Fig. 14 plot: curves for Monnier's GA, simple GA and hGA+SA; horizontal axis: total number of processors (0 to 14); vertical axis label not recoverable]

    Fig. 14. Comparison of the three algorithms for F(x, tS)

    Table 5. Comparison of the three algorithms

    Terms                                Monnier's GA   Simple GA   hGA+SA
    # of processors M                    13             13          12
    Makespan                             123            120         132
    Computing time (msec)                243            245         338
    Average utilization of processors    0.4334         0.4375      0.5702

    Numerical tests are performed with 100 tasks. Figure 14 compares the three algorithms in terms of F(x, tS). In Fig. 14, F(x, tS) of hGA+SA is smaller than that of each of the other algorithms.

    In Table 5, terms such as makespan, computing time and the utilization of processors are compared at the total number of processors without tardiness. The total number of processors without tardiness for hGA+SA is smaller than that of the other algorithms, and the average utilization of processors of hGA+SA is more desirable than those of the others.

    7 Conclusions

    In this chapter, several scheduling algorithms for soft real-time tasks using the genetic algorithm (GA) are introduced.

    Several algorithms derived from rate monotonic (RM) and earliest deadline first (EDF) scheduling for hard real-time tasks, as well as scheduling algorithms such as rate regulating proportional share (rrPS) and modified proportional share (mPS), have been used for soft real-time tasks. However, these algorithms have some drawbacks in resource utilization and in the pattern of degradation under overloaded situations. Furthermore, scheduling on a multiprocessor system is an NP-hard problem.

    The algorithms introduced in this chapter use GA. GA is known to offer significant advantages over conventional heuristics by simultaneously using several search principles and heuristics.

    In the hybrid GA (hGA) combined with simulated annealing (SA), the convergence of GA is improved by introducing the probability of SA as the criterion for acceptance of the new trial solution. This hybridization does not hurt the advantages of GA itself but finds more accurate solutions in the later stages of the search process.

    The multiobjective GA for soft real-time task scheduling is also introduced. Not only minimization of the total tardiness but also minimization of the total number of processors used and of the makespan are taken into consideration. However, since these objectives are in conflicting (trade-off) relations, the Pareto optimum concept is introduced into the solution process.
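    The Pareto optimum concept can be illustrated with a short sketch. Each candidate schedule is represented here by a tuple (total tardiness, number of processors, makespan), all to be minimized; the function names and the numeric values are made up for the example and are not experimental results.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(solutions):
    """Keep the solutions that are not dominated by any other solution."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


if __name__ == "__main__":
    # (total tardiness, processors used, makespan): hypothetical values.
    candidates = [(0, 5, 40), (0, 4, 55), (3, 4, 50), (0, 5, 45)]
    print(pareto_front(candidates))
    # (0, 5, 45) is dropped: (0, 5, 40) is at least as good everywhere
    # and strictly better in makespan.
```

    The remaining three candidates trade one objective against another, which is why a single best schedule cannot be named without further preferences.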

    In conclusion, from the introduced scheduling algorithms and their experimental results, we can see that scheduling with GA is a very promising approach for obtaining relatively satisfactory solutions to the soft real-time task scheduling problem, which belongs to the class of difficult NP-hard problems. All of the techniques developed for these problems in this research are useful and applicable to other scheduling problems. The research field will be extended to logistics problems and process planning problems.


    References

    1. Krishna, C. M. and G. S. Kang (1997) Real-Time System, McGraw-Hill, New York.

    2. Yoo, M. R., B. C. Ahn, D. H. Lee and H. C. Kim (2001) A New Real-Time Scheduling Algorithm for Continuous Media Tasks, Proc. of Computers and Signal Processing, pp. 26-28.

    3. Liu, C. L. and J. W. Layland (1973) Scheduling Algorithm for Multiprogramming in a Hard Real-Time Environment, Journal of the ACM, vol. 20, no. 1, pp. 46-59.

    4. Kim, M. H., H. G. Lee and J. W. Lee (1997) A Proportional-Share Scheduler for Multimedia Applications, Proc. of Multimedia Computing and Systems, pp. 484-491.

    5. Yoo, M. R. (2002) A Scheduling Algorithm for Multimedia Process, Ph.D. dissertation, University of YeoungNam (in Korean).

    6. Yalaoui, F. and C. Chu (2002) Parallel Machine Scheduling to Minimize Total Tardiness, International Journal of Production Economics, vol. 76, no. 3, pp. 265-279.

    7. Du, J. and J. Leung (1990) Minimizing Total Tardiness on One Machine is NP-hard, Mathematics of Operational Research, vol. 15, pp. 483-495.

    8. Lenstra, J. K., R. Kan and P. Brucker (1997) Complexity of Machine Scheduling Problems, Annals of Discrete Mathematics, pp. 343-362.


    9. Zhu, K., Y. Zhuang and Y. Viniotis (2001) Achieving End-to-End Delay Bounds by EDF Scheduling without Traffic Shaping, Proc. of 20th Annual Joint Conference on the IEEE Communications Societies, pp. 1493-1501.

    10. Diaz, J. L., D. F. Garcia and J. M. Lopez (2004) Minimum and Maximum Utilization Bounds for Multiprocessor Rate Monotonic Scheduling, IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 7, pp. 642-653.

    11. Bernat, G., A. Burns and A. Liamosi (2001) Weakly Hard Real-Time Systems, Transactions on Computer Systems, vol. 50, no. 4, pp. 308-321.

    12. Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley.

    13. Fogel, D. and A. Ghozeil (1996) Using fitness distributions to design more efficient evolutionary computations, Fogel, D., editor, Proc. of the Third IEEE conference on Evolutionary Computation, IEEE Press, Nagoya, Japan, pp. 11-19.

    14. Gen, M. and R. Cheng (1997) Genetic Algorithms & Engineering Design, John Wiley & Sons.

    15. Xu, H. and G. Vukovich (1998) Fuzzy Evolutionary Algorithms and Automatic Robot Trajectory Generation, Fogel, D., editor, Proc. of the First IEEE Conference on Evolutionary Computation, IEEE Press, Piscataway, NJ, pp. 595-600.

    16. Ishii, H., H. Shiode, and T. Murata (1998) A Multiobjective Genetic Local Search Algorithm and Its Application to Flowshop Scheduling, IEEE Trans. on Systems, Man and Cybernetics, vol. 28, no. 3, pp. 392-403.

    17. Kim, H. C., Y. Hayashi and K. Nara (1997) An Algorithm for Thermal Unit Maintenance Scheduling through combined use of GA, SA and TS, IEEE Transactions on Power Systems, vol. 12, no. 1, pp. 329-335.

    18. Kirkpatrick, S., C. D. Gelatt and M. P. Vecchi (1983) Optimization by Simulated Annealing, Science, vol. 220, no. 4598, pp. 671-680.

    19. Denouzos, M. L. and Mok, A. K. (1989) Multiprocessor on-line scheduling of hard-real-time tasks, IEEE Transactions on Software Engineering, vol. 15, no. 12, pp. 392-399.

    20. Yoo, Myungryun and M. Gen (2005) Multimedia Tasks Scheduling using Genetic Algorithm, Asia Pacific Management Review, vol. 10, no. 6, pp. 373-380.

    21. Jackson, L. E. and G. N. Rouskas (2003) Optimal Quantization of Periodic Task Requests on Multiple Identical Processors, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 8, pp. 795-806.

    22. Monnier, Y., J. P. Beauvais and A. M. Deplanche (1998) A Genetic Algorithm for Scheduling Tasks in a Real-Time Distributed System, Proc. of 24th Euromicro Conference, pp. 708-714.

    23. Gen, M. and R. Cheng (2000) Genetic Algorithms & Engineering Optimization, John Wiley & Sons.

    24. Oh, J. and C. Wu (2004) Genetic-algorithm-based Real-time Task Scheduling with Multiple Goals, Journal of Systems and Software, vol. 71, no. 3, pp. 245-258.

    25. Deb, K. (2001) Multi-objective Optimization using Evolutionary Algorithms, John Wiley & Sons.

    26. Al-Sharaeh, S. and B. E. Wells (1996) A Comparison of Heuristics for List Schedules using The Box-method and P-method for Random Digraph Generation, Proc. of the 28th Southeastern Symposium on System Theory, pp. 467-471.


    27. Theys, M. D., T. D. Braun, H. J. Siegal, A. A. Maciejewski and Y. K. Kwok (2001) Mapping tasks onto distributed heterogeneous computing systems using a genetic algorithm approach, Zomaya, A. Y., F. Ercal and S. Olariu, editors, Solutions to Parallel and Distributed Computing Problems, chapter 6, pp. 135-178, Wiley, New York.

    28. Page, A. J. and T. J. Naughton (2005) Dynamic task scheduling using genetic algorithm for heterogeneous distributed computing, Proc. of 19th IEEE International Parallel and Distributed Processing Symposium, 189.1.

    29. Dhodhi, M. K., I. Ahmad, A. Yatama and I. Ahmad (2002) An integrated technique for task matching and scheduling onto distributed heterogeneous computing systems, Journal of Parallel and Distributed Computing, vol. 62, pp. 1338-1361.

    30. Yoo, Myungryun and M. Gen (2005) Multiobjective genetic algorithm for real-time task scheduling in heterogeneous multiprocessors system, 6th International Symposium on Advanced Intelligent Systems, Yeosu in Korea, pp. 838-843.

  • Computational Intelligence in Visual Sensor Networks: Improving Video Processing Systems

    Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. García, and Jose M. Molina

    Applied Artificial Intelligence Group, Universidad Carlos III de Madrid, Avda. Universidad Carlos III, 22. 28270 Colmenarejo, Madrid, Spain, mpatrici@inf.uc3m.es, fcastane@inf.uc3m.es, opconcha@inf.uc3m.es, jgherrer@inf.uc3m.es, aberlan@ia.uc3m.es, molina@ia.uc3m.es

    Summary. In this chapter we describe several approaches to developing video analysis and segmentation systems based on visual sensor networks using computational intelligence. We review the types of problems and algorithms used, and how computational intelligence paradigms can help to build competitive solutions. Computational intelligence is used here from an engineering point of view: the designer is provided with tools which can help in designing or refining solutions to cope with real-world problems. This implies having a priori knowledge of the domain (always imprecise and incomplete) to be reflected in the design, but without accurate mathematical models to apply. The methods used operate at a higher level of abstraction to include the domain knowledge, usually complemented with sets of pre-compiled examples and evaluation metrics to carry out an inductive generalization process.

    1 Introduction

    Processing multimedia information is becoming more and more important in video surveillance and sensor networks [1]. The particular conditions under which this type of system operates require quite specialized solutions. The tracking algorithms used to segment multimedia and video data must handle complex situations such as object interactions and occlusions, sudden manoeuvres, etc., and they are usually the most flexible and parametrical part of vision systems. Practically all systems exploit external information to model the scene, object behavior, context, etc. The configuration aims at a trade-off between computational resources and system performance.

    Multimedia surveillance systems are a new generation of architectural systems where many different media streams concur to provide an automatic analysis of the controlled environment and a real-time interpretation of the scene [2]. Among all the multimedia sources (images, audio, sensor signals, textual data, etc.), video is the most powerful media stream for gathering surveillance information.

    M.A. Patricio et al.: Computational Intelligence in Visual Sensor Networks: Improving Video Processing Systems, Studies in Computational Intelligence (SCI) 96, 351-377 (2008)

    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

  • 352 M.A. Patricio et al.

    Current video surveillance systems [3] are conceived to deal with a large number of cameras. Extracting useful data from a visual surveillance system can become an immense task when it stretches to a sizable number of cameras. Consequently, content-based retrieval of video data turns out to be a challenging and important problem. In this chapter, we present how computational intelligence paradigms are applied to infer semantic information automatically from raw video data. More precisely, we show the application of computational intelligence techniques, within the framework of visual sensor networks, to the improvement of the video procedures, from detection to the tracking process. In the next sections, we show real developments of computational intelligence in video surveillance.

    2 Related Works

    2.1 Visual Sensor Networks

    Visual sensor networks [3] are related to spatially distributed multi-sensor environments which raise interesting challenges for surveillance. These challenges concern data fusion techniques to deal with the sharing of information gathered from different types of sensors [4], communication aspects [5], security of communications [5] and sensor management. These new systems are called third-generation surveillance systems, which would provide highly automated information, as well as alarm and emergency management. PRISMATICA [6] is an example of these systems. It consists of a network of intelligent devices that process sensor inputs. These devices send and receive messages to/from a central server module. The server module co-ordinates device activity, archives/retrieves data and provides the interface with a human operator. The design of a surveillance system with no server to avoid this centralization is reported in [4]. As part of the VSAM project, [4] presents a multi-camera surveillance system based on the same idea as [7]: the creation of a network of smart sensors that are independent and autonomous vision modules. The surveillance systems described above take advantage of progress in low-cost high-performance processors and multimedia communications. However, they do not account for the possibility of fusing information from neighboring cameras.

    Third generation surveillance systems [8] is the term usually used in the literature to refer to systems conceived to deal with a large number of cameras, a geographical spread of resources, many monitoring points, and to mirror the hierarchical and distributed nature of the human process of surveillance. From an image processing point of view, they are based on the distribution of processing capacities over the network and the use of embedded signal processing devices to give the advantages of scalability and potential robustness of distributed systems.

  • Computational Intelligence in Visual Sensor Networks 353

    A multiagent visual sensor network is a distributed network of several intelligent software agents with visual capabilities [3]. An intelligent software agent is a computational process that has several characteristics [9]: (1) reactivity (allowing agents to perceive and respond to a changing environment), (2) social ability (by which agents interact with other agents) and (3) proactiveness (through which agents behave in a goal-directed fashion). Wooldridge and Jennings also give a strong notion of agent which also uses mental components such as belief, desire and intention (BDI).

    The main goals expected from a generic third generation vision surveillance application, based on end-user requirements, are to provide good scene understanding, oriented to attract the attention of the human operator in real time.

    2.2 Intelligent Visual Tracking Systems

    An intelligent visual tracking system (IVTS) tracks all the targets moving within its local field of view. The IVTS implementation is arranged in a pipeline structure of several modules, as shown in Fig. 1; it directly interfaces with the image stream coming from a camera and extracts the track information of the mobile objects in the current frame. The interface between adjacent modules is symbolic data, and it is set up so that different algorithms are interchangeable for each module.

    The main modules of the IVTS implementation are: (1) a detector process of moving objects; (2) an association process; (3) a prediction process; (4) a blob¹ deleter; (5) a track updater.

    The detector process (1) of moving objects must give a list of the blobs that are found in a frame; this list must contain information about the position and size of each blob. Within the tracking process, and continuing with the list of blobs obtained by the previous module, the association process (2) solves the problem of blob-to-track multi-assignment, where several blobs (or none) may be assigned to the same track and, simultaneously, several tracks could overlap and share common blobs. So, the association problem to solve is the decision of the most proper grouping of blobs and their assignment to each track for each frame processed. The prediction process (3) uses the association made by the tracking process and predicts where each track will move to during the next frame; this prediction will be used by the tracking process in order to make the association. The blob deleter (4) module eliminates those blobs that have not been associated with any track, since they are considered to be noise. The last main module, the track updater (5), updates the tracks obtained in the last frame with the information obtained from the previous modules for this frame.

    Fig. 1. Intelligent visual tracking system implementation

    ¹ Detected pixels which form compact regions.
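    The interplay of modules (2) to (5) can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions, not the implementation of Fig. 1: the detector output (1) is assumed to be given as a list of blobs, prediction is constant-velocity, and association uses a circular distance gate. All class, function and parameter names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Blob:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Track:
    x: float
    y: float
    vx: float = 0.0
    vy: float = 0.0


def predict(track):
    """(3) Constant-velocity prediction of the track position (assumption)."""
    return track.x + track.vx, track.y + track.vy


def associate(blobs, tracks, gate):
    """(2) Give each blob to the closest predicted track inside the gate.

    Several blobs may end up in the same track; blobs outside every gate
    are returned separately.
    """
    groups = [[] for _ in tracks]
    unassociated = []
    for blob in blobs:
        best_index, best_distance = None, gate
        for index, track in enumerate(tracks):
            px, py = predict(track)
            distance = math.hypot(blob.x - px, blob.y - py)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is None:
            unassociated.append(blob)
        else:
            groups[best_index].append(blob)
    return groups, unassociated


def update(track, blobs):
    """(5) Move the track to the centroid of its blobs, or coast on prediction."""
    if blobs:
        cx = sum(b.x for b in blobs) / len(blobs)
        cy = sum(b.y for b in blobs) / len(blobs)
        track.vx, track.vy = cx - track.x, cy - track.y
        track.x, track.y = cx, cy
    else:
        track.x, track.y = predict(track)


def process_frame(blobs, tracks, gate=20.0):
    """Run association, blob deletion and track update for one frame."""
    groups, noise = associate(blobs, tracks, gate)
    # (4) Blob deleter: unassociated blobs are not used for any update.
    for track, group in zip(tracks, groups):
        update(track, group)
    return noise
```

    Because each stage only exchanges lists of blobs and tracks, any of the functions can be replaced by a different algorithm without touching the others, which mirrors the interchangeable-module design described above.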

    A key requirement for an IVTS implementation is robust movement segmentation. This has been the objective of many research works [10]. Although plenty of techniques have been applied to video segmentation, it is still a difficult and unresolved problem in the general case and under complex situations. The basic aspects to address are the extraction of moving objects from the background and the precise separation of individual objects when their images appear close to each other [11].

    2.3 Data Association Process

    Tracking multiple visual targets in the presence of occlusions and a varying number of targets is a challenging problem in IVTS. A primary task of the multi-target tracking (MTT) system is data association, namely, partitioning the measurements into disjoint sets, each generated by a single source (target or clutter). Target splitting and merging distinguish video data processing from other sensor data sources, so the data association (or correspondence) task demands powerful and specific techniques.

    Although plenty of techniques have been researched for video segmentation, it is still a difficult and unresolved problem in the general case with real situations. Detected pixels are first connected to form compact regions referred to as blobs. The tracker should re-connect these blobs to segment all targets from the background and track their motion, applying association and filtering processes [12]. Usual problems are clutter (false objects such as smoke, waving trees, etc.), occlusions, shadows, splits of objects into regions, and mergings of different objects due to overlaps.
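    The step that connects detected pixels into blobs can be sketched with a breadth-first connected-component search over a binary detection mask. The 4-connectivity, the bounding-box output format and the `min_area` filter are assumptions of this example rather than details taken from the systems discussed.

```python
from collections import deque


def find_blobs(mask, min_area=1):
    """Group 4-connected foreground pixels of a binary mask into blobs.

    mask is a list of rows containing 0 or 1. Each blob is returned as
    (x_min, y_min, x_max, y_max, area).
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_area:
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                blobs.append((min(xs), min(ys), max(xs), max(ys), len(pixels)))
    return blobs
```

    Raising `min_area` discards small regions that are likely to be noise, at the risk of also discarding fragments of real targets.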

    Figure 2 illustrates an example, where two targets (aircraft moving on parallel airport taxiways) are the source of several blobs separated from the background. The blobs from each aircraft should be grouped to track the individual trajectories, even during the partial occlusion, and false blobs corresponding to smoke should be wiped off.

    The problem to solve, known as data association [13], is the decision of the most proper grouping of blobs and their assignment to tracks for each frame processed. The performance of the final system critically depends on the trade-off considered in data association. Next we briefly formulate this problem, describe the existing approaches, and describe our proposals to exploit contextual information of visual trackers using different CI paradigms such as fuzzy rules and generalization through evolutionary computation of heuristic functions.


    Fig. 2. Blob-to-track association problem

    Although visual tracking has been extensively studied, most works assume that the motion correspondence problem is solved during image segmentation or is trivial, so that a simple strategy such as nearest neighbor (NN) is applied. The problem of object splitting and merging has recently received wider attention from the machine-vision community, from different points of view. Conventional data association systems, such as NN, MHT [14] or S-D [15], treat the problem as minimizing a global cost function in a combinatorial space. As an alternative, all-neighbors approaches such as Joint Probabilistic Data Association or PMHT [16] have also been applied to this problem: all blobs gated with each track are used to update it, which also requires less memory and computation.
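    The difference between a purely local nearest-neighbor choice and the minimization of a global cost can be seen in a toy sketch. The exhaustive search below is only meant to show the combinatorial formulation on a handful of points; it is not how MHT or S-D assignment are implemented, and the point coordinates are invented.

```python
import math
from itertools import permutations


def nearest_neighbor(tracks, blobs):
    """Each track independently takes its closest blob (conflicts possible)."""
    return [
        min(range(len(blobs)), key=lambda j: math.dist(track, blobs[j]))
        for track in tracks
    ]


def global_assignment(tracks, blobs):
    """One-to-one assignment minimizing the total track-to-blob distance.

    Brute force over permutations; assumes len(blobs) >= len(tracks) and
    is only feasible for very small problems.
    """
    best = min(
        permutations(range(len(blobs)), len(tracks)),
        key=lambda perm: sum(math.dist(t, blobs[j]) for t, j in zip(tracks, perm)),
    )
    return list(best)


if __name__ == "__main__":
    tracks = [(0.0, 0.0), (2.0, 0.0)]
    blobs = [(1.2, 0.0), (4.0, 0.0)]
    print(nearest_neighbor(tracks, blobs))   # both tracks claim blob 0
    print(global_assignment(tracks, blobs))  # each track gets its own blob
```

    The number of permutations grows factorially with the number of targets, which is one reason why gating and approximate methods are used in practice.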


    Some proposals apply lower-level image information to address the problem. For instance, the W4 system [17] is based on low-level correlation operators to resolve occlusions and merging in people-group tracking.

    3 Multiagent Visual Sensor Network: Overview

    In [3], the authors developed a novel multiagent framework for deliberative camera-agents forming visual sensor networks. In this framework, each camera is represented and managed by an individual software agent, called a surveillance-sensor agent (SSA). SSAs are located at the same level (sensor layer), which allows coordinated execution among SSAs. Each SSA knows only part of the information (partial knowledge due to its limited field of view), and has to make decisions with this limitation. Furthermore, each SSA tracks all the targets moving within its local field of view. The distributed nature of this type of system supports the proactivity of the SSAs, and the cooperation required among these agents to accomplish surveillance justifies the sociability of surveillance-sensor agents. The details of the multiagent visual sensor network architecture are described formally and more extensively in [1, 3, 18].

    3.1 Cooperative Surveillance Multiagent Architecture

    In order to provide a good understanding of the environment, each process involved in the surveillance system (in our case, each agent) has to reason about the actions to take at each moment. This level of reasoning is not possible in low-level image processing algorithms. Therefore, a multiagent system is necessary in order to provide the reasoning capabilities.

    Using a multiagent architecture for video surveillance provides several advantages. First of all, the loosely coupled nature of the multiagent architecture allows more flexibility for the communication processes. Also, the ability to assign responsibilities to each agent is ideal for solving complex tasks in a surveillance system. These complex tasks involve the use of mechanisms such as coordination, dynamic configuration and cooperation that are widely studied in the multiagent community.

    Intelligence in artificial vision systems, such as our proposed framework [1, 3, 18], operates at different logical levels. At the first level, the process of scene interpretation from each sensor is carried out by a surveillance-sensor agent. At the second level, the information parsed by each individual surveillance-sensor agent is collected and fused. The fusion process is carried out by a fusion agent in the multiagent surveillance system. Finally, the surveillance process is distributed over several surveillance-sensor agents, according to their individual ability to contribute their local information to a desired global solution.


    Fig. 3. CS-MAS architecture

    A distributed solution has several advantages with respect to a centralized solution from the points of view of scalability and fault tolerance. In our approach, distribution is obtained from a multiagent system, where each camera is represented and managed by an individual autonomous software agent (surveillance-sensor agent). Each surveillance-sensor agent knows only part of the information (partial knowledge), and has to make decisions with this limitation. The distributed nature of this type of system supports the proactivity of surveillance-sensor agents; additionally, the cooperation required among them to accomplish the surveillance task justifies the sociability of surveillance-sensor agents. The intelligence produced by the symbolic internal model of surveillance-sensor agents is based on a deliberation about the state of the outside world (including its past evolution), and the actions that may take place in the future.

    Figure 3 shows the multiagent architecture. As can be seen, there are six different types of agents:

    1. Surveillance-sensor agent. It tracks all the targets moving within its local field of view and sends data to its related fusion agent. It is coordinated with other surveillance-sensor agents in order to improve surveillance quality.

    2. Fusion agent. This agent integrates the information sent from the associated surveillance-sensor agents. It analyzes the situation in order to manage the resources and to coordinate the associated surveillance-sensor agents during the fusion stage.

    3. Record agent. This type of agent belongs to a specific camera with recording features only.

    4. Planning agent. This agent provides a scene overview. It makes inferences on the targets and the situation.

    5. Context agent. This type of agent provides context-aware information about the scene.

    6. Interface agent. The input/output agent interface of the multiagent system. It provides a graphical user interface to the end user.

    We use the Belief-Desire-Intention (BDI) model to implement the deliberation and reasoning from the images captured by the camera. The BDI model is one of the best known and most studied models of practical reasoning [19]. It is based on a philosophical model of human practical reasoning, originally developed by Bratman [20]. It reduces the explanation for complex human behavior to a motivational stance [21]. This means that the causes for actions are always related to human desires, ignoring other facets of human motivation to act. Finally, it also uses, in a consistent way, psychological concepts that closely correspond to the terms that humans often use to explain their behavior. The foundation for most implemented BDI systems is the abstract interpreter proposed by Rao and Georgeff [19]. Although many ad hoc implementations of this interpreter have been applied to several domains, the recent release of JADEX [22] is gaining increasing acceptance. JADEX facilitates FIPA-ACL communications between agents, and it is widely used to implement intelligent software agents.

    The sociability of agents presumes some kind of communication between agents. The most accepted agent communication schemes are those based on Speech Act Theory (for instance, KQML and FIPA-ACL) [23]. We use FIPA-ACL as the communication language between the agents.

    The internal technical aspects of the fusion agent can be consulted in [24], where a coordination approach is presented. Tracking results from [24] for three different cameras from the open computer vision data set PETS2006 are shown in Fig. 4.

    In the next sections, we present the application of computational intelligence paradigms to the improvement of the surveillance-sensor agent procedures.

    Fig. 4. Camera 1, camera 2, and camera 3 local tracking


    4 Optimizing the Whole Tracking System by Means of Evolutionary Techniques

    4.1 General Optimization Problem

    As can be seen in Fig. 1, the surveillance-sensor agent is made up of several interconnected blocks that can be grouped into five general subsystems: background estimation, detector, segmentation module, association block and tracking system (Fig. 5). Moreover, each of these blocks is regulated by a set of parameters.

    Good performance of all the blocks is important to obtain good final results. Indeed, errors made at the lower levels are very difficult to correct at higher levels [25]. That is, if an object is not detected at the first stages of the system, it cannot be tracked and classified at the higher levels.

    Hence, each of the blocks is regulated by a set of parameters that must be properly adjusted for the good performance of the whole system. For example, the detector threshold fixes the level at which the detector considers a pixel as movement, background variation or just background.
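    The effect of such a parameter can be illustrated with a deliberately simplified detector. The sketch assumes grey-level images stored as lists of rows and a single threshold on the absolute difference to a background image; the actual detector may distinguish more classes and use a different background model.

```python
def detect_moving_pixels(frame, background, threshold):
    """Binary mask: 1 where |frame - background| exceeds the threshold."""
    return [
        [1 if abs(pixel - ref) > threshold else 0 for pixel, ref in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


if __name__ == "__main__":
    background = [[10, 10], [10, 10]]
    frame = [[10, 200], [12, 11]]
    print(detect_moving_pixels(frame, background, threshold=5))  # [[0, 1], [0, 0]]
    print(detect_moving_pixels(frame, background, threshold=1))  # [[0, 1], [1, 0]]
```

    With the lower threshold a small intensity change is already reported as movement, which shows why a badly chosen value propagates false or missed detections to all later stages.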

    Thus, when adjusting these control parameters, we must have a criterion to measure the good or bad performance of the system. The core of this process is the evaluation of surveillance results, defining a metric to measure the quality of a proposed configuration [26].

    Fig. 5. Information levels in processing chain. Results of detector, segmentation and tracking module


    Moreover, the visual system must provide the best solution for the most general set of examples. Therefore, the system needs to work properly under different lighting or weather conditions and perform well for various scenes in case we have a single movable camera.

    As a result, the set of examples used to design and train the system must produce a general solution of the tracking system. A small set of examples can lead to parameters over-fitted to exactly these specific scenarios, with the consequent loss of generality. On the contrary, randomly selected examples might produce disorientation in the search. Thus, the set of data that optimizes the search for suitable parameters is defined as the ideal trainer [27].

    Thus, the final goal is the search for the most general set of parameters to configure a whole video surveillance system and achieve optimal performance under representative and general situations.

    4.2 Proposed Optimization Using ES

    Our approach to achieving this goal follows several steps.

    First of all, a set of evaluation metrics per track has been proposed to assess the input parameters. The core of the evaluation function uses metrics based on ground truth to evaluate the output quality for any configuration.

    Next, we take a representative set of scenarios to train the system.

    Then, the final aspect is the adjustment of these parameters. By using the evaluation metric mentioned above, we can apply different techniques to assess the control parameters in order to regulate the performance of the tracking system and subsequently optimize them. Classical optimization techniques such as those based on gradient descent are poorly suited to these types of problems, due to the high number of local minima presented by the fitness landscape. More appropriate techniques are those based on evolutionary algorithms (EA) such as genetic algorithms (GA) or evolution strategies (ES) [28-30]. In particular, we select evolution strategies (ES) for this problem because they present high robustness and immunity to local extremes and discontinuities in the fitness function [31-33].

    Therefore, the tool used to adjust the parameters is evolution strategies.
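    A minimal sketch of an evolution strategy for parameter adjustment is given below. The variant used in the cited works is not specified in this section, so the sketch assumes a simple (mu + lambda) strategy with Gaussian mutation of fixed step size; the function name, the default settings and the toy fitness function are all hypothetical. In the surveillance setting, `fitness` would run the tracker with the candidate parameters and return the evaluation metric.

```python
import random


def evolution_strategy(fitness, lower, upper, mu=5, lam=20, sigma=0.1,
                       generations=50, seed=0):
    """Minimize fitness over a box with a (mu + lambda) evolution strategy."""
    rng = random.Random(seed)
    dim = len(lower)

    def clip(value, i):
        return min(max(value, lower[i]), upper[i])

    parents = [
        [rng.uniform(lower[i], upper[i]) for i in range(dim)] for _ in range(mu)
    ]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(parents)
            # Gaussian mutation scaled to the width of each parameter range.
            child = [
                clip(parent[i] + rng.gauss(0.0, sigma * (upper[i] - lower[i])), i)
                for i in range(dim)
            ]
            offspring.append(child)
        # Plus-selection: the best mu of parents and offspring survive.
        parents = sorted(parents + offspring, key=fitness)[:mu]
    return parents[0]


if __name__ == "__main__":
    # Toy stand-in for the tracker evaluation: distance to a known optimum.
    def toy_fitness(p):
        return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2

    print(evolution_strategy(toy_fitness, [0.0, 0.0], [1.0, 1.0]))
```

    No gradient of the fitness function is needed, which matters when the fitness is the outcome of running a whole tracking chain on recorded video.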

    Finally, we need to propose a generalization algorithm that allows us to find the most suitable set of parameters for the surveillance system for different scenarios. The generalization method consists of combining the evaluation functions of each track in several ways and steps in order to build a gradually more general fitness function.

    The parameters that control our surveillance system and must be optimized in this particular application are:

    THRESHOLD: defines whether a pixel is considered as a moving target or as a variation in the background.

    MIN AREA: defines a minimum blob area in order to reduce false detections due to noise.

    MARGIN GATE: an outer gate defining the permissible area for searching for blobs separated from the estimated rectangular box enclosing the target.

    MINIMUM DENSITY: the minimum density required when blobs are connected in order to form a bigger blob that represents a target.

    CONFLICT: this parameter decides whether tracks are extrapolated or not when there is overlap between tracks.

    VARIANCE ACEL: the smoothing degree of the Kalman filter used in the tracker.

    MINIMUM TRACK AREA: defines a minimum track area in order to reduce wrong tracks, which probably contain fragments of the real targets.

    MARGIN INITIALIZATION: defines the protected areas around confirmed tracks to avoid the creation of potential tracks.

    4.3 Adjustment of Surveillance System: Evaluation and Generalization

    The performance evaluation calculates some numerical values by means of a set of proposed metrics, based on the ground truth. This ground truth is the result of a study of pre-recorded video sequences and a subsequent process in which a human operator selects coordinates for each target [34]. The coordinates of the targets are selected frame by frame; they are marked and bounded with rectangles, taking the upper-left and lower-right corners as the location of the targets.

    The evaluation system computes four parameters per target, which are classified into accuracy metrics and continuity metrics (Fig. 6):

    Accuracy Metrics:

    1. Overlap-area (OAP). Overlap area (in percentage) between the real target and the detected track.

    2. X-error (EX) and Y-error (EY). Difference, in x and y coordinates, between their centers.

    Continuity Metrics:

    1. Number of tracks per target (NT). It is checked whether more than one detected track is matched with the same ideal track. If this happens, the program keeps the detected track with the larger overlapped area, removes the other one and marks the frame with a flag that indicates the number of detected tracks associated with this ideal one.

    2. Commutation (C). A commutation occurs when the identifier of a track matched to an ideal track changes. It typically takes place when the track is lost and recovered later.


    Fig. 6. Evaluation system

    Fig. 7. Example of mismatched track. There are three tracks and only two targets

    The evaluation function is based on the previous metrics, by means of a weighted sum of different terms which are computed for each target i in a scenario j:

    $$e_{i,j} = \frac{W_1 M}{T} + \frac{W_2 \sum (1 - OAP) + W_3 \sum E_X + W_4 \sum E_Y}{CT} + \frac{W_5\, OC + W_6\, UC + W_7\, C}{T} \qquad (1)$$

    with the terms defined as follows:

    Mismatch (M): a counter which stores how many times the ground truth and the tracked object data do not match up (refer to Fig. 7).

    The next three terms are the total sum of the non-overlapped areas, $\sum (1 - OAP)$, and the central errors along the x axis ($E_X$) and the y axis ($E_Y$).

    The next two elements are two counters:

    Overmatch-counter (OC): how many times the ground truth track is matched with more than one tracked object.


    Undermatch-counter (UC): how many times the ground truth track is not matched with any track at all.

    The number of commutations in the track under study (C).

    The continuity elements are normalized by the time length of the track, T, while the accuracy terms are normalized by the time length over which the track is continuous, CT (i.e. when they can be computed).

    W1, ..., W7 are the relative weights for the terms. The highest values have been given to the continuity terms, since this aspect is the key to guaranteeing global viability.
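    The per-target evaluation can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `TrackMetrics` container, the field names and the example weights are hypothetical; OAP is taken as a fraction in [0, 1] rather than a percentage; the x and y errors are accumulated as absolute values; and the mismatch counter is grouped with the terms normalized by T.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrackMetrics:
    """Hypothetical container for the metrics of one target in one scenario."""
    mismatches: int             # M
    overlaps: Sequence[float]   # OAP per frame in which the track is continuous, in [0, 1]
    x_errors: Sequence[float]   # E_X per continuous frame
    y_errors: Sequence[float]   # E_Y per continuous frame
    overmatches: int            # OC
    undermatches: int           # UC
    commutations: int           # C
    track_length: int           # T, in frames


def evaluate_track(m, w):
    """Weighted sum of error terms for one target; lower values mean better tracking."""
    ct = len(m.overlaps)  # CT: frames in which the accuracy terms can be computed
    accuracy = 0.0
    if ct > 0:
        accuracy = (
            w[1] * sum(1.0 - o for o in m.overlaps)
            + w[2] * sum(abs(e) for e in m.x_errors)
            + w[3] * sum(abs(e) for e in m.y_errors)
        ) / ct
    continuity = (
        w[0] * m.mismatches
        + w[4] * m.overmatches
        + w[5] * m.undermatches
        + w[6] * m.commutations
    ) / m.track_length
    return accuracy + continuity


# Illustrative weights only: the continuity terms are weighted more heavily.
weights = (10.0, 1.0, 1.0, 1.0, 10.0, 10.0, 10.0)
metrics = TrackMetrics(
    mismatches=2,
    overlaps=[0.8, 0.9, 1.0, 0.7],
    x_errors=[1.0, -2.0, 0.0, 1.0],
    y_errors=[0.5, 0.5, 0.0, -1.0],
    overmatches=1,
    undermatches=0,
    commutations=1,
    track_length=10,
)
score = evaluate_track(metrics, weights)  # 1.65 (accuracy) + 4.0 (continuity), up to rounding
print(score)
```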

    In order to carry out a general evaluation (the generalization algorithm) over different targets and cases, aggregation operators must be applied over partial evaluations. The initial or basic function is this evaluation function per target (or track), where $x_{i,j}$ is the vector of metrics and $\theta$ is the vector of parameters to optimize:

    $$e_{i,j} = f(x_{i,j}, \theta) \qquad (2)$$

    Thus, the extension of the evaluation function must allow assessing simultaneously:

    One or various targets per scenario. Scenario j: $\{e_{1,j}, e_{2,j}, \ldots, e_{N_j,j}\}$

    Various scenarios with several targets per scenario. M scenarios: $\{e_{1,1}, e_{2,1}, \ldots, e_{N_1,1}, \ldots, e_{1,j}, e_{2,j}, \ldots, e_{N_j,j}, \ldots, e_{1,M}, e_{2,M}, \ldots, e_{N_M,M}\}$

    Two aggregation operators have been analysed:

    Sum:

    $$E_j = \sum_i e_{i,j} \qquad (3)$$

    $$E = \sum_i \sum_j e_{i,j} \qquad (4)$$

    Maximum (or Minimax):

    $$E_j = \max_i (e_{i,j}) \qquad (5)$$

    $$E = \max_i (\max_j (e_{i,j})) \qquad (6)$$
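    The practical difference between the two operators is easy to see on a toy example. In the sketch below, the function names and the numbers are illustrative assumptions; each inner list holds the per-target evaluations of one scenario.

```python
def aggregate_sum(scenarios):
    """Total error over every target of every scenario."""
    return sum(sum(targets) for targets in scenarios)


def aggregate_max(scenarios):
    """Worst-case (minimax) error: the largest error of any target in any scenario."""
    return max(max(targets) for targets in scenarios)


# Two hypothetical parameter sets evaluated on the same single scenario.
params_a = [[1.0, 1.0, 10.0]]  # very good on two targets, poor on the third
params_b = [[5.0, 5.0, 5.0]]   # uniformly mediocre

print(aggregate_sum(params_a), aggregate_sum(params_b))  # 12.0 15.0
print(aggregate_max(params_a), aggregate_max(params_b))  # 10.0 5.0
```

    Under the sum, the first parameter set is preferred; under the maximum, the second one is, because the worst-case operator rewards parameter sets that leave no target badly tracked.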

    In [35,36], the authors showed that a significant improvement of the global vision system is achieved, in terms of accuracy and robustness. With this optimization-based methodology, the inter-relation of parameters at different levels allows coherent behavior under different situations. A generalization analysis comparing two different operators, sum and worst-case aggregation, has shown the capability to overcome over-adaptation when particular cases are considered, and a continuous improvement when additional samples are aggregated in the training process.

    To validate our methodology, we compare our tracking system, tuned after the generalization process, against some existing methods. All of the following tracking systems are available in the open software of [37,38].

    CC (Connected Component Tracking) [39].

    MS (Mean Shift Tracking) or Kernel Based Tracking [40,41].

    MSPF (Particle Filter based on MS weight) [42].

    CCMSPF (Connected Component tracking and MSPF resolver for collision) [39].

    MSFG (Mean Shift Tracking with FG mask using) [43].

    CGA (Association by Compact Genetic Algorithm) [44].

    The training videos that we have used for the experiments consist of a set of three types of scenarios in an airport domain [36]. The scenarios represent a good set for training the system, as they are long and varied enough to cover the most common situations of surface movements of aircraft and cars on the roads of an airport. The first video includes five targets: four cars and luggage vehicles (T1, T2, T3 and T4) and a big airplane (T5). The second and third sequences have three aircraft (T1, T2 and T3). The second scenario is not a difficult situation: its three aircraft can be tracked very easily. Moreover, we use a simple tracking system based on rules [36]. The experiments are carried out over this simple tracking system following the methodology of adjustment and generalization explained before and the two aggregation functions: Minimax (Experiment I, Rules I) and Sum (Experiment II, Rules II).

    As can be seen in Table 1, and as pointed out before, our method generalizes well, obtaining a similar performance for all the cases. The classifiers CCMSPF, CC, MS and MSFG behave very well in the second scenario, the easiest one to analyse since there are only three big aircraft and no cars or buses. Nevertheless, all the new trackers perform badly when tracking the more difficult scenarios, in which there are big aircraft and small moving vehicles. Thus, our optimized tracking system has a performance between 11,000 and 14,564 for these difficult cases, whereas the rest of the systems present much higher values.

    As a result, we can conclude that the optimization gives us a trade-off that yields similar performance in all the scenarios used for training. We obtain a set of parameters that provides good performance for different scenarios in an airport environment. In addition, we highlight that good results are obtained with a very simple tracker after tuning it by means of the optimization methodology that we propose. On the other hand, more sophisticated trackers give good performance for easy scenarios (Video 2), whereas they do not perform as well in difficult situations where aircraft and small moving vehicles share the taxi-road (Video 1).


    5 Computational Intelligence Paradigms for Video Data Association

    5.1 Video Data Association Problem Definition

    Video data association is a blob-to-track multi-assignment problem. Several blobs (or none) could be assigned to the same track, and simultaneously several tracks could overlap and share common blobs. This can be formalized through the binary assignment matrix, A[k], defined as $A_{ij}[k] = 1$ if blob $b_i[k]$ is assigned to object $o_j$, and $A_{ij}[k] = 0$ otherwise.

    The blobs extracted in the kth frame are $b[k] = \{b_1[k], \ldots, b_{N_k}[k]\}$ and the objects tracked up to now are $o[k-1] = \{o_1[k-1], \ldots, o_{M_k}[k-1]\}$. The size of matrix A[k], $N_k \times M_k$, changes with time, since the number of blobs extracted depends on variable effects during image processing, and the number of objects also changes.

    In many applications, a basic metric used for data association is the observations-to-tracks distance, $d_{ij}$, computed through the Mahalanobis formula [12]:

    $$d_{ij} = [x_j - f_i]^t \, P_j^{-1} \, [x_j - f_i]; \quad i = 1, \ldots, N_k, \; j = 1, \ldots, M_k \qquad (7)$$

    The feature vectors $f_i$ are extracted from the sets of blobs corresponding to the jth track ($b_i[k]$ such that $A_{ij}[k] = 1$). The estimated state vectors, $x_j$, with state information and associated covariance $P_j$, are recursively updated with assigned observations by means of a Kalman filter. In these approaches, the optimal decision would be the combination for A[k] such that the sum of distances between assigned blobs and tracks is minimized:

    $$\sum_{i=1}^{N_k} \sum_{j=1}^{M_k} d_{ij} \qquad (8)$$

    The number of possible solutions for the Boolean matrix A is $2^{N_k M_k}$, so it is generally impractical to find the optimal decision through exhaustive enumeration of all association hypotheses. Furthermore, this could even be useless, since this metric can be an oversimplification of the real problem.
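    To make both points concrete, the sketch below evaluates the Mahalanobis distance for one blob and track pair and counts the association hypotheses for a small frame. It is restricted to two-dimensional position vectors so that the covariance can be inverted in closed form; the function names and the numbers are illustrative assumptions.

```python
def mahalanobis_2d(x, f, P):
    """d = (x - f)^t P^{-1} (x - f) for 2-D vectors x, f and a 2x2 covariance P."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    rx, ry = x[0] - f[0], x[1] - f[1]
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]], applied to the residual.
    ux = (d * rx - b * ry) / det
    uy = (-c * rx + a * ry) / det
    return rx * ux + ry * uy


def hypothesis_count(n_blobs, n_tracks):
    """Number of distinct Boolean assignment matrices of size n_blobs x n_tracks."""
    return 2 ** (n_blobs * n_tracks)


track_state = (10.0, 5.0)              # predicted track position x_j
blob_features = (12.0, 4.0)            # blob centroid f_i
covariance = ((4.0, 0.0), (0.0, 1.0))  # P_j: more uncertainty along x than along y

print(mahalanobis_2d(track_state, blob_features, covariance))  # 2.0
print(hypothesis_count(5, 3))                                  # 32768
```

    Even five blobs and three tracks already give 32768 hypotheses per frame, which is why exhaustive enumeration does not scale.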

    5.2 Fuzzy Association

    The method proposed here, detailed in [45,46], uses a fuzzy system to analyze interacting blobs and tracks. It computes confidence levels that are used to weight each gated blob's contribution to update the target track, including location and shape. Domain knowledge, represented as rules to compute the weights, is extracted from predefined situations (examples) to carry out an inductive generalization process covering all intermediate cases. This procedure is based on a simplified association technique (analogous to a JPDA approach), but complemented with a knowledge-based system to ponder the weights of blobs under uncertain conditions and so solve situations of high complexity.

    Fig. 8. Fuzzy concepts used for video association

    An explicit representation of target shape and dimensions is used in the association logic to select the set of updating blobs for each track. The weights of gated blobs are based on numeric heuristics (descriptors), computed with a simple geometrical analysis. They have been detailed in [45, 47] and are summarized next (see Fig. 8):

    Overlap. A soft gating, computed as the fraction of blob area contained within the track predicted region.

    Density. It evaluates the ratio between areas of detected regions and non-detected zones (holes) in the box enclosing the reconnected set of blobs. A low value will indicate that different targets have probably originated them.

    Conflict. This component evaluates the likelihood of a blob being in conflict with other tracks. This problem appears when target trajectories are so close that track gates overlap and share the blob.

    Coverage. The confidence in the predicted track is characterized with this heuristic, which assesses the confidence given to the fact that this track represents the motion of a real target. It is defined by the percentage of predicted area covered by blobs corresponding to detected targets.

    The previous heuristics are the input to relations indicating the confidence levels both for blobs and predicted tracks in the update process. A rule base approximates these relations. The detailed description of heuristics, translated into linguistic variables, sets, and rules, appears in [45,46,48].
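    Two of these descriptors, overlap and coverage, reduce to rectangle arithmetic when blobs and tracks are represented by axis-aligned bounding boxes. The following is a minimal sketch under that assumption; the box format `(x0, y0, x1, y1)` and the function names are hypothetical, and the coverage computation additionally assumes that the blobs do not overlap each other.

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a, b):
    """Box common to a and b (possibly empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def overlap(blob, track):
    """Fraction of the blob area that lies inside the predicted track region."""
    blob_area = area(blob)
    return area(intersection(blob, track)) / blob_area if blob_area > 0 else 0.0


def coverage(track, blobs):
    """Fraction of the predicted track area covered by blobs (blobs assumed disjoint)."""
    track_area = area(track)
    if track_area == 0:
        return 0.0
    covered = sum(area(intersection(blob, track)) for blob in blobs)
    return min(1.0, covered / track_area)


predicted_track = (0.0, 0.0, 10.0, 10.0)
blobs = [(8.0, 2.0, 12.0, 6.0), (0.0, 0.0, 5.0, 5.0)]

print(overlap(blobs[0], predicted_track))  # 0.5: half of the first blob is inside the gate
print(coverage(predicted_track, blobs))    # 0.33: a third of the predicted area is explained
```

    In the fuzzy system these numeric values would then be mapped to linguistic labels before the rules are applied.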


    The target's estimated shape is restricted to vary smoothly, according to the computed weights. The estimated position depends on both the blobs' and the tracks' confidence levels. Estimated shape (dimensions of the box) is the most constrained feature, remaining frozen while the blobs' confidence levels are not high enough, while estimated position is a trade-off between the confidence levels estimated for both blobs and tracks. For instance, in the horizontal coordinate, the two gated blobs with the minimum and maximum extremes for coordinate x, $(x_{bmin}, x_{bmax})$, are taken into account. The target shape, $l_H[k]$, is updated considering the minimum blob confidence value, $\mathit{min}_H$, and the value estimated for the last frame, $l_H[k-1]$:

    $$l_H[k] = \mathit{min}_H \,(x_{bmax} - x_{bmin}) + (1 - \mathit{min}_H)\, l_H[k-1] \qquad (9)$$

    So, the estimated target length (and width) is modified only if all blobs have enough confidence. Otherwise, in the case that at least one blob has low confidence (for instance during a multi-track conflict), the length and width are kept constant until full confidence is recovered. The estimated target bounds (location of the box) are updated close to the blob with the highest confidence, $\mathit{max}_H$, considering also the value for track confidence. For instance, if the left-hand side blob defining value $x_{bmin}$ had the highest confidence, the target bounds would be updated taking the bound defined by this blob and the value predicted since the last update, $x_{min}[k-1]$:

    $$x_{min}[k] = \mathit{max}_H \, x_{bmin} + (1 - \mathit{max}_H)\,(x_{min}[k-1] + v_x[k-1]\,T) \qquad (10)$$

    $$x_{max}[k] = x_{min}[k] + l_H[k] \qquad (11)$$
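    The update rules (9) to (11) translate directly into code. The sketch below covers only the horizontal coordinate; the function and argument names are illustrative, and the confidence values are supplied by hand here, whereas in the described system they would come from the fuzzy rule base.

```python
def update_horizontal_bounds(x_bmin, x_bmax, conf_min, conf_max, l_prev, x_min_prev, v_x, dt):
    """Return (x_min, x_max, length) of the target box for the current frame.

    conf_min: lowest confidence among the gated blobs, drives the shape update (Eq. 9)
    conf_max: confidence of the most reliable blob, drives the position update (Eq. 10)
    """
    # Eq. 9: blend the measured extent with the previous length.
    length = conf_min * (x_bmax - x_bmin) + (1.0 - conf_min) * l_prev
    # Eq. 10: blend the blob bound with the bound predicted from the last update.
    x_min = conf_max * x_bmin + (1.0 - conf_max) * (x_min_prev + v_x * dt)
    # Eq. 11: the right bound follows from the left bound and the length.
    return x_min, x_min + length, length


# Conflict-free frame: full confidence, so the box snaps to the blobs.
print(update_horizontal_bounds(10.0, 30.0, 1.0, 1.0, 20.0, 9.0, 2.0, 1.0))  # (10.0, 30.0, 20.0)

# Conflict frame: a merged blob stretches to x = 50, but its confidence is zero,
# so the length stays locked at 20 and only the left bound is nudged.
x_min, x_max, length = update_horizontal_bounds(13.0, 50.0, 0.0, 0.8, 20.0, 10.0, 2.0, 1.0)
print(round(x_min, 6), round(x_max, 6), length)
```

    In the second call the left bound moves to 0.8 * 13 + 0.2 * (10 + 2) = 12.8 while the length is unchanged, which is the locking behaviour the text describes for occlusions.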

    Figure 9 shows an example of track shape update with two targets overlapping while they cross. Due to the conflicting blob, the rule to lock dimensions is applied. Bounds are computed to conform to the conflict-free blobs (with high confidence levels for association). So, the effect of occlusion is minimized.

    Beyond the representation of expert criteria in the rule base, learning techniques were exploited to automatically learn and tune the proposed system. A machine-learning procedure (neuro-fuzzy technique) has been applied in order to extract rules directly from examples, analyzing the capability to provide right decisions in different conditions. This automatic procedure was applied as an alternative way to tune the membership functions of the labels of the linguistic variables used to represent the knowledge.

    In our work [47, 49], the fuzzy system for association used Mamdani implication. The Nauck/Kruse neuro-fuzzy approach was applied, directly using this type of implementation for the implication operator.

    Three different fuzzy systems have been tested and compared with a conventional data association system. Rules for the first one were obtained using expert knowledge; the second integrated rules learned from pre-classified examples. The rigid scheme with hardwired decisions was taken as a benchmark and compared with the fuzzy systems, considering the three variants of rule sets mentioned above. This analysis was performed on representative scenarios processed to obtain and store the reference ground truth [49].


    Fig. 9. Shape update during conflict

    5.3 Video Association Using Estimation of Distribution Algorithms (EDAs)

    Evolutionary algorithms (EAs) have proven to be effective general-purpose search techniques. One of the main problems with EAs is the adjustment of their parameters, especially those related to the crossover and mutation operators. Recently, a new family of algorithms has appeared that bases its behavior on the statistical modeling of genetic algorithms.

    The Estimation of Distribution Algorithms (EDAs) [50] replace the use of an evolving population by a vector that directly codifies the joint probability distribution of the vectors corresponding to the best solutions. The crossover and mutation operators are replaced by rules that update the probability distribution. A great advantage of EDAs over evolutionary algorithms is that they allow expressing the interactions between variables of the problem by means of the associated joint probability distribution. In addition, they improve the convergence time and reduce the memory space required for their operation.

    EDAs present suitable features to deal with problems requiring a very efficient search: small populations and few iterations, compared with the more classic approaches to evolutionary algorithms (EAs). The fundamental difference between EDAs and classical EAs is that the former search for the probability distribution describing the optimal solutions, while EAs search directly over the solutions themselves. They share the need to codify solutions by means of binary chains (in EA terminology, the individuals) and to define a merit measure that orients the search direction, the so-called fitness function. In the case of EDAs, operators that manipulate individuals in the search, such as mutation, selection, and crossover, are not needed, since the search is performed directly on the distribution which describes all possible individuals.

    The high-level algorithms of the EDA and the EA are compared in the following pseudocode.

    EDA:

    1. Generate a population randomly
    2. Select a set of fitter individuals
    3. Estimate a probabilistic model over the fitter individuals
    4. Obtain a new set of individuals by sampling the probabilistic model
    5. Incorporate the new set into the population
    6. If the termination criterion is not satisfied, go to 2

    EA:

    1. Generate a population randomly
    2. Select a set of fitter individuals
    3. Obtain a new set of individuals by applying the crossover and mutation operators
    4. Incorporate the new set into the population
    5. If the termination criterion is not satisfied, go to 2

    The key point in the use of EDAs is the estimation of the joint probability distribution. The simplest situation is that in which the joint probability distribution factorizes as a product of univariate and independent distributions, that is to say, there is no dependency between the variables. In this situation the probability distribution is estimated using the marginal frequencies. EDAs are classified according to the dependencies considered between the variables. The simplest model considers independence between the variables; UMDA [51], PBIL [52] and CGA [53] are characteristic algorithms of this type. The MIMIC [54] algorithm incorporates bivariate dependencies, and FDA [55] is an example of a model for multiple dependencies. This algorithm uses a Bayesian network as probabilistic model; this characteristic confers a great capacity for representing dependencies, but the computational cost is very high.
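    The univariate case can be sketched compactly. The following is a minimal UMDA-style loop under stated assumptions: it maximizes a toy fitness (the number of 1 bits), resamples the whole population from the model at every generation instead of merging new individuals into the old population, and clamps the marginals to [0.05, 0.95] so that no bit value becomes unreachable. The function names and parameter values are illustrative and do not reproduce any of the cited algorithms exactly.

```python
import random


def umda(fitness, n_bits, pop_size=40, n_selected=20, generations=30, seed=1):
    """Maximise `fitness` over bit strings with independent per-bit marginals."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # P(bit = 1), one independent marginal per position
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sample a population from the current product distribution.
        population = [
            [1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)
        ]
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) > best_fit:
            best, best_fit = population[0], fitness(population[0])
        # Select the fitter individuals and re-estimate every marginal
        # from the frequency of 1s among them.
        selected = population[:n_selected]
        probs = [sum(ind[i] for ind in selected) / n_selected for i in range(n_bits)]
        probs = [min(max(p, 0.05), 0.95) for p in probs]
    return best, best_fit, probs


def one_max(bits):
    """Toy fitness: number of bits set to 1."""
    return sum(bits)


best, best_fit, probs = umda(one_max, n_bits=12)
print(best, best_fit)
```

    No crossover or mutation operator appears anywhere: the only state carried between generations is the vector of marginal probabilities, which is what makes the memory footprint small.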

    Application of EDAs to Video Association Problem

    The association problem has been defined as a search over possible blob assignments. This problem could be defined as minimizing a heuristic function that evaluates blob assignments, by means of an efficient algorithm (an Estimation of Distribution Algorithm). The heuristic function takes a Bayesian approach to model the errors in observations. The formulation of data association as a minimization problem solved by a genetic technique is not a handicap with respect to the required real-time operation. If we restrict the maximum number of evaluations, a worst-case number of operations can be fixed, which bounds the time consumed by the algorithm. Then, given a certain population size, the algorithm will run a number of generations limited by this bound on the number of evaluations. The most important aspect is that the EDA should converge to acceptable solutions under these conditions of limited population size and number of generations.

    Heuristic to Evaluate Assignments

    This section describes the heuristic of the search, which determines the quality of the solutions and guides the search toward the optimal one. An extended distance is used as the evaluation function for groups of detected blobs assigned to tracks according to matrix A (A represents each hypothesis to be evaluated). The heuristic is aimed at providing a measure of the probability density of the observations assigned to tracks. This likelihood function considers several types of terms and their probabilistic characterization: the separation between tracks and centroids of groups of blobs, the similarity between track-smoothed target attributes and those extracted from blob groups, and the events related to erasing existing tracks and creating new ones. As mentioned in the introduction, the final objective is to achieve a good trade-off between the capability to re-connect image regions, keeping a single track per target, and at the same time avoiding the misassignment of blobs coming from different objects or from extraneous sources.

    The extended distance allows the evaluation of a certain hypothesis for grouping blobs in sets and assigning them to tracks. The term considering the centroid residual, typically used in other approaches, is enriched with terms for attributes, to take into account the available structural characteristics of targets which can be extracted from data. There are also terms considering that hypotheses may label some blobs as false alarms or may leave confirmed tracks with no updating blobs:

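    $$\log P\big(b[k] \mid A[k], x_{1,\ldots,M}[k-1]\big) = \log \prod_{j\text{th track}} D_{\mathrm{Group} \to \mathrm{Track}(j)} = \sum_{j\text{th track}} \log D_{\mathrm{Group} \to \mathrm{Track}(j)} \qquad (12)$$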
    If we denote the blobs assigned to the jth track as

    $$\mathrm{Group}(i) \to \mathrm{Track}(j) = \{b_i[k] \mid A_{ij}[k] = 1\} \qquad (13)$$

    then

    $$d_{ij} = d_{\mathrm{Group}(i) \to \mathrm{Track}(j)} = d_{Centroid(i,j)} + d_{Attributes(i,j)} + d_{PD(i,j)} + d_{PFA(i)} \qquad (14)$$

    where the sub-indices i, j refer to the ith group of blobs and the jth track:

    $d_{Centroid(i,j)}$: the normalized residual between the jth track prediction and the centroid of the group of blobs assigned under the ith hypothesis.

    $d_{Attributes(i,j)}$: the normalized residual between track attributes and those extracted from the group. Its value is given assuming Gaussian distribution and attribute independence.

    $d_{PD(i,j)}$: assesses the cost of not updating a confirmed track for those hypotheses in which no blob is assigned to the jth track. It considers the probability of updating each track.

    $d_{PFA(i)}$: assesses the cost of labeling a blob as a false alarm, also assuming a certain probability of false alarm, PFA.

    Encoding and Efficient Search with EDA Algorithms

    The association consists of finding the appropriate values for the assignment matrix A, where element A(i, j) is 1 if blob i is assigned to object j and 0 otherwise. In order to be able to use the techniques of evolutionary algorithms, the matrix A is codified as a string of bits. The size of matrix A is $N \times M$, with N the number of extracted blobs and M the number of objects in the scene. A first possibility for problem encoding was tried with a string of integer numbers representing the possible M objects to be assigned to each blob, including the null track 0, as shown in Fig. 10.

    This encoding requires strings of $N \log_2(1 + M)$ bits and has the problem of constraining the search to solutions in which each blob can belong to one object at most. This could be a problem in situations where images from different objects overlap, and may leave some tracks unassigned and lost. Then, a direct encoding of the A matrix was used for general solutions, where the positions in the string represent the assignments of blobs to tracks. With this codification, where individuals need $N(1 + M)$ bits, a blob can be assigned to several objects (see Fig. 11).
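    The two encodings can be compared with a small sketch. The bit counts follow the expressions given above, with the logarithm rounded up to a whole number of bits per blob. The layout of the direct encoding, one row of 1 + M bits per blob with the first bit reserved for the null track, is an assumption made for this example, as are all function names.

```python
import math


def integer_encoding_bits(n_blobs, n_tracks):
    """Bits needed when each blob stores a single track index in 0..n_tracks."""
    return n_blobs * math.ceil(math.log2(1 + n_tracks))


def direct_encoding_bits(n_blobs, n_tracks):
    """Bits needed when each blob stores one flag per track plus the null track."""
    return n_blobs * (1 + n_tracks)


def integer_to_direct(assignment, n_tracks):
    """Expand one track index per blob (0 = null track) into a flat bit string."""
    bits = []
    for track in assignment:
        row = [0] * (1 + n_tracks)
        row[track] = 1
        bits.extend(row)
    return bits


def direct_to_matrix(bits, n_tracks):
    """Recover A (rows = blobs, columns = real tracks), dropping the null-track bit."""
    width = 1 + n_tracks
    return [bits[i + 1:i + width] for i in range(0, len(bits), width)]


assignment = [1, 0, 3, 3]  # blob 0 -> track 1, blob 1 -> no track, blobs 2 and 3 -> track 3
print(integer_encoding_bits(4, 3), direct_encoding_bits(4, 3))  # 8 16

bits = integer_to_direct(assignment, n_tracks=3)
bits[2 * 4 + 1] = 1  # only the direct encoding can also give blob 2 to track 1
print(direct_to_matrix(bits, n_tracks=3))
# [[1, 0, 0], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
```

    The direct encoding costs more bits, but it is the only one of the two that can represent a blob shared by overlapping targets.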

    Fig. 10. Simple encoding for blob assignment

    Fig. 11. Direct encoding for whole A matrix

    Fig. 12. Application of EDAs to maritime scenes

    Finally, in order to allow an effective search, the initial individuals are not randomly generated but are fixed to solutions in which each blob is assigned to the closest object. So, the search is performed over combinations starting from this solution, in order to optimize the heuristic after changing any part of this initial configuration. Besides, for the case of EDA algorithms, the vector probabilities are constrained to be zero for very distant pairs, and those blobs which fall in the spatial gates of more than one track have a non-zero change probability.
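    The seeding and the probability constraints can be sketched as follows. This is a minimal sketch under stated assumptions: blobs and tracks are reduced to centroid points, gating uses a plain Euclidean radius, and the change probability of 0.5 is an arbitrary illustrative value; none of these choices is taken from the cited work.

```python
import math


def distances(blob, tracks):
    """Euclidean distance from one blob centroid to every track centre."""
    return [math.hypot(blob[0] - tx, blob[1] - ty) for tx, ty in tracks]


def initial_assignment(blobs, tracks, gate):
    """Seed solution: each blob goes to its closest track, if that track is within the gate."""
    matrix = []
    for blob in blobs:
        dist = distances(blob, tracks)
        row = [0] * len(tracks)
        if dist:
            closest = min(range(len(tracks)), key=dist.__getitem__)
            if dist[closest] <= gate:
                row[closest] = 1
        matrix.append(row)
    return matrix


def change_probabilities(blobs, tracks, gate, p_change=0.5):
    """Probability of altering each bit of the seed: zero for distant pairs,
    non-zero only for blobs lying inside the gate of more than one track."""
    probs = []
    for blob in blobs:
        in_gate = [d <= gate for d in distances(blob, tracks)]
        contested = sum(in_gate) > 1
        probs.append([p_change if (contested and g) else 0.0 for g in in_gate])
    return probs


tracks = [(0.0, 0.0), (10.0, 0.0)]
blobs = [(1.0, 0.0), (5.0, 0.0), (30.0, 30.0)]

print(initial_assignment(blobs, tracks, gate=6.0))    # [[1, 0], [1, 0], [0, 0]]
print(change_probabilities(blobs, tracks, gate=6.0))  # [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0]]
```

    Only the second blob, which lies between both tracks, remains open to the search; the other bits are fixed, which shrinks the space the EDA has to explore.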

    In [56], the authors present the application of EDAs to tracking boats in maritime scenes. This is a challenging problem due to the complex segmentation of these images. The sea is in continuous movement, which contributes to the creation of a great amount of noisy blobs (see Fig. 12).

    6 Conclusions

    Some approaches based on computational intelligence have been applied to develop video and multimedia processing systems in visual sensor networks. The knowledge about the domain is exploited in the form of fuzzy rules for data association and heuristic evaluation functions to optimize the design and guide the search for appropriate decisions. The results, referring to different works and mainly obtained with evaluation metrics based on ground truth, showed that these strategies result in competitive solutions in the context of their application domains. Furthermore, the proposed multi-agent architecture for cooperative operation will allow gains in scalability when deploying the system to cover wide areas.


    References

    1. F. Castanedo, M. A. Patricio, J. Garcia, and J. M. Molina. Extending surveillance systems capabilities using BDI cooperative sensor agents. In VSSN '06: Proceedings of the Fourth ACM International Workshop on Video Surveillance and Sensor Networks, pages 131–138, New York, NY, USA, 2006. ACM Press.

    2. R. Cucchiara. Multimedia surveillance systems. In VSSN '05: Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, pages 3–10, New York, NY, USA, 2005. ACM Press.

    3. M. A. Patricio, J. Carbo, O. Perez, J. García, and J. M. Molina. Multi-agent framework in visual sensor networks. EURASIP Journal on Advances in Signal Processing, 2007:Article ID 98639, 21 pages, 2007. doi:10.1155/2007/98639.

    4. R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. In Proceedings of the IEEE, volume 89, IEEE, October 2001.

    5. C. S. Regazzoni, V. Ramesh, and G. L. Foresti. Special issue on video communications, processing, and understanding for third generation surveillance systems. In Proceedings of the IEEE, volume 89, October 2001.

    6. B. P. L. Lo, J. Sun, and S. A. Velastin. Fusing visual and audio information in a distributed intelligent surveillance system for public transport systems. Acta Automatica Sinica, 29(3):393–407, 2003.

    7. X. Yuan, Z. Sun, Y. Varol, and G. Bebis. A distributed visual surveillance system. In IEEE Conference on Advanced Video and Signal Based Surveillance, pages 199–205, Florida, 2003.

    8. M. Valera and S. A. Velastin. Intelligent distributed surveillance systems: a review, 152:192–204, April 2005.

    9. M. Wooldridge and N. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 1995.

    10. O. Perez, M. A. Patricio, J. García, and J. M. Molina. Improving the segmentation stage of a pedestrian tracking video-based system by means of evolution strategies. In Eighth European Workshop on Evolutionary Computation in Image Analysis and Signal Processing. EvoIASP 2006, Budapest, Hungary, April 2006.

    11. E. Y. Kim and S. H. Park. Automatic video segmentation using genetic algorithms. Pattern Recognition Letters, 27(11):1252–1265, 2006.

    12. Samuel S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, Inc., 1999.

    13. D. L. Hall and J. Llinas. Handbook of MultiSensor Data Fusion. CRC Press, Boca Raton, 2001.


    14. Ingemar J. Cox and Sunita L. Hingorani. An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):138–150, 1996.

    15. K. Pattipati, S. Deb, and Y. Bar-Shalom. A new relaxation algorithm and passive sensor data association. IEEE Transactions on Automatic Control, 37:198–213, 1992.

    16. Y. Ruan and P. Willett. Multiple model PMHT and its application to the benchmark radar tracking problem. IEEE Transactions on Aerospace and Electronic Systems, 40(4):1337–1350, October 2004.

    17. I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, 2000.

    18. O. Perez, M. A. Patricio, J. Garcia, and J. M. Molina. Fusion of surveillance information for visual sensor networks. In Proceedings of the Ninth International Conference on Information Fusion, Florence (Italy), July 2006.

    19. A. Rao and M. George. BDI agents: from theory to practice. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS'95), pages 312–319, Cambridge, MA, USA, 1995. The MIT Press, Cambridge, MA.

    20. M. E. Bratman. Intentions, Plans and Practical Reasoning. Harvard University Press, Cambridge, MA, 1987.

    21. D. Dennett. The Intentional Stance. Bradford Books, 1987.

    22. A. Pokahr, L. Braubach, and W. Lamersdorf. Jadex: Implementing a BDI infrastructure for JADE agents. Search of Innovation (Special Issue on JADE), 3(3):76–85, September 2003.

    23. Y. Labrou, T. Finin, and Y. Peng. Agent communication languages: The current landscape. IEEE Intelligent Systems, 14(2):45–52, 1999.

    24. F. Castanedo, M. A. Patricio, J. Garcia, and J. M. Molina. Bottom-up/top-down coordination in a multiagent visual sensor network. In 2007 IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2007). IEEE Computer Society, 2007.

    25. P. J. Withagen. Object detection and segmentation for visual surveillance. ASCI dissertation series number 120, Advanced School for Computing and Imaging (ASCI), Delft University of Technology, 2005.

    26. P. Lobato Correia and F. Pereira. Objective evaluation of video segmentation quality. IEEE Transactions on Image Processing, 12(2):186–200, 2003.

    27. B. W. Wah. Generalization and generalizability measures. In IEEE Transactions on Knowledge and Data Engineering, volume 11, pages 175–186, 1999.

    28. I. Rechenberg. Evolutionsstrategie. Friedrich Fromman Verlag, Stuttgart, Germany, 1973.

    29. I. Rechenberg. Evolutionsstrategie '94. Friedrich Fromman Verlag, Stuttgart, Germany, 1994.

    30. Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies: A comprehensive introduction. Springer, Netherlands, 2004.

    31. T. Back. Evolutionary Algorithms in Theory and Practice. Oxford University Press, New York, 1996.

    32. D. B. Fogel, T. Back and Z. Michalewicz. Evolutionary Computation: Advanced Algorithms and Operators. Institute of Physics, London, 2000.

    33. D. B. Fogel, T. Back and Z. Michalewicz. Evolutionary Computation: Basic Algorithms and Operators. Institute of Physics, London, 2000.


    34. D. Doermann and D. Mihalcik. Tools and techniques for video performance evaluation. In Proceedings of the International Conference on Pattern Recognition (IPCER'00), pages 4167–4170, Barcelona, Spain, September 2000.

    35. J. Garcia, J. A. Besada, A. Berlanga, J. M. Molina, G. de Miguel, and J. R. Casar. Application of evolution strategies to the design of tracking filters with a large number of specifications. 8:766–779, 2003.

    36. O. Perez, J. García, A. Berlanga, and J. M. Molina. Evolving parameters of surveillance video systems for non-overfitted learning. Proceedings of the Seventh European Workshop on Evolutionary Computation in Image Analysis and Signal Processing (EvoIASP'05), pages 386–395, 2005.

    37. OpenCV. intel.com/technology/computing/opencv/index.htm, 2007.

    38. T. P. Chen, H. Haussecker, A. Bovyrin, R. Belenov, K. Rodyushkin, A. Kuranov, and V. Eruhimov. Computer vision workload analysis: Case study of video surveillance systems. 9(2):109–118, May 2005.

    39. D. da Silva Pires, R. M. Cesar-Jr, M. B. Vieira, and L. Velho. Tracking and Matching Connected Components from 3D Video. Proceedings of the XVIII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI'05), 05, 2005.

    40. D. Comaniciu and P. Meer. Mean shift analysis and applications. Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, 2, 1999.

    41. D. Comaniciu and V. Ramesh. Real-time tracking of non-rigid objects using mean shift, July 8 2003. US Patent 6,590,999.

    42. B. Zhang, W. Tian, and Z. Jin. Joint tracking algorithm using particle filter and mean shift with target model updating. Chinese Optics Letters, 4:569–572, 2006.

    43. L. Li, W. Huang, I. Y. H. Gu, and Q. Tian. Statistical modeling of complex backgrounds for foreground object detection. Image Processing, IEEE Transactions on, 13(11):1459–1472, 2004.

    44. F. Cupertino, E. Mininno, and D. Naso. Elitist Compact Genetic Algorithms for Induction Motor Self-tuning Control. Evolutionary Computation, 2006. CEC 2006. IEEE Congress on, pages 3057–3063, 2006.

    45. J. García, J. M. Molina, J. A. Besada, and J. I. Portillo. A multitarget tracking video system based on fuzzy and neuro-fuzzy techniques. EURASIP Journal on Applied Signal Processing, 14:2341–2358, 2005.

    46. J. García, J. A. Besada, J. M. Molina, J. Portillo, and J. R. Casar. Robust object tracking with fuzzy shape estimation. In FUSION '02: Proceedings of the International Conference on Information Fusion, Washington, DC, USA, 2002. IEEE ISIF.

    47. J. M. Molina, J. García, O. Perez, J. Carbo, A. Berlanga, and J. Portillo. Applying fuzzy logic in video surveillance systems. Mathware and Soft Computing, 12(3):185–198, 2005.

    48. J. García, J. A. Besada, J. M. Molina, J. I. Portillo, and G. de Miguel. Fuzzy data association for image-based tracking in dense scenarios. In David B. Fogel, Mohamed A. El-Sharkawi, Xin Yao, Garry Greenwood, Hitoshi Iba, Paul Marrow, and Mark Shackleton, editors, Proceedings of the 2002 Congress on Evolutionary Computation CEC2002. IEEE Press, 2002.

    49. J. Garca, O. Perez, A. Berlanga, and J. M. Molina. An evaluation metric foradjusting parameters of surveillance video systems, chapter in Computer Visionand Robotics. Nova Science Publishers, 2004.

  • Computational Intelligence in Visual Sensor Networks 377

    50. P. Larraniaga and J. A. Lozano. Estimation of Distribution Algorithms: A NewTool for Evolutionary Computation. Kluwer, Norwell, MA, USA, 2001.

    51. H. Muhlenbein. The equation for response to selection and its use for prediction.Evolutionary Computation, 5(3):303346, 1997.

    52. S. Baluja. Population-based incremental learning: A method for integratinggenetic search based function optimization and competitive learning,. TechnicalReport CMU-CS-94-163, CMU-CS, Pittsburgh, PA, 1994.

    53. G. R. Harik, F. G. Lobo, and D. E. Goldberg. The compact genetic algorithm.IEEE-EC, 3(4):287, November 1999.

    54. Jeremy S. de Bonet, Charles L. Isbell, Jr., and Paul Viola. MIMIC: Findingoptima by estimating probability densities. In Michael C. Mozer, Michael I.Jordan, and Thomas Petsche, editors, Advances in Neural Information Process-ing Systems, volume 9, page 424. The MIT Press, Cambridge, MA, 1997.

    55. H. Muhlenbein and T. Mahnig. The factorized distribution algorithm for addi-tively decompressed functions. In 1999 Congress on Evolutionary Computation,pages 752759, Piscataway, NJ, 1999. IEEE Service Center.

    56. M. A. Patricio, J. Garca, A. Berlanga, and J. M. Molina. Video tracking associ-ation problem using estimation of distribution algorithms in complex scenes. InArticial Intelligence and Knowledge Engineering Applications: A BioinspiredApproach: First International Work-Conference on the Interplay Between Nat-ural and Articial Computation, Lecture Notes in Computer Science. SpringerBerlin Heidelberg New York, 2007.

  • Scalability and Evaluation of Contextual Immune Model for Web Mining

    Slawomir T. Wierzchon¹,², Krzysztof Ciesielski¹, and Mieczyslaw A. Klopotek¹,³

    ¹ Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warszawa, Poland; stw,kciesiel,klopotek@ipipan.waw.pl

    ² Faculty of Mathematics, Physics and Informatics, Gdansk University, Wita Stwosza 57, 80-952 Gdansk-Oliwa

    ³ Institute of Computer Science, University of Podlasie, Konarskiego 2, 08-110 Siedlce

    Summary. In this chapter we focus on some problems concerning the application of an immune-based algorithm to the extraction and visualization of cluster structure. In particular, a hierarchical, topic-sensitive approach is proposed; it appears to be a robust solution to the problem of scalability of the document map generation process (both in terms of time and space complexity). This approach relies upon extraction of a hierarchy of concepts, i.e. almost homogenous groups of documents described by unique sets of terms. To represent the content of each context a modified version of the aiNet [9] algorithm is employed; it was chosen because of its natural ability to represent internal patterns existing in a training set. A careful evaluation of the effectiveness of the novel text clustering procedure is presented in the section reporting experiments.

    1 Introduction

    When the number of terms per query was analyzed in one billion accesses to the Altavista site [12], extraordinary results were observed by Alan Gilchrist: (a) in 20.6% of queries no term was entered, (b) in almost 25% of queries only one term was used in a search, and (c) the average was not much higher than two terms! This justifies our interest in looking for more user-friendly interfaces to web browsers.

    S.T. Wierzchon et al.: Scalability and Evaluation of Contextual Immune Model for Web Mining, Studies in Computational Intelligence (SCI) 96, 379–408 (2008)
    www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

    A first stage in improving the effectiveness of Information Retrieval (IR) systems was to apply the idea of clustering, inspired by earlier studies of Salton [21] and reinvigorated by Rijsbergen's Cluster Hypothesis [24]. According to this hypothesis, relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. Thus, it is possible to reduce the number of documents that need to be compared to a given query, as it suffices to match the query against cluster representatives first. However, such an approach offers only a technical improvement in searching for relevant documents. A more radical improvement can be gained by using so-called document maps [2], where a graphical representation additionally conveys information about the relationships among individual documents or groups of documents. Document maps are primarily oriented towards visualization of a certain similarity of a collection of documents, although other uses of such maps are possible; consult Chap. 5 in [2] for details.
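    The saving suggested by the Cluster Hypothesis can be illustrated with a minimal sketch. The toy vectors, the cluster names and the helper `search` are hypothetical, and cosine similarity with centroid representatives is assumed as the matching scheme; the query is first compared with one representative per cluster, and only the members of the best-matching cluster are then ranked.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean, used here as the cluster representative (an assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def search(query, clusters):
    """Return (best cluster name, ranked document ids, number of comparisons made)."""
    reps = {name: centroid(list(docs.values())) for name, docs in clusters.items()}
    # Step 1: match the query against the cluster representatives only.
    best = max(reps, key=lambda name: cosine(query, reps[name]))
    # Step 2: rank only the documents of the winning cluster.
    members = clusters[best]
    ranked = sorted(members, key=lambda d: cosine(query, members[d]), reverse=True)
    return best, ranked, len(reps) + len(members)

# Hypothetical toy collection: two clusters over a three-term vocabulary.
clusters = {
    "sports": {"d1": [3, 0, 0], "d2": [2, 1, 0], "d3": [4, 0, 1]},
    "finance": {"d4": [0, 3, 2], "d5": [0, 1, 4], "d6": [1, 2, 3]},
}
best, ranked, comparisons = search([0, 1, 3], clusters)
print(best, ranked, comparisons)
```

    In this toy case five similarity computations are made instead of six; the gap widens as the number of documents per cluster grows.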

    The most prominent representative of this direction is the WEBSOM project¹. Here the self-organizing map, or SOM, algorithm [19] is used to organize miscellaneous text documents onto a two-dimensional grid so that related documents appear close to each other. Each grid unit contains a set of closely related documents. The color intensity reflects dissimilarity among neighboring units: the lighter the shade, the more similar the neighboring units are. Unfortunately, this approach is time and space consuming, and raises questions of scaling and updating of document maps (although some improvements are reported in [20]). To overcome some of these problems the DocMINER system was proposed in [2]. It combines a number of methods from explorative data analysis to effectively support information access for knowledge management tasks. In particular, a given collection of documents, represented as vectors in a high-dimensional vector space, is moved by a multidimensional scaling algorithm to a so-called semantic document space in which document similarities are reinforced. Then the topological structure of the semantic document space is mapped to a two-dimensional grid using the SOM algorithm.

    Still, the profound problem of map-like representation of document collections is the issue of scalability, which is strongly related to high dimensionality. While multidimensional scaling and other specialized techniques, like PCA, versions of SVD, etc., reduce the dimensionality of the space formally, they may result in increased complexity of document representation (a document which had a low number of non-zero coordinates in the high-dimensional space has more non-zero coordinates in the reduced space). So some other way of dimensionality reduction, via feature selection and not feature construction, should be pursued.
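    The loss of sparsity can be seen in a small sketch. The matrix below is an arbitrary dense matrix standing in for the basis that a method such as SVD or PCA would produce (an assumption for illustration, not a computed decomposition), and the helper names are hypothetical: a document with two non-zero coordinates out of eight ends up with all three reduced coordinates non-zero.

```python
def project(doc, basis):
    """Map a document vector onto the rows of `basis` (one dot product per row)."""
    return [sum(b * x for b, x in zip(row, doc)) for row in basis]

def nonzeros(vec, eps=1e-12):
    """Count coordinates whose magnitude exceeds eps."""
    return sum(1 for x in vec if abs(x) > eps)

# Hypothetical dense 3 x 8 basis (stand-in for leading singular vectors).
basis = [
    [0.4, 0.3, 0.5, 0.2, 0.4, 0.3, 0.3, 0.3],
    [0.5, -0.4, 0.1, -0.3, 0.2, -0.5, 0.4, -0.2],
    [-0.2, 0.5, -0.4, 0.3, 0.5, -0.1, 0.2, -0.4],
]
doc = [0, 0, 2, 0, 0, 0, 1, 0]  # sparse: only 2 of 8 terms occur

reduced = project(doc, basis)
print(nonzeros(doc), "of", len(doc), "->", nonzeros(reduced), "of", len(reduced))
```

    Feature selection, by contrast, would simply drop coordinates, so a document could never gain non-zero entries.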

    Note that a map of a document collection is a new kind of clustering, where not only are the documents split into groups, but there also exists a structural relationship between clusters, reflected by the topology of the map. We can say that we are dealing with cluster networking. This affects the closely related issue of evaluating the quality of the obtained clusters. Usually the quality evaluation function is a driving factor behind the clustering algorithm and hence partially determines its complexity and success. While the conventional external and internal cluster evaluation criteria (like class purity, class uniformity, inter-class dissimilarity) are abundant, they are primarily devised to evaluate sets of independent (not linked) clusters; there exist no satisfactory evaluation criteria for cluster network quality. Besides SOM, there are other clustering methods, like growing neural gas (GNG) [11] or artificial immune systems (AIS) [9, 25], that face similar problems.

    ¹ Details and a full bibliography concerning WEBSOM can be found at the web page http://websom.hut.fi/websom/.

    In our research project BEATCA [18], oriented towards exploration and navigation in large collections of documents, a fully-fledged search engine capable of representing on-line replies to queries in graphical form on a document map has been designed and constructed [16]. A number of machine-learning techniques, such as a fast algorithm for Bayesian network construction [18], SVD analysis, growing neural gas (GNG) [11], the SOM algorithm, etc., have been employed to realize the project. BEATCA extends the main goals of WEBSOM with a multilingual approach and new forms of geometrical representation (besides rectangular maps, projections onto sphere and torus surfaces are possible).

    The process of document map creation is rather complicated and consists of the following main stages: (1) document crawling, (2) indexing, (3) topic identification, (4) document grouping, (5) group-to-map transformation, (6) map region identification, (7) group and region labeling, and finally, (8) visualization. At each of these stages various decisions have to be made, implying different views of the document map.

    Within such a framework, in this chapter we propose a new solution to the problem of scalability and of evaluation of the quality of the cluster network. In particular, the contribution of this chapter concerns: (1) invention of a new artificial immune algorithm for handling large-scale document collections, to replace the traditional SOM in document map formation, (2) invention of a new representation of the document space, in which, instead of single-point statistics of terms, their distributions (histograms) are exploited, (3) invention of a measure of quality of networked clustering of document collections, which is based on the above-mentioned histograms, and which evaluates the quality of both the clustering of documents into groups and the usefulness of the inter-group links. These new features are of particular value within our framework of contextual document space representation, described in earlier publications, allowing for a more radical intrinsic dimensionality reduction and permitting efficient and predominantly local processing of documents.

    In Sect. 2 we present our hierarchical, topic-sensitive approach, which appears to be a robust solution to the problem of scalability of the map generation process (both in terms of time complexity and space requirements). It relies upon extraction of a hierarchy of concepts, i.e. almost homogenous groups² of documents. Any homogenous group is called here a "context", in which further document processing steps, like computation of term-frequency related measures, keyword extraction, and dimensionality reduction, are carried out, so that each context is described by a unique set of terms. To represent the content of each context a modified version of the aiNet algorithm [10] was employed; see Sect. 3. This algorithm was chosen because of its ability to represent internal patterns existing in a training set. More precisely, aiNet produces a compressed data representation for the vectors through a process resembling data edition. Next, this reduced representation is clustered; the original aiNet algorithm uses hierarchical clustering [10], while we propose an original and much more efficient procedure.

    ² By a homogenous group we understand hereafter a set of documents belonging to a single cluster after a clustering process.

    Further, the method of representing documents and groups of documents in the vector space was enriched: instead of the traditional single-point measure we apply the histograms of term occurrence distributions in some conceptual space, so that the document content patterns can be matched in a more refined way; see Sect. 4 for details.

    To evaluate the effectiveness of the novel text clustering procedure, it has been compared to the aiNet and SOM algorithms in Sect. 5. In the experimental Sects. 5.6–5.8 we have also investigated issues such as evaluation of immune network structure and the influence of the chosen antibody/antigen representation on the resulting immune memory model. Final conclusions are given in Sect. 7.

    1.1 Document Maps

    Before going into details, let us devote a little attention to the concept of a document map as such. Formally, a document map can be understood as a two-dimensional rectangle (or any other geometrical figure) split into disjoint areas, usually squares or hexagons³, called "cells". To each cell a set of documents is assigned, thus a single cell may be viewed as a kind of document cluster. The cells are frequently clustered into so-called "regions" on the grounds of similarity of their content. The cells (and regions) are labeled by the keywords best describing the cell/region content, where "best describing" is intended to mean characterizing the entire cell/region while distinguishing it from the surrounding cells/regions. A document map is visualized in such a way that cell colors (or textures) represent the number of documents each cell contains, or the degree of similarity to the surrounding cells, the importance of documents (e.g. PageRank), the count of documents retrieved in the most recent query, or any other feature significant from the point of view of the user. The labels of some cells/regions are also displayed, but at a density that does not impair overall readability. Optionally, labels may be displayed in "mouse-over" fashion.

    ³ For non-Euclidean geometries other possibilities exist; cf. [18].

    2 Contextual Local Networks

    In our approach, as in many traditional IR systems, documents are mapped into a $T$-dimensional term vector space. The points (documents) in this space are of the form $(w_{1,d}, \ldots, w_{T,d})$, where $T$ stands for the number of terms, and each $w_{t,d}$ is a weight for term $t$ in document $d$, the so-called term frequency/inverse document frequency (tfidf) weight:

    $$w_{t,d} = w(t, d) = f_{td} \cdot \log\left(\frac{N}{f_t}\right), \tag{1}$$

    where $f_{td}$ is the number of occurrences of term $t$ in document $d$, $f_t$ is the number of documents containing term $t$, and $N$ is the total number of documents.
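    As a minimal sketch of Eq. (1), the weights can be computed directly from raw term counts, taking the inverse-document-frequency factor as log(N / f_t) in line with the definitions above. The toy collection and the function name `tfidf_weights` are hypothetical, and the natural logarithm is assumed, since the text does not state the base.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with w = f_td * log(N / f_t)."""
    n_docs = len(docs)                                # N: total number of documents
    counts = [Counter(doc) for doc in docs]           # f_td: occurrences of t in d
    doc_freq = Counter(t for c in counts for t in c)  # f_t: documents containing t
    return [
        {t: f_td * math.log(n_docs / doc_freq[t]) for t, f_td in c.items()}
        for c in counts
    ]

# Hypothetical toy collection of already tokenized documents.
docs = [
    ["immune", "network", "immune"],
    ["document", "map", "network"],
    ["document", "cluster", "network"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

    A term that occurs in every document, such as `network` here, receives weight zero, so it contributes nothing to distinguishing one document from another.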

    The vector space model has been criticized for some disadvantages, polysemy and synonymy among others [3]. To overcome these disadvantages a contextual approach has been proposed [18], relying upon dividing the set of documents into a number of homogenous and disjoint subgroups (clusters). During the dimensionality reduction process, each of the clusters, also called contexts (for reasons that will become clear later), will be described by a unique subset of terms.

    In the sequel we will distinguish between the hierarchical and the contextual model of document treatment. In the former, the dimensionality reduction process is run globally, for the whole collection of documents, so that the terms used for document description are identical for each subgroup of documents, and the computation of tfidf weights, defined in equation (1), is based on the whole document collection. In the latter model, the dimensionality reduction process is run separately for each subgroup, so that each subgroup may be described by a different subset of terms weighted in accordance with equation (4). Finally, whenever we do not carry out a clustering of documents and we construct a single, "flat" representation for the entire collection, we will speak about a global model.⁴

    The contextual approach consists of two main stages. In the first stage a hierarchical model is built, i.e. a collection D of documents is recursively divided, using the Fuzzy ISODATA algorithm [4], into homogenous groups consisting of approximately equal numbers of elements. Such a procedure results in a hierarchy represented by a tree of clusters. The process of partitioning halts when the number of documents inside each group meets predefined criteria⁵. To compute the distance dist(d, v) of a document d from a cluster centroid v, the cosine distance was used:

    $$\mathrm{dist}(d, v) = 1 - \cos(d, v) = 1 - \left\langle (d/\|d\|)^{T}, (v/\|v\|) \right\rangle \tag{2}$$

    where the symbol $\langle \cdot, \cdot \rangle$ stands for the dot product of two vectors. Given $m_{dG}$, the degree of membership of a document $d$ in a group $G$ (obtained via the Fuzzy ISODATA algorithm), the document is assigned to the group with the highest value of $m_{dG}$.

    ⁴ The principal difference between the hierarchical and the global models