
Aboul-Ella Hassanien, Ajith Abraham and Janusz Kacprzyk (Eds.)

Computational Intelligence in Multimedia Processing: Recent Advances


Studies in Computational Intelligence, Volume 96

Editor-in-chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com

Vol. 71. Norio Baba, Lakhmi C. Jain and Hisashi Handa (Eds.): Advanced Intelligent Paradigms in Computer Games, 2007. ISBN 978-3-540-72704-0

Vol. 72. Raymond S.T. Lee and Vincenzo Loia (Eds.): Computational Intelligence for Agent-based Systems, 2007. ISBN 978-3-540-73175-7

Vol. 73. Petra Perner (Ed.): Case-Based Reasoning on Images and Signals, 2008. ISBN 978-3-540-73178-8

Vol. 74. Robert Schaefer: Foundation of Global Genetic Optimization, 2007. ISBN 978-3-540-73191-7

Vol. 75. Crina Grosan, Ajith Abraham and Hisao Ishibuchi (Eds.): Hybrid Evolutionary Algorithms, 2007. ISBN 978-3-540-73296-9

Vol. 76. Subhas Chandra Mukhopadhyay and Gourab Sen Gupta (Eds.): Autonomous Robots and Agents, 2007. ISBN 978-3-540-73423-9

Vol. 77. Barbara Hammer and Pascal Hitzler (Eds.): Perspectives of Neural-Symbolic Integration, 2007. ISBN 978-3-540-73953-1

Vol. 78. Costin Badica and Marcin Paprzycki (Eds.): Intelligent and Distributed Computing, 2008. ISBN 978-3-540-74929-5

Vol. 79. Xing Cai and T.-C. Jim Yeh (Eds.): Quantitative Information Fusion for Hydrological Sciences, 2008. ISBN 978-3-540-75383-4

Vol. 80. Joachim Diederich: Rule Extraction from Support Vector Machines, 2008. ISBN 978-3-540-75389-6

Vol. 81. K. Sridharan: Robotic Exploration and Landmark Determination, 2008. ISBN 978-3-540-75393-3

Vol. 82. Ajith Abraham, Crina Grosan and Witold Pedrycz (Eds.): Engineering Evolutionary Intelligent Systems, 2008. ISBN 978-3-540-75395-7

Vol. 83. Bhanu Prasad and S.R.M. Prasanna (Eds.): Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, 2008. ISBN 978-3-540-75397-1

Vol. 84. Marek R. Ogiela and Ryszard Tadeusiewicz: Modern Computational Intelligence Methods for the Interpretation of Medical Images, 2008. ISBN 978-3-540-75399-5

Vol. 85. Arpad Kelemen, Ajith Abraham and Yulan Liang (Eds.): Computational Intelligence in Medical Informatics, 2008. ISBN 978-3-540-75766-5

Vol. 86. Zbigniew Les and Magdalena Les: Shape Understanding Systems, 2008. ISBN 978-3-540-75768-9

Vol. 87. Yuri Avramenko and Andrzej Kraslawski: Case Based Design, 2008. ISBN 978-3-540-75705-4

Vol. 88. Tina Yu, David Davis, Cem Baydar and Rajkumar Roy (Eds.): Evolutionary Computation in Practice, 2008. ISBN 978-3-540-75770-2

Vol. 89. Ito Takayuki, Hattori Hiromitsu, Zhang Minjie and Matsuo Tokuro (Eds.): Rational, Robust, Secure, 2008. ISBN 978-3-540-76281-2

Vol. 90. Simone Marinai and Hiromichi Fujisawa (Eds.): Machine Learning in Document Analysis and Recognition, 2008. ISBN 978-3-540-76279-9

Vol. 91. Horst Bunke, Kandel Abraham and Last Mark (Eds.): Applied Pattern Recognition, 2008. ISBN 978-3-540-76830-2

Vol. 92. Ang Yang, Yin Shan and Lam Thu Bui (Eds.): Success in Evolutionary Computation, 2008. ISBN 978-3-540-76285-0

Vol. 93. Manolis Wallace, Marios Angelides and Phivos Mylonas (Eds.): Advances in Semantic Media Adaptation and Personalization, 2008. ISBN 978-3-540-76359-8

Vol. 94. Arpad Kelemen, Ajith Abraham and Yuehui Chen (Eds.): Computational Intelligence in Bioinformatics, 2008. ISBN 978-3-540-76802-9

Vol. 95. Radu Dogaru: Systematic Design for Emergence in Cellular Nonlinear Networks, 2008. ISBN 978-3-540-76800-5

Vol. 96. Aboul-Ella Hassanien, Ajith Abraham and Janusz Kacprzyk (Eds.): Computational Intelligence in Multimedia Processing: Recent Advances, 2008. ISBN 978-3-540-76826-5


Aboul-Ella Hassanien
Ajith Abraham
Janusz Kacprzyk
(Eds.)

Computational Intelligence in Multimedia Processing: Recent Advances

With 196 Figures and 29 Tables



Aboul-Ella Hassanien
Quantitative and Information System Department
Kuwait University, College of Business Administration
P.O. Box 5486, Safat Code No. 13055
Kuwait

[email protected]

Ajith Abraham
Centre for Quantifiable Quality of Service in Communication Systems (Q2S), Centre of Excellence
Norwegian University of Science and Technology
O.S. Bragstads plass 2E, 7491 Trondheim
Norway

[email protected]

Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6, 01-447 Warsaw
Poland

[email protected]

ISBN 978-3-540-76826-5 e-ISBN 978-3-540-76827-2

Studies in Computational Intelligence ISSN 1860-949X

Library of Congress Control Number: 2007940846

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: Deblik, Berlin, Germany

Printed on acid-free paper



Preface

Multimedia uses multiple forms of information content and processing, mainly text, audio, graphics, animation, and video, to communicate and to cater to various user demands. Today, multimedia presentations are used in movies, education, entertainment, marketing, advertising, information services, teleconferencing, publishing, interactive television, product demonstration, and the like. Because of the rapid transfer of information and a growing need to present this information powerfully, only individuals with the appropriate skills and knowledge to communicate effectively will succeed in the multimedia industry.

In the last few years, multimedia processing has emerged as an important technology to generate content based on images, audio, graphics, animation, full-motion video, and text, and it has opened a wide range of applications by combining these different information sources, thus giving insight into the interpretation of multimedia content. Furthermore, recent developments such as high-definition multimedia content and interactive television can lead to the generation of huge volumes of data and imply serious computing problems connected with the creation, processing, and management of multimedia content. Multimedia processing is a challenging domain for several reasons: it requires both high computational processing power and memory bandwidth; it is a multi-rate computing problem; and it requires low-cost implementations for high-volume markets.

Computational intelligence is one of the most exciting and rapidly expanding fields, attracting a large number of scholars, researchers, engineers and practitioners working in such areas as rough sets, neural networks, fuzzy logic, evolutionary computing, artificial immune systems, and swarm intelligence. Computational intelligence has been a tremendously active area of research for the past decade or so. There are many successful applications of computational intelligence in many subfields of multimedia, including image processing and retrieval, audio processing, and text processing. However, there are still numerous open problems in multimedia processing, exemplified by multimedia communication, multimedia computing and computer animation,


that desperately need advanced and efficient computational methodologies to deal with the huge volumes of data generated by these problems.

This volume provides an up-to-date and state-of-the-art coverage of diverse aspects related to computational intelligence in multimedia processing. It addresses the use of different computational intelligence-based approaches to various problems in multimedia computing, networking and communications, such as video processing, virtual reality, movies, audio processing, information graphics in multimodal documents, multimedia task scheduling, modeling interactive nonlinear stories, video authentication, text localization in images, organizing multimedia information, and visual sensor networks. The volume comprises 19 chapters, including an overview chapter providing an up-to-date, state-of-the-art review of the current literature on computational intelligence-based approaches to various problems in multimedia computing and communication, and some important research challenges.

The book is divided into five parts, devoted to: foundations of computational intelligence in multimedia processing; computational intelligence in 3D multimedia virtual environments and video games; computational intelligence in image/audio processing; computational intelligence in multimedia networks and task scheduling; and computational intelligence in video processing.

The part on foundations of computational intelligence in multimedia processing contains two introductory chapters. It presents a broad overview of computational intelligence (CI) techniques, including Neural Networks (NN), Particle Swarm Optimization (PSO), Genetic Algorithms (GA), Fuzzy Sets (FS), Reinforcement Learning (RL) and Rough Sets (RS). In addition, a very brief introduction to near sets and near images, which offer a generalization of traditional rough set theory and a new approach to classifying perceptual objects by means of features in solving multimedia problems, is presented. A review of the current literature on CI-based approaches to various problems in multimedia computing, networking and communications is given. Challenges to be addressed and future directions of research are also presented.

Chapter 1, by Aboul-Ella Hassanien, Ajith Abraham, Janusz Kacprzyk, and James F. Peters, presents a review of the current literature on computational intelligence-based approaches to various problems in multimedia computing, such as speech, audio and image processing, video watermarking, and content-based multimedia indexing and retrieval. The chapter also discusses some representative methods to provide inspiring examples to illustrate how CI could be applied to resolve multimedia computing problems and how multimedia could be analyzed, processed, and characterized by computational intelligence.

Chapter 2, by Parthasarathy Guturu, presents a review of the current literature on computational intelligence-based approaches to various problems in multimedia networking and communications, such as call admission control, management of resources and traffic, routing, multicasting, media


composition, encoding, media streaming and synchronization, and on-demand servers and services.

The part on computational intelligence in 3D multimedia virtual environments and video games contains four chapters. It discusses the application of computational intelligence techniques in the area of virtual environments (in which humans can interact with a virtual 3D scene and navigate through a virtual environment) and music information retrieval approaches. Dynamic models are also employed to obtain a more formal design process for (story-driven) games and to improve current approaches to interactive storytelling.

In Chap. 3, Ronald Genswaider, Helmut Berger, Michael Dittenbach, Andreas Pesenhofer, Dieter Merkl, Andreas Rauber, and Thomas Lidy introduce the MediaSquare, a synthetic 3D multimedia environment that allows multiple users to collectively explore multimedia data and interact with each other. The data is organized within the 3D virtual world either based on content similarity or by mapping a given structure (e.g., a branch of a file system hierarchy) into a room structure. With this system it is possible to take advantage of spatial metaphors such as relations between items in space, proximity and action, common reference and orientation, as well as reciprocity.

In Chap. 4, Tauseef Gulrez, Manolya Kavakli, and Alessandro Tognetti develop a testbed for robot-mediated neurorehabilitation therapy that combines robotics, computationally intelligent virtual reality, and haptic interfaces. They employ theories of neuroscience and rehabilitation to develop methods for the treatment of neurological injuries such as stroke, spinal cord injury, and traumatic brain injury. As sensor input they use two state-of-the-art technologies, representing two different approaches to solving the mobility-loss problem. In their experiment, a shirt laden with 52 piezo-resistive sensors was used as an input device to capture the residual signals arising from the patient's body.

In Chap. 5, Fabio Zambetta builds the case for a story-driven approach to the design of a computer role-playing game, using a mathematical model of political balance and conflict and scripting based on fuzzy logic. The model introduced differs from a standard HCP (hybrid control process) in its use of fuzzy logic (or fuzzy-state machines) to handle events, while an ordinary differential equation generates a continuous level of conflict over time. With this approach, game designers can not only express gameplay properties formally in a quasi-natural language, but also offer a diverse role-playing experience to their players. The interactive game stories designed with this methodology can change under the pressure of a variable political balance, yielding a different and innovative gameplay style.

Time flow is the distinctive structure of various kinds of data, such as multimedia movies, electrocardiograms, and stock price quotes. To make good use of these data, locating a desired instant or interval along the time axis is indispensable. In addition to domain-specific methods like automatic TV program segmentation, there should be a common means to search these data according to


the changes along the time flow. Chapter 6, by Ken Nakayama et al., presents the I-string and I-regular expression framework, with some examples and a matching algorithm. An I-string is a symbolic, string-like annotation model for continuous media with a virtual continuous branchless time flow. An I-regular expression is a pattern language over I-strings that extends conventional regular expressions for text search. Although continuous media are often treated as sequences of time-sliced data in practice, the framework adopts a continuous time flow. This abstraction allows annotations and search queries to be independent of low-level implementation details such as frame rate.

Computational intelligence in image/audio processing is the third part of the book. It contains six chapters discussing the application of computational intelligence techniques in image and audio processing.

In Chap. 7, J.C. Barca, G. Rumantir, and R. Li present a set of illuminated contour-based markers for optical motion capture, along with a modified K-means algorithm for removing inter-frame noise. The new markers appear to have features that solve or reduce several of the drawbacks associated with other marker systems currently available for optical motion capture, and they provide solutions to central problems with the current standard spherical flashing-LED-based markers. The modified K-means algorithm for removing noise in optical motion capture data is guided by constraints on the compactness and number of data points per cluster. Experiments on the presented algorithm and findings in the literature indicate that it outperforms standard filters such as the mean and median filters, because it is capable of completely removing noise with both spike and Gaussian characteristics.

In Chap. 8, Sandra Carberry and Stephanie Elzer present a corpus study that shows the importance of taking information graphics into account when processing a multimodal document. They then present a Bayesian network approach to identifying the message conveyed by one kind of information graphic, simple bar charts, along with an evaluation of the graph understanding system.

In Chap. 9, Klaas Bosteels and Etienne E. Kerre present a recently introduced triparametric family of fuzzy similarity measures, together with several constraints on its parameters that warrant certain potentially desirable or useful properties. In particular, they present constraints for several forms of restrictability, which allow reducing the computation time in practical applications. They use some members of this family to construct various audio similarity measures based on spectrum histograms and fluctuation patterns.

Chapter 10, by Przemysław Gorecki, Laura Caponetti, and Ciro Castiello, deals with the particular problem of text localization, which aims at determining the exact location of text inside a document image. The strict connection between text localization and image segmentation is highlighted in the chapter, and a review of methods for image segmentation is proposed. In particular, the benefits of employing fuzzy and neuro-fuzzy techniques in this field are assessed, thus indicating


a way to combine computational intelligence methods and document image analysis. Three specific methods based on image segmentation are presented to show different applications of fuzzy and neuro-fuzzy techniques in the context of text localization.

In Chap. 11, Kui Wu and Kim-Hui Yap present a soft-labeling framework that addresses the small sample problem in interactive CBIR systems. The technique incorporates soft-labeled images into a fuzzy support vector machine (FSVM), along with explicitly labeled images, for effective learning and retrieval. By exploiting the characteristics of the labeled images, soft-labeled images are selected through an unsupervised clustering algorithm, and their relevance is estimated using a fuzzy membership function. FSVM-based active learning is then performed on the hybrid of soft-labeled and explicitly labeled images. Experimental results based on a database of 10,000 images demonstrate the effectiveness of the proposed method.

Temporal textures are textures with motion, like real-world image sequences of sea waves, smoke, etc., that possess some stationary properties over space and time. The motion of a flock of flying birds, water streams, fluttering leaves, and waving flags also illustrates such textures. The characterization of temporal textures is of vital importance to computer vision, electronic entertainment, and content-based video coding research, with a number of potential applications in areas including recognition (automated surveillance and industrial monitoring), synthesis (animation and computer games), and segmentation (robot navigation and MPEG-4). Chapter 12, by Ashfaqur Rahman and Manzur Murshed, provides a comprehensive survey of existing temporal texture characterization techniques.

The fourth part, on computational intelligence in multimedia networks and task scheduling, contains four chapters. These describe several approaches to developing video analysis and segmentation systems based on visual sensor networks using computational intelligence, and discuss the detection of hotspots in cockpits in view of the Swissair 111 and ValuJet 592 flight disasters, answering the question of how distributed sensor networks could help in near real-time event detection, disambiguating faults and events by using artificial intelligence techniques. In addition, it contains a chapter reviewing the current literature on computational intelligence-based approaches to various problems in multimedia networking and communications.

In Chap. 13, Mitsuo Gen and Myungryun Yoo discuss the task scheduling problem, introducing several scheduling algorithms for soft real-time tasks that use a genetic algorithm (GA). They propose reasonable solutions to this NP-hard scheduling problem with much less difficulty than traditional mathematical methods. In addition, continuous task scheduling, real-time task scheduling on homogeneous systems, and real-time task scheduling on heterogeneous systems are discussed in this chapter.

Chapter 14, by Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. Garcia, and Jose M. Molina, describes several approaches to developing video


analysis and segmentation systems based on visual sensor networks using computational intelligence. They discuss how computational intelligence paradigms can help obtain competitive solutions. Knowledge about the domain is used in the form of fuzzy rules for data association and heuristic evaluation functions to optimize the design and guide the search for appropriate decisions.

In Chap. 15, Sławomir T. Wierzchon, Krzysztof Ciesielski, and Mieczysław A. Kłopotek focus on some problems concerning the application of an immune-based algorithm to the extraction and visualization of cluster structure. The chapter presents a novel approach, based on artificial immune systems, within the broad stream of map-type clustering methods. Such an approach leads to many interesting research issues, such as context-dependent dictionary reduction and keyword identification, topic-sensitive document summarization, subjective model visualization based on a particular user's information requirements, dynamic adaptation of the document representation, and local similarity measure computation.

In Chap. 16, S. Srivathsan, N. Balakrishnan, and S.S. Iyengar discuss safety issues in commercial planes, particularly focusing on hazards in the cockpit area. The chapter discusses a few methodologies to detect critical features and provide unambiguous information about the possible sources of hazards to the end user in near real time. They explore the application of Bayesian probability, the Iyengar–Krishnamachari method, probabilistic reasoning, reasoning under uncertainty, and the Dempster–Shafer theory, and analyze how these theories could help in analyzing the data gathered from wireless sensor networks deployed in the cockpit area.

The final part of the book deals with the use of computational intelligence in video processing and contains three chapters.

In Chap. 17, Nicholas Vretos, Vassilios Solachidis, and Ioannis Pitas provide a uniform framework by which media analysis can be rendered more useful for retrieval applications as well as for human-computer interaction-based applications. All the algorithms presented in the chapter focus on humans and thus provide interesting features for an anthropocentric analysis of a movie.

In Chap. 18, Thomas Bärecke, Ewa Kijak, Marcin Detyniecki, and Andreas Nürnberger present an innovative way of automatically organizing multimedia information to facilitate content-based browsing, based on self-organizing maps. The visualization capabilities of the self-organizing map provide an intuitive way of representing the distribution of the data as well as object similarities. The main idea is to visualize similar documents spatially close to each other, with larger distances between dissimilar documents. They also introduce a novel time-bar visualization that re-projects the temporal information.

In Chap. 19, Mayank Vatsa, Richa Singh, Sanjay K. Singh, and Saurabh Upadhyay present an efficient intelligent video authentication algorithm using a support vector machine. The proposed algorithm can detect multiple video tampering attacks. It computes the local relative correlation information and


classifies the video as tampered or non-tampered. The algorithm computes the relative correlation information between all adjacent frames of a video and projects it onto a nonlinear SVM hyperplane to determine whether the video is tampered. It is validated on an extensive video database containing 795 tampered and non-tampered videos, and the results show that the proposed algorithm yields a classification accuracy of 99.2%.

Acknowledgements

We are very grateful to the authors of this volume and to the reviewers for their extraordinary service in critically reviewing the chapters. Most of the authors of the chapters included in this book also served as referees for chapters written by other authors. Thanks go to all those who provided constructive and comprehensive reviews. The editors thank Dr. Thomas Ditzinger of Springer-Verlag, Germany, for his editorial assistance and excellent cooperation in producing this important scientific work. We hope that the reader will share our excitement about this volume on Computational Intelligence in Multimedia Processing: Recent Advances and will find it useful.

Aboul-Ella Hassanien
Ajith Abraham
Janusz Kacprzyk


Contents

Part I Foundation

Computational Intelligence in Multimedia Processing: Foundation and Trends
Aboul-Ella Hassanien, Ajith Abraham, Janusz Kacprzyk, and James F. Peters (page 3)

Computational Intelligence in Multimedia Networking and Communications: Trends and Future Directions
Parthasarathy Guturu (page 51)

Part II Computational Intelligence in 3D Multimedia Virtual Environment and Video Games

A Synthetic 3D Multimedia Environment
Ronald Genswaider, Helmut Berger, Michael Dittenbach, Andreas Pesenhofer, Dieter Merkl, Andreas Rauber, and Thomas Lidy (page 79)

Robotics and Virtual Reality: A Marriage of Two Diverse Streams of Science
Tauseef Gulrez, Manolya Kavakli, and Alessandro Tognetti (page 99)

Modelling Interactive Non-Linear Stories
Fabio Zambetta (page 119)

A Time Interval String Model for Annotating and Searching Linear Continuous Media
Ken Nakayama, Kazunori Yamaguchi, Theodorus Eric Setiadi, Yoshitake Kobayashi, Mamoru Maekawa, Yoshihisa Nitta, and Akihiko Ohsuga (page 139)

Part III Computational Intelligence in Image/Audio Processing

Noise Filtering of New Motion Capture Markers Using Modified K-Means
J.C. Barca, G. Rumantir, and R. Li (page 167)

Toward Effective Processing of Information Graphics in Multimodal Documents: A Bayesian Network Approach
Sandra Carberry and Stephanie Elzer (page 191)

Fuzzy Audio Similarity Measures Based on Spectrum Histograms and Fluctuation Patterns
Klaas Bosteels and Etienne E. Kerre (page 213)

Fuzzy Techniques for Text Localisation in Images
Przemysław Gorecki, Laura Caponetti, and Ciro Castiello (page 233)

Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine
Kui Wu and Kim-Hui Yap (page 271)

Temporal Texture Characterization: A Review
Ashfaqur Rahman and Manzur Murshed (page 291)

Part IV Computational Intelligence in Multimedia Networks and Task Scheduling

Real Time Tasks Scheduling Using Hybrid Genetic Algorithm
Mitsuo Gen and Myungryun Yoo (page 319)

Computational Intelligence in Visual Sensor Networks: Improving Video Processing Systems
Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. García, and Jose M. Molina (page 351)

Scalability and Evaluation of Contextual Immune Model for Web Mining
Sławomir T. Wierzchon, Krzysztof Ciesielski, and Mieczysław A. Kłopotek (page 379)

Critical Feature Detection in Cockpits – Application of AI in Sensor Networks
S. Srivathsan, N. Balakrishnan, and S.S. Iyengar (page 409)

Part V Computational Intelligence in Video Processing

Anthropocentric Semantic Information Extraction from Movies
Nicholas Vretos, Vassilios Solachidis, and Ioannis Pitas (page 437)

Organizing Multimedia Information with Maps
Thomas Bärecke, Ewa Kijak, Marcin Detyniecki, and Andreas Nürnberger (page 493)

Video Authentication Using Relative Correlation Information and SVM
Mayank Vatsa, Richa Singh, Sanjay K. Singh, and Saurabh Upadhyay (page 511)

Index (page 531)


Part I

Foundation


Computational Intelligence in Multimedia Processing: Foundation and Trends

Aboul-Ella Hassanien (1,2), Ajith Abraham (3), Janusz Kacprzyk (4), and James F. Peters (5)

1 Information Technology Department, FCI, Cairo University, 5 Ahamed Zewal Street, Orman, Giza, Egypt. [email protected]

2 Information System Department, CBA, Kuwait University, Kuwait. [email protected]

3 Center for Quantifiable Quality of Service in Communication Systems, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, N-7491 Trondheim, Norway. [email protected], [email protected]

4 Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland. [email protected]

5 Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada. [email protected]

Summary. This chapter presents a broad overview of Computational Intelligence (CI) techniques, including Neural Networks (NN), Particle Swarm Optimization (PSO), Genetic Algorithms (GA), Fuzzy Sets (FS), and Rough Sets (RS). In addition, a very brief introduction to near sets and near images, which offer a generalization of traditional rough set theory and a new approach to classifying perceptual objects by means of features in solving multimedia problems, is presented. A review of the current literature on CI-based approaches to various problems in multimedia computing, such as speech, audio and image processing, video watermarking, and content-based multimedia indexing and retrieval, is presented. We discuss some representative methods to provide inspiring examples to illustrate how CI could be applied to resolve multimedia computing problems and how multimedia could be analyzed, processed, and characterized by computational intelligence. Challenges to be addressed and future directions of research are also presented.



1 Introduction

The last few decades have seen a new era of artificial intelligence focusing on the principles, theoretical aspects, and design methodology of algorithms gleaned from nature. Examples are artificial neural networks inspired by mammalian neural systems, evolutionary computation inspired by natural selection in biology, simulated annealing inspired by principles of thermodynamics, and swarm intelligence inspired by the collective behavior of insects or micro-organisms interacting locally with their environment, causing coherent functional global patterns to emerge. Computational intelligence is a well-established paradigm, where new theories with a sound biological understanding have been evolving. Current experimental systems have many of the characteristics of biological computers (brains, in other words) and are beginning to be built to perform a variety of tasks that are difficult or impossible to do with conventional computers. Defining computational intelligence is not an easy task [95]. In a nutshell, as becomes quite apparent in light of current research pursuits, the area is heterogeneous, dwelling on such technologies as neural networks, fuzzy systems, rough sets, evolutionary computation, swarm intelligence, probabilistic reasoning [13] and multi-agent systems. The recent trend is to integrate different components to take advantage of complementary features and to develop synergistic systems. Hybrid architectures like neuro-fuzzy systems, evolutionary-fuzzy systems, evolutionary-neural networks, evolutionary neuro-fuzzy systems, rough-neural, rough-fuzzy, etc., are widely applied for real-world problem solving.

Multimedia is any combination of multiple forms of media integrated together. In modern times, the advent of musical accompaniment to silent films was an early form of multimedia. Even the simplest ancient dance forms used multiple media types, sound and vision, to convey additional meaning. The currently accepted understanding of multimedia generally involves a variety of media, such as still images, video, sound, music and text, presented using a computer as the storage device, delivery controller and delivery medium. The various media types are usually stored as digital assets, and their delivery to the viewer is facilitated by some sort of authoring language. Multimedia technology is a rapidly developing branch of information technology; it has changed computing profoundly and will continue to accelerate developments in our daily lives. Even nowadays, most media types are designed to be perceived by only two senses, vision and hearing. Still, incredibly powerful messages can be communicated using just these two senses. A subset of multimedia is interactive multimedia, in which the delivery of the assets depends on decisions made by the viewer at the time of viewing. Some subject areas lend themselves to interactivity, such as self-paced learning and game play. Other areas are mostly not enhanced by interactivity: here we find the traditional film and storytelling genres, where we are expected to travel in a prescribed direction to perceive the message in a sequential fashion.


Current research in multimedia processing is shifting from coding (MPEG-1, 2, 4) to automatic recognition (MPEG-7). Its research domain covers techniques for object-based representation and coding; segmentation and tracking; pattern detection and recognition; multimodal signal fusion, conversion and synchronization; as well as content-based indexing and subject-based retrieval and browsing.

Multimedia processing is a very important scientific research domain with a broad range of applications. The development of new insights and applications results both from fundamental scientific research and from the development of new technologies. One of these emerging technologies is computational intelligence, which is a generic term for a specific collection of tools to model uncertainty, imprecision, evolutionary behavior and complex models. This chapter provides a comprehensive view of modern computational intelligence theory in the field of multimedia processing.

The objective of this chapter is to present to the computational intelligence and multimedia processing research communities the state of the art in computational intelligence applications to multimedia processing, and to motivate research in new trend-setting directions. Hence, in the following sections we review and discuss some representative methods to provide inspiring examples illustrating how CI techniques could be applied to resolve multimedia problems and how multimedia could be analyzed, processed, and characterized by computational intelligence. These representative examples include (1) computational intelligence for speech, audio, image and video processing, (2) CI in audio-visual recognition systems, (3) computational intelligence in multimedia watermarking, and (4) CI in multimedia content-based indexing and retrieval.

To provide useful insights for CI applications in multimedia processing, we structure the rest of this chapter into five sections. Section 2 introduces the fundamental aspects of the key components of modern computational intelligence, including neural networks, rough sets, fuzzy sets, the particle swarm optimization algorithm, evolutionary algorithms and near sets. Section 3 reviews some past literature on using computational intelligence in speech, audio, and image processing, as well as in speech emotion recognition and audio-visual recognition systems. A review of the current literature on computational intelligence based approaches to video processing problems such as video segmentation, as well as the adaptation of the c-means clustering algorithm to rough set theory for solving multimedia segmentation and clustering problems, is presented in Sect. 4. Section 5 reviews and discusses some successful work illustrating how CI could be applied to multimedia watermarking problems. Computational intelligence in content-based multimedia indexing and retrieval is reviewed in Sect. 6. Challenges and future trends are addressed and presented in Sect. 7.


2 Computational Intelligence: Foundations

In the following subsections, we present an overview of modern computational intelligence techniques and their advantages, including neural networks, fuzzy sets, particle swarm optimization, genetic algorithms, rough sets and near sets.

2.1 Artificial Neural Networks

Artificial neural networks have been developed as generalizations of mathematical models of biological nervous systems. In a simplified mathematical model of the neuron, the effects of the synapses are represented by connection weights that modulate the effect of the associated input signals, and the nonlinear characteristic exhibited by neurons is represented by a transfer function. A range of transfer functions have been developed to process the weighted and biased inputs; four basic transfer functions widely adopted for multimedia processing are illustrated in Fig. 1.

The neuron impulse is then computed as the weighted sum of the input signals, transformed by the transfer function. The learning capability of an artificial neuron is achieved by adjusting the weights in accordance with the chosen learning algorithm. Most applications of neural networks fall into the following categories:

• Prediction: Use input values to predict some output
• Classification: Use input values to determine the classification
• Data association: Like classification, but also recognizes data that contains errors
• Data conceptualization: Analyze the inputs so that grouping relationships can be inferred

Mathematical Modeling and Learning in Neural Networks

A typical multilayered neural network and an artificial neuron are illustrated in Fig. 2.

Fig. 1. Basic transfer functions
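As a concrete companion to Fig. 1, the sketch below (our own Python/NumPy code; the exact four functions in the figure are not recoverable from the text, so hard-limit, linear, logistic and Gaussian are assumed here as representative choices) implements four basic transfer functions:

```python
import numpy as np

def hard_limit(net):
    """Threshold unit: outputs 1 when the net input is non-negative, else 0."""
    return np.where(net >= 0.0, 1.0, 0.0)

def linear(net):
    """Identity (linear) transfer function."""
    return net

def logistic(net):
    """Logistic sigmoid: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-net))

def gaussian(net, sigma=1.0):
    """Radial (Gaussian) response, maximal at net = 0."""
    return np.exp(-(net ** 2) / (2.0 * sigma ** 2))
```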


Fig. 2. Typical multilayered neural network

Each neuron is characterized by an activity level (representing the state of polarization of a neuron), an output value (representing the firing rate of the neuron), a set of input connections (representing synapses on the cell and its dendrite), a bias value (representing an internal resting level of the neuron), and a set of output connections (representing a neuron's axonal projections). Each of these aspects of the unit is represented mathematically by real numbers. Thus each connection has an associated weight (synaptic strength), which determines the effect of the incoming input on the activation level of the unit. The weights may be positive or negative. Referring to Fig. 2, the signal flow from inputs {x1, . . . , xn} is considered to be unidirectional, as indicated by arrows, as is a neuron's output signal flow (O). The neuron output signal O is given by the following relationship:

$$O = f(\mathrm{net}) = f\left(\sum_{j=1}^{n} w_j x_j\right), \qquad (1)$$

where $w_j$ is the weight vector and the function $f(\mathrm{net})$ is referred to as an activation (transfer) function. The variable net is defined as a scalar product of the weight and input vectors,

$$\mathrm{net} = \mathbf{w}^T \mathbf{x} = w_1 x_1 + \cdots + w_n x_n, \qquad (2)$$

where $T$ denotes the transpose of a matrix. Typical Gaussian and logistic activation functions are plotted in Fig. 3.
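To make (1) and (2) concrete, here is a minimal sketch of a single-neuron forward pass (our own Python/NumPy code, not from the chapter; the logistic activation is an assumed choice for f):

```python
import numpy as np

def logistic(net):
    """Logistic activation, an assumed choice for f."""
    return 1.0 / (1.0 + np.exp(-net))

def neuron_output(w, x, f=logistic):
    """Single-neuron forward pass: net = w^T x as in (2), then f(net) as in (1)."""
    net = np.dot(w, x)  # scalar product of weight and input vectors
    return f(net)

# Example with three inputs and fixed weights:
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
print(neuron_output(w, x))  # logistic(-0.9), roughly 0.289
```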

Neural Network Architecture

The behavior of the neural network depends largely on the interaction between the different neurons. The basic architecture consists of three types of neuron layers: input, hidden and output layers. In feed-forward networks, the signal flow is from input to output units strictly in a feed-forward direction. The data processing can extend over multiple (layers of) units, but no feedback connections are present, that is, connections extending from outputs of


Fig. 3. Typical Gaussian and logistic activation function

units to inputs of units in the same layer or previous layers. Recurrent networks contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the network will evolve to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behavior constitutes the output of the network. There are several other neural network architectures (Elman networks, adaptive resonance theory maps, competitive networks, etc.), depending on the properties and requirements of the application. The reader may refer to [2] for an extensive overview of the different neural network architectures and learning algorithms. A neural network has to be configured such that the application of a set of inputs produces the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to train the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. The learning situations in neural networks may be classified into three distinct sorts: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, an input vector is presented at the inputs together with a set of desired responses, one


for each node, at the output layer. A forward pass is done, and the errors or discrepancies between the desired and actual response for each node in the output layer are found. These are then used to determine weight changes in the net according to the prevailing learning rule. The term 'supervised' originates from the fact that the desired signals on individual output nodes are provided by an external teacher. The best-known examples of this technique occur in the backpropagation algorithm, the delta rule and the perceptron rule. In unsupervised learning (or self-organization) an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli. Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

Major Neural Network Architectures and Learning Models

Via selection of transfer functions and connections between neurons, various neural networks can be constructed and trained to produce the specified outputs. Major neural networks commonly used for multimedia applications are classified as feed-forward neural networks, feedback (recurrent) networks, self-organizing maps and Adaptive Resonance Theory (ART) networks. The learning paradigms for neural networks in multimedia processing generally include supervised and unsupervised networks. In supervised training, the training data set consists of many pairs of source and target patterns. The network processes the source inputs, compares the resulting outputs against the target outputs, and adjusts its weights to improve the correct rate of the resulting outputs. In unsupervised networks, the training data set does not include any target information.

Feed-Forward Neural Network

A general feed-forward network often consists of multiple layers, typically including one input layer, a number of hidden layers, and an output layer. In feed-forward neural networks, the neurons in each layer are only fully interconnected with the neurons in the next layer, which means signals or information being processed travel along a single direction. A back-propagation (BP) network is a supervised feed-forward neural network; it is a simple


stochastic gradient descent method that minimizes the total squared error of the output computed by the neural network. Its errors propagate backwards from the output neurons to the inner neurons. The process of adjusting the weights between the layers and recalculating the output continues until a stopping criterion is satisfied. The Radial Basis Function (RBF) network is a three-layer supervised feed-forward network that uses a nonlinear transfer function (normally the Gaussian) for the hidden neurons and a linear transfer function for the output neurons. The Gaussian function is usually applied to the net input to produce a radial function of the distance between each pattern vector and each hidden-unit weight vector.
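As an illustration of the back-propagation procedure just described, the following sketch (our own Python/NumPy code, not from the chapter; a full-batch rather than stochastic variant is used for brevity, and the XOR task, hidden-layer size and learning rate are assumed choices) trains a one-hidden-layer network by propagating output errors backwards and applying gradient-descent weight updates:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: XOR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4))  # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))  # hidden -> output weights
b2 = np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)

    # Backward pass: errors propagate from the output layer to the hidden layer.
    delta_o = (o - y) * o * (1 - o)          # sigmoid'(net) = o(1 - o)
    delta_h = (delta_o @ W2.T) * h * (1 - h)

    # Gradient-descent updates of weights and biases.
    W2 -= lr * h.T @ delta_o
    b2 -= lr * delta_o.sum(axis=0)
    W1 -= lr * X.T @ delta_h
    b1 -= lr * delta_h.sum(axis=0)

print(np.round(o.ravel(), 2))  # should approach [0, 1, 1, 0]
```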

Recurrent Networks

Recurrent networks are the state of the art in nonlinear time series prediction, system identification, and temporal pattern classification. As the output of the network at time t is used along with a new input to compute the output of the network at time t + 1, the response of the network is dynamic. Time-Lag Recurrent Networks (TLRN) are multi-layered perceptrons extended with short-term memory structures that have local recurrent connections. The TLRN is a very appropriate model for processing temporal (time-varying) information. Examples of temporal problems include time series prediction, system identification and temporal pattern recognition. The training algorithm used with TLRNs (backpropagation through time) is more advanced than the standard backpropagation algorithm. The main advantage of TLRNs is the smaller network size required to learn temporal problems when compared to MLPs that use extra inputs to represent past samples (equivalent to time-delay neural networks). An added advantage of TLRNs is their low sensitivity to noise.

Self Organizing Feature Maps

Self Organizing Feature Maps (SOFM) are a data visualization technique proposed by Kohonen [3] which reduces the dimensions of data through the use of self-organizing neural networks. A SOFM learns the categorization, topology and distribution of input vectors. SOFMs allocate more neurons to recognize parts of the input space where many input vectors occur, and fewer neurons to parts of the input space where few input vectors occur. Neurons next to each other in the network learn to respond to similar vectors. SOFMs can learn to detect regularities and correlations in their input and adapt their future responses to that input accordingly. An important feature of the SOFM learning algorithm is that it allows neurons that are neighbors of the winning neuron to produce output values. Thus the transition between output vectors is much smoother than that obtained with competitive layers, where only one neuron has an output at a time. The problem that data visualization attempts to solve is that humans simply cannot visualize high-dimensional data. SOFMs reduce dimensions by producing a map of usually


one or two dimensions which plots the similarities of the data by grouping similar data items together (data clustering). In this process, SOFMs accomplish two things: they reduce dimensions and display similarities. It is important to note that while a self-organizing map does not take long to organize itself so that neighboring neurons recognize similar inputs, it can take a long time for the map to finally arrange itself according to the distribution of input vectors.
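The following sketch (our own Python/NumPy code; the grid size, decay schedules and Gaussian neighborhood are assumed design choices) implements the Kohonen-style update in which the winning neuron and its grid neighbors move toward each input vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_sofm(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    """Train a 2D self-organizing feature map on the row vectors in `data`."""
    rows, cols = grid
    weights = rng.random((rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)  # neuron grid positions
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Winner: the neuron whose weight vector is closest to the input.
            dists = np.linalg.norm(weights - x, axis=2)
            winner = np.unravel_index(np.argmin(dists), (rows, cols))
            # Linearly decay the learning rate and neighborhood radius over time.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.5
            # Gaussian neighborhood around the winner, measured on the grid.
            grid_d2 = ((coords - np.array(winner)) ** 2).sum(axis=2)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)  # move winner and neighbors toward x
            step += 1
    return weights

# Example: organize 500 random 3D vectors (e.g., RGB colors) on a 10x10 map.
trained = train_sofm(rng.random((500, 3)))
```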

Adaptive Resonance Theory

Adaptive Resonance Theory (ART) was initially introduced by Grossberg [5] as a theory of human information processing. ART neural networks are extensively used for supervised and unsupervised classification tasks and function approximation. There are many different variations of ART networks available today [4]. For example, ART1 performs unsupervised learning for binary input patterns, ART2 is modified to handle both analog and binary input patterns, and ART3 performs parallel searches of distributed recognition codes in a multilevel network hierarchy. ARTMAP combines two ART modules to perform supervised learning, while fuzzy ARTMAP represents a synthesis of elements from neural networks, expert systems, and fuzzy logic.

2.2 Rough Sets

Rough set theory [75–77, 87] is a fairly new intelligent technique for managing uncertainty that has been applied to the medical domain; it is used for the discovery of data dependencies, to evaluate the importance of attributes, to discover patterns in data, to reduce redundant objects and attributes, to seek the minimum subset of attributes, and to recognize and classify objects in medical imaging. Moreover, it is being used for the extraction of rules from databases. Rough sets have proven useful for the representation of vague regions in spatial data. One advantage of rough sets is the creation of readable if-then rules. Such rules have the potential to reveal new patterns in the data; furthermore, they also collectively function as a classifier for unseen data sets. Unlike other computational intelligence techniques, rough set analysis requires no external parameters and uses only the information present in the given data. One of the nice features of rough set theory is that it can tell whether the data is complete or not based on the data itself. If the data is incomplete, it suggests that more information about the objects needs to be collected in order to build a good classification model. On the other hand, if the data is complete, rough sets can determine whether there is more than enough or redundant information in the data and find the minimum data needed for a classification model. This property of rough sets is very important for applications where domain knowledge is very limited or data collection is very expensive or laborious, because it makes sure the data collected is just good enough to build a good classification model, without sacrificing the accuracy of the model or wasting time and effort gathering extra information about the objects [75–77, 87].


In rough set theory, the data is collected in a table called a decision table. Rows of the decision table correspond to objects, and columns correspond to attributes. In the data set, we assume that a set of examples is given, each with a class label indicating the class to which it belongs. We call the class label the decision attribute and the rest of the attributes the condition attributes. Rough set theory defines three regions based on the equivalence classes induced by the attribute values: the lower approximation, the upper approximation and the boundary. The lower approximation contains all the objects which can be classified with certainty based on the data collected, and the upper approximation contains all the objects which can possibly be so classified, while the boundary is the difference between the upper approximation and the lower approximation. Thus, we can define a rough set as any set defined through its lower and upper approximations. On the other hand, the notion of indiscernibility is fundamental to rough set theory. Informally, two objects in a decision table are indiscernible if one cannot distinguish between them on the basis of a given set of attributes. Hence, indiscernibility is a function of the set of attributes under consideration. For each set of attributes we can thus define a binary indiscernibility relation, which is a collection of pairs of objects that are indiscernible from each other. An indiscernibility relation partitions the set of cases or objects into a number of equivalence classes. An equivalence class of a particular object is simply the collection of objects that are indiscernible from the object in question. Here we provide an explanation of the basic framework of rough set theory, along with some of the key definitions. A review of this basic material can be found in sources such as [74–77, 87] and many others.
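To make the indiscernibility classes and the lower/upper approximations concrete, here is a minimal sketch (our own Python code; the tiny decision table is hypothetical):

```python
from collections import defaultdict

def equivalence_classes(table, attrs):
    """Partition objects by their values on `attrs` (the indiscernibility relation)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def approximations(table, attrs, target):
    """Lower and upper approximations of a target set of objects."""
    lower, upper = set(), set()
    for eq in equivalence_classes(table, attrs):
        if eq <= target:   # whole class surely belongs to the concept
            lower |= eq
        if eq & target:    # class possibly belongs to the concept
            upper |= eq
    return lower, upper

# Hypothetical decision table: condition attributes 'a', 'b'; decision 'd'.
table = {
    1: {"a": 0, "b": 1, "d": "yes"},
    2: {"a": 0, "b": 1, "d": "no"},   # indiscernible from object 1 on {a, b}
    3: {"a": 1, "b": 0, "d": "yes"},
    4: {"a": 1, "b": 1, "d": "no"},
}
target = {o for o, r in table.items() if r["d"] == "yes"}
lower, upper = approximations(table, ["a", "b"], target)
print(lower, upper, upper - lower)  # {3} {1, 2, 3} {1, 2} (the boundary)
```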

2.3 Near Sets: Generalization of the Rough Set in Multimedia Processing

Near sets [67, 78–81, 83] offer a generalization of traditional rough set theory [84–88] and a new approach to classifying perceptual objects by means of features [89–94]. The near set approach can be used to classify images that are qualitatively but not necessarily quantitatively close to each other. This is essentially the idea expressed in classifying images in [67, 81]. If one adopts the near set approach in image processing, a byproduct of the approach is the separation of images into non-overlapping sets of images that are similar to (descriptively near to) each other. This has recently led to an application of the near set approach in 2D and 3D interactive gaming with a vision system that learns and serves as the backbone for an adaptive telerehabilitation system for patients with finger, hand, arm and balance disabilities (see, e.g., [100, 101]). Each remote node in the telerehabilitation system includes a vision system that learns to track the behavior of a patient. Images deemed to be ‘interesting’ (e.g., images representing erratic behavior) are stored as well as forwarded to a rehabilitation center for followup. In such a system, there is a need to identify images that are in some sense near images representing some standard or norm. This research has led to a study of methods of automating image segmentation as a first step in near set-based image processing. This section is limited to a very brief introduction to near sets and near images useful in image pattern recognition.

Object Description

Perceptual objects that have the same appearance are considered qualitatively near each other, i.e., objects with matching descriptions. A description is a tuple of values of functions representing features of an object [79]. For simplicity, assume the description of an object consists of one function value. For example, let w ∈ I, w′ ∈ I′ be n × m pixel windows contained in two images I, I′ and let φ(w) = information content of pixel window w, where information content is a feature of a pixel window and φ is a sample function representing information content defined in the usual way [99]. Then pixel window w is near pixel window w′ if φ(w) = φ(w′).

Near Objects

Objects are known by their descriptions. An object description is defined by means of a tuple of function values φ(x) associated with an object x ∈ X. Assume that B ⊆ F is a given set of functions representing features of sample objects X ⊆ O. Let φi ∈ B, where φi : O → ℝ. In combination, the functions representing object features provide a basis for an object description φ : O → ℝ^L, a vector containing the measurements (returned values) associated with each functional value φi(x) in (3), where the description length is |φ| = L.

Object Description: φ(x) = (φ1(x), φ2(x), . . . , φi(x), . . . , φL(x)). (3)

The intuition underlying a description φ(x) is a recording of measurements from sensors, where each sensor is modeled by a function φi. Then let ∆φi denote the difference

∆φi = φi(x′) − φi(x),

where x, x′ ∈ O. The difference ∆φ leads to a definition of the indiscernibility relation ∼B introduced by Pawlak [86] (see Definition 1).

Definition 1. Indiscernibility Relation. Let x, x′ ∈ O, B ⊆ F.

∼B = {(x, x′) ∈ O × O | ∀φi ∈ B, ∆φi = 0},

where i ≤ |φ| (description length).

Near Sets

The basic idea in the near set approach to object recognition is to compare object descriptions. Sets of objects X, X′ are considered near each other if the sets contain objects with at least partial matching descriptions.


Definition 2. Near Sets. Let X, X′ ⊆ O, B ⊆ F. Set X is near X′ if, and only if, there exist x ∈ X, x′ ∈ X′, φi ∈ B such that x ∼{φi} x′.

For example, consider a pair of images I, I′ where a pixel window in image I has a description that matches the description of a pixel window in image I′. The objects in this case are pixel windows. By definition, I, I′ are near sets and, from an image classification perspective, I, I′ are near images. Object recognition problems, especially in images [67], and the problem of the nearness of objects have motivated the introduction of near sets (see, e.g., [81, 83]).

Near Images

In the context of image processing, the relation ∼B in Definition 1 is important because it suggests a way to classify images by a number of straightforward steps: (1) identify an image object, e.g., a pixel window; (2) select a set B containing functions representing features of an image object such as a pixel window; (3) partition each image using ∼B and then compare a representative object from a class in each partition. In the case where one discovers that the objects in the selected classes have matching descriptions, the images are near each other at the class level. In effect, if near images are discovered, a pair of sample images has been effectively classified. This is important because it leads to an effective image segmentation method.
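The following sketch illustrates these three steps under simplifying assumptions (grayscale images given as 2D NumPy arrays, a single feature B = {information content} computed as the Shannon entropy of a pixel window, and coarse rounding so that nearly equal feature values count as matching descriptions):

import numpy as np

def entropy(window, levels=256):
    # Information content of a pixel window: Shannon entropy of its histogram.
    hist, _ = np.histogram(window, bins=levels, range=(0, levels))
    p = hist[hist > 0] / window.size
    return round(float(-(p * np.log2(p)).sum()), 1)  # rounding acts as matching tolerance

def window_descriptions(image, n=8):
    # Steps 1-3: tile the image into n x n windows and partition them by description.
    h, w = image.shape
    return {entropy(image[r:r + n, c:c + n])
            for r in range(0, h - n + 1, n)
            for c in range(0, w - n + 1, n)}

def near_images(img1, img2):
    # Definition 2: the images are near if some windows have matching descriptions.
    return bool(window_descriptions(img1) & window_descriptions(img2))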

2.4 Fuzzy Sets

Zadeh [115] introduced the concept of fuzzy logic to represent vagueness in linguistics, and further to implement and express human knowledge and inference capability in a natural way. Fuzzy logic starts with the concept of a fuzzy set. A fuzzy set is a set without a crisp, clearly defined boundary. It can contain elements with only a partial degree of membership. A Membership Function (MF) is a curve that defines how each point in the input space is mapped to a membership value (or degree of membership) between 0 and 1. The input space is sometimes referred to as the universe of discourse. Let X be the universe of discourse and x a generic element of X. A classical set A ⊆ X is defined as a collection of elements or objects x ∈ X such that each x can either belong or not belong to the set A. By defining a characteristic function (or membership function) on each element x in X, a classical set A can be represented by a set of ordered pairs (x, 0) or (x, 1), where 1 indicates membership and 0 non-membership. Unlike the conventional sets mentioned above, a fuzzy set expresses the degree to which an element belongs to a set. Hence the characteristic function of a fuzzy set is allowed to take values between 0 and 1, denoting the degree of membership of an element in a given set. If X is a collection of objects denoted generically by x, then a fuzzy set A in X is defined as a set of ordered pairs:

A = {(x, µA(x)) | x ∈ X}, (4)

where µA(x) is called the membership function of the linguistic variable x in A, which maps X to the membership space M = [0, 1]. When M contains only the two points 0 and 1, A is crisp and µA(x) is identical to the characteristic function of a crisp set. Triangular and trapezoidal membership functions are the simplest membership functions, formed using straight lines. Some of the other shapes are Gaussian, generalized bell, sigmoidal and polynomial-based curves.

Fig. 4. Shapes of two commonly used MFs

Figure 4 illustrates the shapes of two commonly used MFs. The most important thing to realize about fuzzy logical reasoning is the fact that it is a superset of standard Boolean logic.
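For instance, the two straight-line membership functions just mentioned can be written directly from their geometric definitions (a minimal sketch; a, b, c, d are the usual breakpoint parameters):

def triangular(x, a, b, c):
    # Membership rises linearly from a to the peak at b, then falls to c.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    # Like the triangular MF, but with a flat top of full membership on [b, c].
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)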

Fuzzy Logic Operators

It is interesting to note the correspondence between two-valued and multi-valued logic operations for AND, OR, and NOT. It is possible to resolve the statement A AND B, where A and B are limited to the range (0,1), by using the operator min(A, B). Using the same reasoning, we can replace the OR operation with the maximum operator, so that A OR B becomes equivalent to max(A, B). Finally, the operation NOT A becomes equivalent to the operation 1 − A. In fuzzy logic terms these are popularly known as fuzzy intersection or conjunction (AND), fuzzy union or disjunction (OR), and fuzzy complement (NOT). The intersection of two fuzzy sets A and B is specified in general by a binary mapping T, which aggregates two membership functions as follows:

µA∩B(x) = T (µA(x), µB(x)) (5)

The fuzzy intersection operator is usually referred to as the T-norm (triangular norm) operator. The fuzzy union operator is specified in general by a binary mapping S.

µA∪B(x) = S(µA(x), µB(x)) (6)

This class of fuzzy union operators is often referred to as T-conorm (or S-norm) operators.
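These correspondences translate directly into code; a minimal sketch using the standard min/max norms and the 1 − A complement described above:

def fuzzy_and(mu_a, mu_b):
    return min(mu_a, mu_b)      # T-norm: fuzzy intersection (conjunction)

def fuzzy_or(mu_a, mu_b):
    return max(mu_a, mu_b)      # S-norm (T-conorm): fuzzy union (disjunction)

def fuzzy_not(mu_a):
    return 1.0 - mu_a           # fuzzy complement (negation)

# With memberships restricted to {0, 1} these reduce exactly to Boolean
# AND, OR and NOT, illustrating that fuzzy logic is a superset of Boolean logic.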


If–then Rules and Fuzzy Inference Systems

The fuzzy rule base is characterized in the form of if–then rules in which preconditions and consequents involve linguistic variables. The collection of these fuzzy rules forms the rule base for the fuzzy logic system. Due to their concise form, fuzzy if–then rules are often employed to capture the imprecise modes of reasoning that play an essential role in the human ability to make decisions in an environment of uncertainty and imprecision. A single fuzzy if–then rule assumes the form:

if x is A then y is B,

where A and B are linguistic values defined by fuzzy sets on the ranges (universes of discourse) X and Y, respectively. The if-part of the rule ‘x is A’ is called the antecedent (pre-condition) or premise, while the then-part of the rule ‘y is B’ is called the consequent or conclusion. Interpreting an if–then rule involves evaluating the antecedent (fuzzification of the input and applying any necessary fuzzy operators) and then applying that result to the consequent (known as implication). For rules with multiple antecedents, all parts of the antecedent are calculated simultaneously and resolved to a single value using the logical operators. Similarly, all the consequents (for rules with multiple consequents) are affected equally by the result of the antecedent. The consequent specifies a fuzzy set to be assigned to the output. The implication function then modifies that fuzzy set to the degree specified by the antecedent. For multiple rules, the output of each rule is a fuzzy set. The output fuzzy sets for all rules are then aggregated into a single output fuzzy set. Finally, the resulting set is defuzzified, or resolved to a single number. The defuzzification interface is a mapping from a space of fuzzy actions defined over an output universe of discourse into a space of non-fuzzy actions, because the output from the inference engine is usually a fuzzy set while for most practical applications crisp values are required. The three commonly applied defuzzification techniques are the max-criterion, center-of-gravity and mean-of-maxima. The max-criterion is the simplest of the three to implement: it produces the point at which the possibility distribution of the action reaches a maximum value. Readers may refer to [7] for more information on fuzzy systems. It is typically advantageous if the fuzzy rule base is adaptive to a certain application. The fuzzy rule base is usually constructed manually or by automatic adaptation using some learning technique, such as evolutionary algorithms and/or neural network learning methods [6].
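The whole pipeline (fuzzify, apply implication, aggregate, defuzzify by center of gravity) can be sketched for a one-input, one-output system with two rules; the membership functions, rule definitions and universes below are illustrative assumptions only:

def centroid(ys, mus):
    # Center-of-gravity defuzzification over a sampled output universe.
    total = sum(mus)
    return sum(y * m for y, m in zip(ys, mus)) / total if total else 0.0

def infer(x, rules, ys):
    # Each rule is (antecedent MF, consequent MF). Min implication clips the
    # consequent at the rule's firing strength; max aggregates across rules.
    agg = [0.0] * len(ys)
    for antecedent, consequent in rules:
        strength = antecedent(x)                    # fuzzification of the input
        agg = [max(a, min(strength, consequent(y))) for a, y in zip(agg, ys)]
    return centroid(ys, agg)                        # defuzzification

# Two illustrative rules: "if error is small then output is low" and
# "if error is large then output is high" (simple ramp MFs on [0, 10]).
small = lambda v: max(0.0, 1.0 - v / 5.0)
large = lambda v: min(1.0, max(0.0, (v - 2.0) / 3.0))
low = lambda y: max(0.0, 1.0 - y / 10.0)
high = lambda y: min(1.0, y / 10.0)

ys = [0.5 * i for i in range(21)]                   # sampled output universe
crisp = infer(3.0, [(small, low), (large, high)], ys)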

Fuzzy Image Processing

The adoption of the fuzzy paradigm is desirable in image processing because of the uncertainty and imprecision present in images, due to noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with this imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows one to represent the knowledge about the given problem in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information. Fuzzy image processing [10, 68, 73, 102, 112] is the collection of all approaches that understand, represent and process images, their segments and their features as fuzzy sets. An image I of size M × N with L gray levels can be considered as an array of fuzzy singletons, each having a membership value denoting its degree of brightness relative to some brightness level. For an image I, we can write in the notation of fuzzy sets:

I = ⋃MN µ(gmn)/gmn, (7)

where gmn is the intensity of the (m, n)th pixel, µ(gmn) its membership value, and the union is taken over all M × N pixels. The membership function characterizes a suitable property of the image (e.g., edginess, darkness, a textural property) and can be defined globally for the whole image or locally for its segments. In recent years, some researchers have applied the concept of fuzziness to develop new algorithms for image processing tasks, for example image enhancement, segmentation, etc. A fuzzy image processing system is a rule-based system that uses fuzzy logic to reason about image data. Its basic structure consists of four main components, as depicted in Fig. 5.

Fig. 5. Fuzzy image processing system [10]


• The coding of image data (fuzzifier), which translates the gray-level plane into the membership plane

• An inference engine, which applies a fuzzy reasoning mechanism to obtain a fuzzy output

• The decoding of that result (defuzzifier), which translates the fuzzy output back into the gray-level plane; and

• A knowledge base, which contains both an ensemble of fuzzy rules, known as the rule base, and an ensemble of membership functions, known as the database

The decision-making process is performed by the inference engine using the rules contained in the rule base. These fuzzy rules define the connection between input and output fuzzy variables. The inference engine evaluates all the rules in the rule base and combines the weighted consequents of all relevant rules into a single output fuzzy set.
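A compact sketch of this four-component structure is given below for a contrast-enhancement task. The "rule base" here is the classic intensification pair ("if a pixel is dark, make it darker; if bright, make it brighter"), a standard textbook example rather than the specific system of [10]:

import numpy as np

def fuzzy_contrast_enhance(image):
    # Fuzzifier: translate the gray-level plane into a membership plane in [0, 1].
    mu = image.astype(float) / 255.0
    # Inference engine: the intensification operator realizes the two rules
    # "if dark then darker" and "if bright then brighter" on the membership plane.
    mu = np.where(mu <= 0.5, 2.0 * mu ** 2, 1.0 - 2.0 * (1.0 - mu) ** 2)
    # Defuzzifier: translate the modified membership plane back to gray levels.
    return np.clip(mu * 255.0, 0, 255).astype(np.uint8)

# Usage on a synthetic 8-bit grayscale image:
# enhanced = fuzzy_contrast_enhance(np.random.randint(0, 256, (64, 64), dtype=np.uint8))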

2.5 Evolutionary Algorithms

Evolutionary algorithms (EA) are adaptive methods, which may be used to solve search and optimization problems, based on the genetic processes of biological organisms. Over many generations, natural populations evolve according to the principles of natural selection and ‘survival of the fittest’, first clearly stated by Charles Darwin in The Origin of Species. By mimicking this process, evolutionary algorithms are able to ‘evolve’ solutions to real world problems, if they have been suitably encoded [12]. Usually grouped under the term evolutionary algorithms or evolutionary computation, we find the domains of genetic algorithms [15, 16], evolution strategies [21], evolutionary programming [11], genetic programming [18] and learning classifier systems. They all share a common conceptual base of simulating the evolution of individual structures via processes of selection, mutation, and reproduction. The processes depend on the perceived performance of the individual structures as defined by the environment (problem).

EAs deal with parameters of finite length, which are coded using a finite alphabet, rather than directly manipulating the parameters themselves. This means that the search is constrained neither by the continuity of the function under investigation nor by the existence of a derivative function.

Figure 6 depicts the functional block diagram of a Genetic Algorithm (GA), and the various aspects are discussed below. It is assumed that a potential solution to a problem may be represented as a set of parameters. These parameters (known as genes) are joined together to form a string of values (known as a chromosome). A gene (also referred to as a feature, character or detector) refers to a specific attribute that is encoded in the chromosome. The particular values a gene can take are called its alleles. The position of the gene in the chromosome is its locus. Encoding issues deal with representing a solution in a chromosome; unfortunately, no one technique works best for all problems.


Fig. 6. The functional block diagram of a genetic algorithm

A fitness function must be devised for each problem to be solved. Given a particular chromosome, the fitness function returns a single numerical fitness, or figure of merit, which is meant to reflect the ability of the individual that the chromosome represents. Reproduction is the second critical attribute of GAs, where two individuals selected from the population are allowed to mate to produce offspring, which will comprise the next generation. Having selected two parents, their chromosomes are recombined, typically using the mechanisms of crossover and mutation.

There are many ways in which crossover can be implemented. In single-point crossover, two chromosome strings are cut at some randomly chosen position to produce two ‘head’ segments and two ‘tail’ segments. The tail segments are then swapped over to produce two new full-length chromosomes. Crossover is not usually applied to all pairs of individuals selected for mating. Another genetic operation is mutation, an asexual operation that only operates on one individual: it randomly alters each gene with a small probability. The traditional view is that crossover is the more important of the two techniques for rapidly exploring a search space. Mutation provides a small amount of random search, and helps ensure that no point in the search space has a zero probability of being examined. If the GA has been correctly implemented, the population will evolve over successive generations so that the fitness of the best and the average individual in each generation increases towards the global optimum. Selection is the survival of the fittest within GAs: it determines which individuals are to survive to the next generation. The selection phase consists of three parts. The first part involves determination of the individual's fitness by the fitness function, which returns a single numerical fitness value proportional to the ability, or utility, of the individual represented by that chromosome. For many problems, deciding upon the fitness function is very straightforward; for example, for a function optimization search, the fitness is simply the value of the function. Ideally, the fitness function should be smooth and regular, so that chromosomes with reasonable fitness are close in the search space to chromosomes with slightly better fitness. However, it is not always possible to construct such ideal fitness functions. The second part involves converting the fitness into an expected value, followed by the last part, where the expected value is converted to a discrete number of offspring. Some of the commonly used selection techniques are roulette wheel and stochastic universal sampling. Genetic programming applies the GA concept to the generation of computer programs. Evolutionary programming uses mutations to evolve populations. Evolution strategies incorporate many features of the GA but use real-valued parameters in place of binary-valued parameters. Learning classifier systems use GAs in machine learning to evolve populations of condition/action rules.
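The loop depicted in Fig. 6 can be condensed into a short sketch: binary chromosomes, roulette-wheel selection, single-point crossover and bit-flip mutation, with the one-max problem standing in as a placeholder fitness function:

import random

def genetic_algorithm(fitness, n_bits=16, pop_size=30, generations=50,
                      p_cross=0.7, p_mut=0.01):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = float(sum(scores))

        def roulette():
            # Roulette wheel: select with probability proportional to fitness.
            r, acc = random.uniform(0, total), 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return chrom
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = roulette()[:], roulette()[:]
            if random.random() < p_cross:
                # Single-point crossover: swap tail segments at a random cut.
                cut = random.randint(1, n_bits - 1)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                # Bit-flip mutation: alter each gene with small probability.
                nxt.append([b ^ (random.random() < p_mut) for b in child])
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# One-max: fitness is simply the number of 1-bits in the chromosome.
best = genetic_algorithm(lambda chrom: sum(chrom))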

2.6 Intelligent Paradigms: Probabilistic Computing and Swarm Intelligence

Probabilistic models are viewed as similar to those of a game: actions are based on expected outcomes. The center of interest moves from deterministic to probabilistic models using statistical estimations and predictions. In the probabilistic modeling process, risk means uncertainty for which the probability distribution is known. Risk assessment therefore means a study to determine the outcomes of decisions along with their probabilities. Decision-makers often face a severe lack of information. Probability assessment quantifies the information gap between what is known and what needs to be known for an optimal decision. Probabilistic models are used for protection against adverse uncertainty and for exploitation of propitious uncertainty.

Swarm intelligence is aimed at the collective behavior of intelligent agents in decentralized systems. Although there is typically no centralized control dictating the behavior of the agents, local interactions among the agents often cause a global pattern to emerge. Most of the basic ideas are derived from real swarms in nature, including ant colonies, bird flocking, honeybees, bacteria and microorganisms, etc. Ant Colony Optimization (ACO) algorithms, for example, have already been applied successfully to solve several engineering optimization problems. Swarm models are population-based: the population is initialised with a set of potential solutions, and these individuals are then manipulated (optimised) over many iterations using heuristics inspired by the social behavior of insects, in an effort to find the optimal solution. Ant colony algorithms are inspired by the behavior of natural ant colonies, in the sense that they solve their problems by multi-agent cooperation using indirect communication through modifications in the environment. Ants release a certain amount of pheromone (hormone) while walking, and each ant prefers (probabilistically) to follow a direction that is rich in pheromone. This simple behavior explains why ants are able to adjust to changes in the environment, such as optimizing the shortest path to a food source or a nest. In ACO, ants use information collected during past simulations to direct their search, and this information is made available and modified through the environment. Recently, ACO algorithms have also been used for clustering data sets.
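A minimal sketch of these pheromone mechanics, finding a short path on a small weighted graph, is given below (the evaporation rate and deposit rule are common textbook choices, not a specific published ACO variant):

import random

def aco_path(dist, source, target, n_ants=20, n_iters=50, rho=0.5):
    # dist[i][j] > 0 is the length of edge (i, j); 0 means no edge.
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]          # pheromone trail on each edge
    best = None
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            node, path = source, [source]
            while node != target:
                nxt = [j for j in range(n) if dist[node][j] and j not in path]
                if not nxt:
                    break
                # Probabilistic step: prefer edges rich in pheromone and short.
                weights = [tau[node][j] / dist[node][j] for j in nxt]
                node = random.choices(nxt, weights=weights)[0]
                path.append(node)
            if path[-1] == target:
                tours.append(path)
        for row in tau:                          # evaporation
            for j in range(n):
                row[j] *= 1.0 - rho
        for path in tours:                       # deposit, more on shorter tours
            length = sum(dist[a][b] for a, b in zip(path, path[1:]))
            for a, b in zip(path, path[1:]):
                tau[a][b] += 1.0 / length
            if best is None or length < best[0]:
                best = (length, path)
    return best                                  # (length, path), or None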


3 Computational Intelligence in Speech, Audio and Image Processing

Computational intelligence techniques have been used for processing speech, audio and images for several years [59, 64, 98]. Some of the applications in speech processing where computational intelligence is extensively used include speech recognition, speaker recognition, speech enhancement, speech coding and speech synthesis; in audio processing, computational intelligence is used for speech/music classification, audio classification and audio indexing and retrieval; while applications in image processing include image enhancement, segmentation, classification, registration, motion detection, etc. For example, Vladimir et al. [17] proposed a fuzzy logic recursive scheme for motion detection and spatiotemporal filtering that can deal with Gaussian noise and unsteady illumination conditions in both the temporal and spatial directions. Their focus is on applications concerning the tracking and de-noising of image sequences. An input noisy sequence is processed with fuzzy logic motion detection to determine the degree of motion confidence. The proposed motion detector combines the memberships of the temporal intensity changes, appropriately using fuzzy rules, where the membership degree of motion for each pixel in a 2D sliding window is determined by a proposed membership function. Both the fuzzy membership function and the fuzzy rules are defined in such a way that the performance of the motion detector is optimized in terms of its robustness to noise and unsteady lighting conditions. Tracking and recursive adaptive temporal filtering are simultaneously performed, where the amount of filtering is inversely proportional to the confidence in the existence of motion. Finally, temporally filtered frames are further processed by a proposed spatial filter to obtain a de-noised image sequence. The proposed motion detection algorithm has been evaluated using two criteria: (1) robustness to noise and to changing illumination conditions and (2) motion blur in temporal recursive de-noising.

Speech and Audio Processing

Speech processing is the study of speech signals and of the processing methods for these signals. The signals are usually processed in a digital representation, whereby speech processing can be seen as the intersection of digital signal processing and natural language processing. It can be divided into the following categories: (1) speech recognition, which deals with analysis of the linguistic content of a speech signal; (2) speaker recognition, where the aim is to recognize the identity of the speaker; (3) enhancement of speech signals, e.g., audio noise reduction; (4) speech coding, a specialized form of data compression that is important in the telecommunication area; (5) voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords; (6) speech synthesis (i.e., the artificial synthesis of speech), which usually means computer-generated speech; and (7) speech enhancement, which deals with enhancing the perceptual quality of a speech signal by removing the destructive effects of noise, limited-capacity recording equipment, impairments, etc. Readers may refer to [64] for an extensive overview of the advances in pattern recognition for speech and audio processing.

The feasibility of converting text into speech using an inexpensive computer with minimal memory is of great interest. Speech synthesizers have been developed for many popular languages (e.g., English, Chinese, Spanish, French, etc.), but designing a speech synthesizer for a language is largely dependent on the language structure. Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. Karaali et al. [56] described a system that uses a Time-Delay Neural Network (TDNN) to perform this phonetic-to-acoustic mapping, with another neural network to control the timing of the generated speech. The neural network system requires less memory than a concatenation system, and performed well in tests comparing it to commercial systems using other technologies. It is reported that the neural network approach to speech synthesis offers the benefits of language portability, natural-sounding speech, and low storage requirements, as well as providing better voice quality than traditional approaches.

Hendessi et al. [55] developed a Persian synthesizer that includes an innovative text analyzer module. In the synthesizer, the text is segmented into words and, after preprocessing, a neural network is passed over each word. In addition to preprocessing, a new model (SEHMM) is used as a post-processor to compensate for errors generated by the neural network. The performance of the proposed model is verified and the intelligibility of the synthetic speech is assessed via listening tests.

Neural networks can also be used to synthesize speech from a phonetic representation by generating frames of input to a vocoder. This requires the neural network to compute one output for each frame of speech from the vocoder, which can be computationally expensive. Corrigan et al. [57] introduced an alternative implementation that models the speech as a series of gestures and lets the neural network generate parameters describing the transitions of the vocoder parameters during these gestures. Their experiments have shown that acceptable speech quality is produced when each gesture is half of a phonetic segment and the transition model is a set of cubic polynomials describing the variation of each vocoder parameter during the gesture. Empirical results reveal a significant reduction in the computational cost.

Frankel et al. [60] described a speech recognition system which uses articulatory parameters as basic features and phone-dependent linear dynamic models. The system first estimates articulatory trajectories from the speech signal. Estimates of the x and y coordinates of seven actual articulator positions in the midsagittal plane are produced every 2 ms by a recurrent neural network trained on real articulatory data. The output of this network is then passed to a set of linear dynamic models, which perform phone recognition.


In recent years, features derived from the posteriors of a Multilayer Perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. Cetin et al. [105] illustrated on a relatively small data set that MLPs trained for articulatory feature classification can be equally effective. They provided a similar comparison using MLPs trained on a much larger data set: 2,000 h of English conversational telephone speech. The authors also explored how portable phone-based and articulatory feature-based tandem features are in an entirely different language, Mandarin, without any retraining. It is reported that while phone-based features perform slightly better in the matched-language condition, they perform significantly better in the cross-language condition. Yet, in the cross-language condition, neither approach is as effective as tandem features extracted from an MLP trained on a relatively small amount of in-domain data. Beyond feature concatenation, Cetin et al. explored novel observation modeling schemes that allow for greater flexibility in combining the tandem and standard features at hidden Markov model (HMM) outputs.

Halavati et al. [42] presented a novel approach to speech recognition using fuzzy modeling. The task begins with conversion of the speech spectrogram into a linguistic description based on arbitrary colors and lengths. Phonemes are also described using these fuzzy measures, and recognition is done by normal fuzzy reasoning, while a genetic algorithm optimizes the phoneme definitions so as to classify samples into the correct phonemes. The method was tested on a standard speech database and the results are presented.

One of the factors complicating work with speech signals is their large degree of acoustic variability. To decrease the influence of the acoustic variability of speech signals, the use of genetic algorithms in speech processing systems has been proposed. Bovbel and Tsishkoual [43] constructed a model which implements the technology of speech recognition using genetic algorithms. They made experiments with their model on a database of separated Belarussian words and achieved optimal results.

Ding [49] presented a fuzzy control mechanism for conventional Maximum Likelihood Linear Regression (MLLR) speaker adaptation, called FLC-MLLR, by which the effect of MLLR adaptation is regulated according to the availability of adaptation data, in such a way that the advantage of MLLR adaptation is fully exploited when the training data are sufficient, while the consequences of poor MLLR adaptation are restrained otherwise. The robustness of MLLR adaptation against data scarcity is thus ensured. It is reported that the proposed mechanism is conceptually simple, computationally inexpensive and effective; recognition-rate experiments show that FLC-MLLR outperforms standard MLLR, especially when encountering data insufficiency, and performs better than MAPLR at much lower computing cost.

Kostek and Andrzej [47] discussed some limitations of the hearing-aid fitting process. In the fitting process, an audiologist performs tests on the wearer of the hearing aid, which is then adjusted based on the results of the tests, with the goal of making the device work as well as it can for that individual. Traditional fitting procedures employ specialized testing devices which use artificial test signals. Ideally, however, the fitting of hearing aids should also simulate real-world conditions, such as listening to speech in the presence of background noise. Therefore, more satisfying and reliable fitting tests may be achieved through the use of multimedia computers equipped with a properly calibrated sound system. Kostek and Andrzej developed a new automatic system for fitting hearing aids. It employs fuzzy logic, and a computer makes choices for adjusting the hearing aid's settings by analyzing the patient's responses to questions, with replies that can lie somewhere between a simple yes or no.

With the increase in access to multimedia computers, speech training can be made available to patients with no continuous assistance required from speech therapists. Another function such a system can easily perform is screening testing of speech fluency, providing directed information to patients who have various speech disorders and problems with understanding speech. Andrzej et al. [51] programmed a speech therapy training algorithm consisting of diagnostic tools and the rehabilitation devices connected with them. The first function the system has to perform is data acquisition, where information about the patient's medical history is collected. This is done through electronic questionnaires. The next function is analysis of the speech signal articulated by the patient when prompted by the computer, followed by some multimedia tests carried out in order to assess the subject's ability to understand speech. Next, the results of the electronic questionnaire, the patient's voice and the patient's reactions are automatically analyzed. Based on these, the system automatically diagnoses possible speech disorders and how strong they are. A large number of school children were tested and the results reported.

The process of counting stuttering events could be carried out more objectively through the automatic detection of stop-gaps, syllable repetitions and vowel prolongations. The alternative would be based on subjective evaluations of speech fluency and may be dependent on the subjective evaluation method. Meanwhile, the automatic detection of intervocalic intervals, stop-gaps, voice onset time and vowel durations may depend on the speaker, and rules derived for a single speaker might be unreliable when one tries to consider them as universal. This implies that learning algorithms having strong generalization capabilities could be applied to solve the problem. Nevertheless, such a system requires vectors of parameters which characterize the distinctive features in a subject's speech patterns. In addition, an appropriate selection of the parameters and feature vectors while learning may augment the performance of an automatic detection system. Andrzej et al. [52] reported on the automatic recognition of stuttered speech in normal and frequency-altered feedback speech. Their work presents several methods of analyzing stuttered speech and describes attempts to establish those parameters that represent a stuttering event. It also reports results of some experiments on the automatic detection of speech disorder events that were based on both rough sets and artificial neural networks.

Andrzej and Marek [54] presented a method for pitch estimation enhancement. Pitch estimation methods are widely used for extracting musical data from a digital signal, and a brief review of these methods is included in the paper. However, since the processed signal may contain noise and distortions, the estimation results can be erroneous. The proposed method was developed in order to overcome the disadvantages of standard pitch estimation algorithms. The introduced approach is based on both pitch estimation in terms of signal processing and pitch prediction based on musical knowledge modeling. First, the signal is partitioned into segments roughly analogous to consecutive notes. Thereafter, an autocorrelation function is calculated for each segment. The autocorrelation function values are then altered using the pitch predictor output. A music predictor based on artificial neural networks was introduced for this task. A description of the proposed pitch estimation enhancement method is included and some details concerning music prediction are discussed.

Liu et al. [48] proposed an improved hybrid support vector machine and duration-distribution-based hidden Markov (SVM/DDBHMM) decision fusion model for robust continuous digital speech recognition. They investigated the combination of the probability outputs of a Support Vector Machine and a Gaussian mixture model in pattern recognition (called FSVM), and the embedding of the fusion probability as a similarity measure into the phone-state-level decision space of the Duration Distribution Based Hidden Markov Model (DDBHMM) speech recognition system (named FSVM/DDBHMM). The performances of FSVM and FSVM/DDBHMM were demonstrated on the Iris database and on a continuous Mandarin digital speech corpus in four noise environments (white, volvo, babble and destroyer-engine) from NOISEX-92. The experimental results show the effectiveness of FSVM on the Iris data, and an improvement in average word error rate reduction of FSVM/DDBHMM from 6% to 20% compared with the DDBHMM baseline at various signal-to-noise ratios (SNRs) from −5 dB to 30 dB in steps of 5 dB.

Andrzej [50] investigated methods for identifying the direction of an incoming acoustical signal in the presence of noise and reverberation. Since the problem is a non-deterministic one, applications of two learning algorithms, namely neural networks and rough sets, were developed to solve it. Consequently, two sets of parameters were formulated in order to discern the target source from an unwanted sound source position and then processed by the learning algorithms. The applied feature extraction methods are discussed, the training processes are described, and the obtained sound source localization results are demonstrated and compared.

Kostek et al. [53] presented automatic singing voice recognition using neural networks and rough sets. For this purpose a database containing singers' sample recordings was constructed, and parameters were extracted from the recorded voices of trained and untrained singers of various voice types. Parameters especially designed for the analysis of the singing voice are described and their physical interpretation is given. Decision systems based on artificial neural networks and rough sets are used for automatic voice type/voice quality classification.

Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. Selouani and Shaughnessy [25] proposed a hybrid enhancement noise reduction approach in the cepstral domain in order to obtain less-variant parameters. It is based on the Karhunen–Loeve Transform (KLT) in the mel-frequency domain combined with a Genetic Algorithm (GA). The enhanced parameters increased the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperformed the conventional recognition process in severely interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. They also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

CI in Speech Emotion Recognition

Speech emotion recognition is becoming more and more important in such computer application fields as health care, children's education, etc. Only a few works on speech emotion recognition using methods such as ANNs, SVMs, etc., have been reported in recent years. Feature sets are broadly discussed within speech emotion recognition by acoustic analysis. While popular filter- and wrapper-based searches help to retrieve relevant features, the automatic generation of features allows for more flexibility throughout the search. The basis is formed by dynamic low-level descriptors considering intonation, intensity, formants, spectral information and others. Next, a systematic derivation of prosodic, articulatory, and voice quality high-level functionals is performed by descriptive statistical analysis. From here on, feature alterations are automatically carried out to find an optimal representation within the feature space in view of a target classifier. In addition, the traditional feature selection methods used in speech emotion recognition are computationally too expensive to determine an optimum or suboptimum feature subset. Focusing on these problems, many successful works have been reported and discussed. For example, Zhou et al. [40] presented a novel approach based on rough set theory and SVMs for speech emotion recognition. The experimental results illustrate that the introduced approach can reduce the calculation cost while keeping a high recognition rate. Also, Schuller et al. [61] suggested the use of evolutionary programming to avoid NP-hard exhaustive search.

Fellenz et al. [44] proposed a framework for the processing of face image sequences and speech, using different dynamic techniques to extract appropriate features for emotion recognition. The features were used by a hybrid classification procedure, employing neural network techniques and fuzzy logic, to accumulate the evidence for the presence of an emotional expression in the face and in the speaker's voice.

Buscicchio et al. [19] proposed a biologically plausible methodology for the problem of emotion recognition, based on the extraction of vowel information from an input speech signal and on the classification of the extracted information by a spiking neural network. Initially, the speech signal is segmented into vowel parts, which are represented by a set of salient features related to the mel-frequency cepstrum. The extracted information is then classified by a spiking neural network into five different emotion classes.

Audio–Visual Speech Recognition

Audio–Visual Speech Recognition (AVSR) [63] is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterminate phones or in giving preponderance among near-probability decisions. A great interest in the research of AVSR systems is driven by the increase in the number of multimedia applications that require robust speech recognition systems. The use of visual features in AVSR is justified by both the audio and visual modality of speech generation and the need for features that are invariant to acoustic noise perturbation. The performance of an AVSR system relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region; therefore mouth tracking plays a major role in AVSR systems. Moreover, a human listener can use visual cues, such as lip and tongue movements, to enhance the level of speech understanding, especially in a noisy environment. The process of combining the audio modality and the visual modality is referred to as speech reading, or lip reading. There are many applications in which it is desired to recognize speech under extremely adverse acoustic environments: detecting a person's speech from a distance or through a glass window, understanding a person speaking among a very noisy crowd of people, and monitoring speech over a TV broadcast when the audio link is weak or corrupted are some examples. Computational intelligence techniques play an important role in this research direction, and a number of CI-based AVSR methods have been proposed in the literature. For example, Lim et al. [39] presented an improved version of a mouth tracking technique using a radial basis function neural network (RBF NN), with applications to AVSR systems. A modified extended Kalman filter (EKF) was used to adjust the parameters of the RBF NN. Simulation results revealed good performance of the proposed method.

Automatic Speech Recognition (ASR) performs well under restricted conditions, but performance degrades in noisy environments. AVSR combats this by incorporating a visual signal into the recognition. Lewis and Powers [62] discussed how to improve the performance of a standard speech recognition system by using information from the traditional auditory signals as well as from visual signals. Using knowledge from psycholinguistics, a late-integration network was developed that fused the auditory and visual sources. An important first step in AVSR is that of feature extraction from the mouth region, and a technique developed by the authors is briefly presented. The authors examined how useful this extraction technique, in combination with several integration architectures, is at the given task, demonstrated that vision does in fact assist speech recognition when used in a linguistically guided fashion, and gave insight into the remaining issues.

Alessandro et al. [38] focused their attention on the problem of audio classification into speech and music for multimedia applications. In particular, they presented a comparison between two different techniques for speech/music discrimination. The first method is based on the zero crossing rate and Bayesian classification. It is very simple from a computational point of view and gives good results in the case of pure music or speech. The simulation results show that some performance degradation arises when the music segment also contains some speech superimposed on music, or strong rhythmic components. To overcome these problems, they proposed a second method that uses more features and is based on neural networks (specifically a multi-layer Perceptron). It is reported that the introduced algorithm obtains better performance, at the expense of a limited growth in computational complexity. In practice, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even if low-cost embedded systems are used.

Speech recognition techniques have developed dramatically in recent years. Nevertheless, errors caused by environmental noise are still a serious problem in recognition. Algorithms that detect and follow the motion of the lips have been widely employed to improve the performance of speech recognition algorithms. Vahideh and Yaghmaie [65] presented a simple and efficient method for extracting visual features of the lips to recognize vowels, based on neural networks. Its accuracy is verified by using it to recognize the six main Farsi vowels.

Faraj and Bigun [41] described a new identity authentication technique based on a synergetic use of lip motion and speech. The lip motion is defined as the distribution of apparent velocities in the movement of brightness patterns in an image, and is estimated by computing the velocity components of the structure tensor by 1D processing in 2D manifolds. Since the velocities are computed without extracting the speaker's lip contours, more robust visual features can be obtained in comparison to motion features extracted from lip contours. The motion estimations are performed in a rectangular lip region, which affords increased computational efficiency. A person authentication implementation based on lip movements and speech is presented, along with experiments exhibiting a recognition rate of 98%. Besides its value in authentication, the technique can be used naturally to evaluate the liveness of someone speaking, as it can be used in text-prompted dialogue. The XM2VTS database was used for performance quantification, as it is currently the largest publicly available database (300 persons) containing both lip motion and speech. Comparisons with other techniques are presented.

Shan Meng and Youwei Zhang [58] described a method of visual speech feature area localization. First, they proposed a simplified human skin color model to segment input images and estimate the location of the human face. The authors then proposed a new localization method that is a combination of SVM and the Distance of Likelihood in Feature Space (DLFS) derived from Kernel Principal Component Analysis (KPCA). Results show that the introduced method outperformed traditional linear ones. All experiments were based on the Chinese Audio–Visual Speech Database (CAVSD).

4 Computational Intelligence in Video Processing

Edge extraction, texture classification, face recognition, character recognition, fingerprint identification, image/video enhancement, image/video segmentation and clustering, and image/video coding are some of the applications of computational intelligence in image processing. Here we present some reported examples of using CI techniques in multimedia processing, in particular in image/video processing. There has been much recent research interest in this area, and much successful work addressing this issue has been reported and discussed. Below, we review some of this work to illustrate how CI can be applied to resolve the video segmentation problem.

Computational Intelligence in Video Segmentation

Successful video segmentation is necessary for most multimedia applications. In order to analyze a video sequence, it is necessary to break it down into meaningful units that are of smaller length and have some semantic coherence. Video segmentation is the process of dividing a sequence of frames into smaller meaningful units that represent information at the scene level. This process serves as a fundamental step towards any further analysis of video frames for content analysis. In the past, several statistical methods that compare frame differences have been published in the literature, and a range of similarity measures between frames based on gray-scale intensity, color and texture has been proposed. Here we present successful work using CI techniques in video segmentation.
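As a concrete baseline for the histogram-based similarity measures mentioned above, the sketch below flags a shot boundary wherever the color-histogram distance between consecutive frames exceeds a threshold (frames are assumed to be RGB arrays of shape (height, width, 3); the threshold value is data-dependent and chosen here only for illustration):

import numpy as np

def color_histogram(frame, bins=8):
    # Coarse joint RGB histogram, normalized so that the bins sum to 1.
    hist, _ = np.histogramdd(frame.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def shot_boundaries(frames, threshold=0.4):
    # A cut is declared where the L1 histogram difference jumps above threshold.
    cuts, prev = [], color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

The CI approaches reviewed below can be seen as replacing this fixed threshold and single feature with learned or fuzzy decision mechanisms over multiple features.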

The organization of video information in video databases requires automatic temporal segmentation with minimal user interaction. As neural networks are capable of learning the characteristics of various video segments and clustering them accordingly, Cao and Suganthan [27] developed a neural network based technique to segment a video sequence into shots automatically and with a minimum number of user-defined parameters. They proposed to employ Growing Neural Gas (GNG) networks and to integrate multiple frame difference features to efficiently detect shot boundaries in the video. Experimental results were presented to illustrate the good performance of the proposed scheme on real video sequences.

Lo and Wang [26] proposed a video segmentation method using a Histogram-Based Fuzzy C-Means (HBFCM) clustering algorithm. This algorithm is a hybrid of two approaches and is composed of three phases: the feature extraction phase, the clustering phase, and the key-frame selection phase. In the first phase, differences between color histograms are extracted as features. In the second phase, Fuzzy C-Means (FCM) is used to group features into three clusters: the Shot Change (SC) cluster, the Suspected Shot Change (SSC) cluster, and the No Shot Change (NSC) cluster. In the last phase, shot change frames are identified from the SC and SSC clusters and then used to segment video sequences into shots. Finally, key frames are selected from each shot. The authors' simulation results indicate that the HBFCM clustering algorithm is robust and applicable to various types of video sequences.

Ford [20] presented a fuzzy logic system for the detection and classification of shot boundaries in uncompressed video sequences. It integrates multiple sources of information and knowledge of editing procedures to detect shot boundaries. Furthermore, the system classifies the editing process employed to create the shot boundary into one of the following categories: abrupt cut, fade-in, fade-out, or dissolve. This system was tested on a database containing a wide variety of video classes. It achieved combined recall and precision rates that significantly exceed those of existing threshold-based techniques, and it correctly classified a high percentage of the detected boundaries.

Video temporal segmentation is normally the first and an important step for content-based video applications. Many features, including pixel differences, color histograms, motion, and edge information, have been widely used and reported in the literature for detecting shot cuts inside videos. Although existing research on shot cut detection is active and extensive, it remains a challenge to achieve accurate detection of all types of shot boundaries with one single algorithm. Hui Fang et al. [24] proposed a fuzzy logic approach to integrate hybrid features for detecting shot boundaries inside general videos. The fuzzy logic approach contains two processing modes, where one is dedicated to the detection of abrupt shot cuts, including short dissolved shots, and the other to the detection of gradual shot cuts. These two modes are unified by a mode selector, which decides which mode the scheme should work in so as to achieve the best possible detection performance. Using the publicly available test data set from Carleton University, extensive experiments were carried out, and the test results illustrate that the proposed algorithm outperforms representative existing algorithms in terms of precision and recall rates.

Mitra [71] proposed an evolutionary rough c-means clustering algorithm. Genetic algorithms are employed to tune the threshold and the relative importance of the upper and lower approximations of the rough sets modeling the clusters. The Davies–Bouldin clustering validity index is used as the fitness function, which is minimized while arriving at an optimal partitioning. A comparative study of its performance is made with related partitive algorithms. The effectiveness of the algorithm is demonstrated on real and synthetic data sets, including microarray gene expression data from bioinformatics. In the same study, the author noted that the threshold parameter measures the relative distance of an object Xk from a pair of clusters having centroids ceni and cenj. The smaller the value of the threshold, the more likely is Xk to lie within the rough boundary (between the upper and lower approximations) of a cluster. This implies that only those points which definitely belong to a cluster (lie close to the centroid) occur within the lower approximation. A large value of the threshold implies a relaxation of this criterion, such that more patterns are allowed to belong to any of the lower approximations. The parameter wlow controls the importance of the objects lying within the lower approximation of a cluster in determining its centroid. A lower wlow implies a higher wup, and hence an increased importance of patterns located in the rough boundary of a cluster towards the positioning of its centroid.

Das et al. [103] presented a framework that hybridizes rough set theory with the particle swarm optimization algorithm. The hybrid rough-PSO technique has been used for grouping the pixels of an image in its intensity space. Medical images very often become corrupted with noise, and fast and efficient segmentation of such noisy images (which is essential for their further interpretation in many cases) has remained a challenging problem for years. In their work, they treat image segmentation as a clustering problem. Each cluster is modeled with a rough set. PSO is employed to tune the threshold and the relative importance of the upper and lower approximations of the rough sets. The Davies–Bouldin clustering validity index is used as the fitness function, which is minimized while arriving at an optimal partitioning.

Raducanu et al. [106] proposed a Morphological Neural Network (MNN) algorithm for associative memories (in its two cases: autoassociative and heteroassociative) and proposed its use as a preprocessing step for human shape detection in a vision-based navigation problem for mobile robots. It is reported that MNNs can be trained in a single computing step, possess unlimited storage capacity, and have perfect recall of the patterns. Recall is also very fast, because MNN recall does not involve the search for an energy minimum.

Adaptation of C-Means to Rough Set Theory

C-means clustering is an iterative technique that is used to partition an image into C clusters. Fuzzy C-Means (FCM) is one of the most commonly used fuzzy clustering techniques for different degree estimation problems, especially in medical image processing [104, 107, 116]. Lingras [70] described modifications of clustering based on Genetic Algorithms, the K-means algorithm, and Kohonen Self-Organizing Maps (SOM). These modifications make it possible to represent clusters as rough sets [97]. In this work, Lingras established a rough K-means framework and extended the concept of C-means by viewing each cluster as an interval or rough set [69]. Here is a brief summary of his pioneering clustering work.

K-means clustering is one of the most popular statistical clustering techniques used in segmentation of medical images [66, 72, 94, 108–110]. The name K-means originates from the means of the k clusters that are created from n objects. Let us assume that the objects are represented by m-dimensional vectors. The objective is to assign these n objects to k clusters. Each of the clusters is also represented by an m-dimensional vector, which is the centroid or mean vector for that cluster. The process begins by randomly choosing k objects as the centroids of the k clusters. The objects are assigned to one of the k clusters based on the minimum value of the distance d(v, x) between the object vector v = (v_1, ..., v_j, ..., v_m) and the cluster vector x = (x_1, ..., x_j, ..., x_m). After the assignment of all the objects to the various clusters, the new centroid vectors of the clusters are calculated as

x_j = \frac{\sum_{v \in x} v_j}{\mathrm{SOC}}, \quad \text{where } 1 \le j \le m, \qquad (8)

where SOC is the size of cluster x.

Lingras [70] mentioned that the incorporation of rough sets into K-means clustering requires the addition of the concept of lower and upper bounds. The calculation of the centroids of clusters from conventional K-means needs to be modified to include the effects of lower as well as upper bounds. The modified centroid calculations for rough sets are then given by:

cen_j = w_{low} \times \frac{\sum_{v \in R(x)} v_j}{|R(x)|} + w_{up} \times \frac{\sum_{v \in BN_R(x)} v_j}{|BN_R(x)|}, \qquad (9)

where 1 ≤ j ≤ m. The parameters w_low and w_up correspond to the relative importance of the lower and upper bounds, with w_low + w_up = 1. If the upper bound of each cluster were equal to its lower bound, the clusters would be conventional clusters. In that case the boundary region BN_R(x) would be empty, the second term in the equation would be ignored, and the above equation would reduce to the conventional centroid calculation. The next step in the modification of the K-means algorithm for rough sets is to design criteria to determine whether an object belongs to the upper or lower bound of a cluster; for more details refer to [70]. The main steps of the algorithm are provided in Algorithm 1.


Algorithm 1 Rough C-means Algorithm
1: Set x_i as the initial means for the c clusters.
2: Initialize the population of particles encoding the parameters threshold and w_low.
3: Assign each data object x_k to the lower or upper approximation of the clusters c_i by computing the difference in its distances:

diff = d(x_k, cen_i) − d(x_k, cen_j), (10)

where cen_i and cen_j are the centroids of a pair of clusters.
4: if diff < δ then
5: x_k belongs to the upper approximations of the cen_i and cen_j clusters and cannot be in any lower approximation.
6: else
7: x_k belongs to the lower approximation of the cluster c_i such that the distance d(x_k, cen_i) is minimum over the c clusters.
8: end if
9: Compute the new means using equation (9).
10: repeat steps 3–9
11: until convergence, i.e., there are no new assignments.
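A minimal sketch of one iteration of Algorithm 1, assuming Euclidean distances. Here diff is interpreted as the distance to the second-nearest centroid minus the distance to the nearest centroid, which matches the intent of steps 4–7, and the boundary lists hold the objects that belong only to upper approximations. The PSO/GA tuning of threshold and w_low from step 2 is omitted.

import numpy as np

def rough_kmeans_step(X, cen, delta, w_low, w_up):
    # One pass: assign objects, then update centroids with the
    # weighted lower/boundary means of equation (9).
    c = len(cen)
    lower = [[] for _ in range(c)]
    boundary = [[] for _ in range(c)]
    for xk in X:
        d = np.linalg.norm(cen - xk, axis=1)
        i, j = np.argsort(d)[:2]          # nearest and second-nearest
        if d[j] - d[i] < delta:           # ambiguous per eq. (10):
            boundary[i].append(xk)        # xk joins both upper
            boundary[j].append(xk)        # approximations only
        else:
            lower[i].append(xk)           # definite member of cluster i
    new_cen = cen.copy()
    for i in range(c):
        L, B = np.array(lower[i]), np.array(boundary[i])
        if len(L) and len(B):
            new_cen[i] = w_low * L.mean(axis=0) + w_up * B.mean(axis=0)
        elif len(L):                      # empty boundary: eq. (9)
            new_cen[i] = L.mean(axis=0)   # reduces to the usual centroid
        elif len(B):
            new_cen[i] = B.mean(axis=0)
    return new_cen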

5 Computational Intelligence in Multimedia Watermarking

Multimedia watermarking technology has evolved very quickly during the last few years. A digital watermark is information that is imperceptibly and robustly embedded in the host data such that it cannot be removed. A watermark typically contains information about the origin, status, or recipient of the host data. A digital watermarking system essentially consists of a watermark encoder and a watermark decoder. The watermark encoder inserts a watermark into the host signal and the watermark decoder detects the presence of the watermark signal. Note that an entity called the watermark key is used during the process of embedding and detecting watermarks. The watermark key has a one-to-one correspondence with the watermark signal (i.e., a unique watermark key exists for every watermark signal). The watermark key is private and known only to authorized parties, and it ensures that only authorized parties can detect the watermark. Further, note that the communication channel can be noisy and hostile (i.e., prone to security attacks), and hence digital watermarking techniques should be resilient to both noise and security attacks. Figure 7 illustrates the digital watermarking methodology in general.

Fig. 7. General digital watermarking architecture [9]
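A minimal encoder/decoder sketch in the spirit of Fig. 7, assuming an additive spread-spectrum embedding into randomly keyed DCT coefficients. This is a generic textbook construction, not the scheme of any paper cited here; SciPy's dctn/idctn are assumed available, and a practical scheme would restrict the embedding positions (e.g., to mid-band coefficients).

import numpy as np
from scipy.fft import dctn, idctn

def embed(img, bits, key, alpha=2.0):
    # Encoder: the watermark key seeds a PRNG that picks coefficient
    # positions, giving the one-to-one key/watermark correspondence.
    rng = np.random.default_rng(key)
    C = dctn(img.astype(float), norm='ortho').ravel()
    idx = rng.choice(C.size, size=len(bits), replace=False)
    C[idx] += alpha * (2 * np.asarray(bits) - 1)     # +-alpha per bit
    return idctn(C.reshape(img.shape), norm='ortho')

def detect(marked, original, n_bits, key):
    # Decoder (non-blind): the sign of the coefficient change at the
    # keyed positions recovers each embedded bit.
    rng = np.random.default_rng(key)
    Cw = dctn(marked.astype(float), norm='ortho').ravel()
    Co = dctn(original.astype(float), norm='ortho').ravel()
    idx = rng.choice(Cw.size, size=n_bits, replace=False)
    return (Cw[idx] - Co[idx] > 0).astype(int)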

The development of watermarking methods involves several design trade-offs: (1) robustness, which deals with the ability of the watermark to resist attempts by an attacker to destroy it by modifying the size, rotation, quality, or other visual aspects of the video; (2) security, which deals with the ability of the watermark to resist attempts by a sophisticated attacker to remove it or destroy it via cryptanalysis, without modifying the media itself; and (3) perceptual fidelity, i.e., the perceived visual quality of the marked media compared to the original, unmarked video. Copyright protection is the most prominent application of watermarking techniques, but others exist, including data authentication by means of fragile watermarks which are impaired or destroyed by manipulations, embedded transmission of value-added services within multimedia data, and embedded data labeling for purposes other than copyright protection, such as data monitoring and tracking. An example of a data-monitoring system is the automatic registration and monitoring of broadcast radio programs such that royalties are automatically paid to the IPR owners of the broadcast data. Focusing on these problems, many successful works have been addressed and discussed. For example, Lou et al. [32] proposed a copyright protection scheme based on chaos and secret sharing techniques. Instead of modifying the original image to embed a watermark in it, the proposed scheme first extracts a feature from the image. Then, the extracted feature and the watermark are scrambled by a chaos technique. Finally, the secret sharing technique is used to construct a shadow image. The watermark can be retrieved by performing an XOR operation between the shadow images. It is reported that, compared with other works, the introduced scheme is secure and robust in resisting various attacks.

Cao et al. [37] proposed a novel audio watermarking algorithm based on neural networks. By transforming the original audio sequence into the 1D wavelet domain and selecting proper positions, several watermark bits were embedded. Before transmission, the method uses neural networks to learn the relation characteristics between the original audio and the watermarked audio. Owing to the learning and adaptive capabilities of neural networks, the trained networks can almost exactly extract the watermark from the watermarked audio even under audio processing attacks. Extensive experimental results showed that the proposed method is significantly robust: it is immune against such attacks as low-pass filtering, addition of noise, resampling, and median filtering.


Wei Lu et al. [33] presented a robust digital image watermarking scheme using a neural network detector. First, the original image is divided into four subimages by subsampling. Then, a random binary watermark sequence is embedded into the DCT domain of these subimages. A fixed binary sequence is added to the head of the payload watermark as samples to train the neural network detector. Because of its good adaptive and learning abilities, the neural network detector can extract the payload watermark nearly exactly. Experimental results illustrated the good performance of the proposed scheme in resisting common signal processing attacks.

Lou and Yin [30] proposed an adaptive digital watermarking approach based upon a human visual system model and a fuzzy clustering technique. The human visual system model is utilized to guarantee that the watermarked image is imperceptible, while the fuzzy clustering approach is employed to adapt the watermark strength to the local characteristics of the image. In their experiments, the scheme provides a more robust and transparent watermark.

Cheng-Ri Piao et al. [34] proposed a new watermarking scheme in which a logo watermark is embedded into the Discrete Wavelet Transform (DWT) domain of a color image using a Back-Propagation Neural network (BPN). In order to strengthen imperceptibility and robustness, the original image is transformed from the RGB color space to a brightness and chroma space (YCrCb). After the transformation, the watermark is embedded into the DWT coefficients of the chroma components, Cr and Cb. A secret key determines the locations in the image where the watermark is embedded, which prevents possible pirates from easily removing the watermark. The BPN learns the characteristics of the color image, and the watermark is then embedded and extracted using the trained neural network. Experimental results showed that the proposed method has good imperceptibility and high robustness to common image processing attacks.
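For reference, the RGB-to-YCrCb transformation used in such schemes is conventionally the BT.601 conversion; a minimal sketch (full-range variant with offset-128 chroma, as in JPEG, which is an assumed choice rather than the exact transform of [34]):

import numpy as np

def rgb_to_ycrcb(rgb):
    # rgb: float array of shape (..., 3) with values in [0, 255].
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b      # luma (brightness)
    cr = 0.713 * (r - y) + 128.0               # red-difference chroma
    cb = 0.564 * (b - y) + 128.0               # blue-difference chroma
    return np.stack([y, cr, cb], axis=-1)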

Zheng Liu et al. [35] introduced sensor-based authentication watermarking with the concept of authentication on demand, in which user requirements are adopted as parameters for authentication. In addition, fuzzy identification of the outputs of multiple authentication sensors introduces the ability to finely tune the authentication type and degree. With this approach, authentication sensitivity to malicious attacks is enhanced. It is reported that the introduced approach is more robust against allowed modifications, and that the authors' algorithm provides a new function: detecting the attack method.

Maher et al. [31] proposed a novel digital video watermarking scheme based on multiresolution motion estimation and an artificial neural network. A multiresolution motion estimation algorithm is adopted to preferentially allocate the watermark to coefficients containing motion. In addition, embedding and extraction of the watermark are based on the relationship between a wavelet coefficient and its neighbors. A neural network is used to memorize the relationships between coefficients in a 3 × 3 block of the image. Experimental results illustrated that embedding the watermark where the picture content is moving is less perceptible. Further, empirical results demonstrated that the proposed scheme is robust against common video processing attacks.

Several discrete wavelet transform based techniques are used for watermarking digital images. Although these techniques are robust to some attacks, none of them is robust when a different set of parameters is used or when other attacks (such as low-pass filtering) are applied. In order to make the watermark stronger and less susceptible to different types of attacks, it is essential to find the maximum amount of watermark that can be embedded before the watermark becomes visible. Davis and Najarian [111] used neural networks to implement an automated system for creating maximum-strength watermarks.

Diego and Manuel [29] proposed an evolutionary algorithm for the enhancement of digital semi-fragile watermarking based on the manipulation of the image Discrete Cosine Transform (DCT). The algorithm searches for the optimal locations in the DCT of an image in which to place the mark image's DCT coefficients. The problem is stated as a multi-objective optimization problem (MOP) that involves the simultaneous minimization of distortion and robustness criteria.

Chang et al. [28] developed a novel transform-domain digital watermarking scheme that uses a visually meaningful binary image as the watermark. The scheme embeds the watermark information adaptively, with localized embedding strength according to the noise sensitivity level of the host image. Fuzzy adaptive resonance theory (Fuzzy-ART) classification is used to identify appropriate locations for watermark insertion, and its control parameters add agility to the clustering results to thwart counterfeiting attacks. The scalability of the visually recognizable watermark is exploited to devise a robust weighted recovery method with a composite watermark. The proposed watermarking schemes can also be employed for oblivious detection. Unlike most oblivious watermarking schemes, their methods allow the use of a visually meaningful image as the watermark. For automation-friendly verification, a normalized correlation metric that suits the statistical properties of their methods is used. The experimental results demonstrated that the proposed techniques can survive several kinds of image processing attacks and JPEG lossy compression.
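Normalized correlation is the usual detection statistic in such verification steps. A minimal generic sketch follows; the bipolar bit mapping is an assumption, not necessarily the exact statistic of [28]:

import numpy as np

def normalized_correlation(extracted, reference):
    # NC in [-1, 1]; values near 1 indicate the watermark is present.
    a = 2.0 * np.asarray(extracted, float) - 1.0   # {0,1} -> {-1,+1}
    b = 2.0 * np.asarray(reference, float) - 1.0
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))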

Tsai et al. [36] proposed a new intelligent audio watermarking method based on the characteristics of the human auditory system (HAS) and neural network techniques in the DCT domain. The method makes the watermark imperceptible by using the audio masking characteristics of the HAS. Moreover, the method exploits a neural network to memorize the relationships between the original audio signals and the watermarked audio signals, and it is therefore capable of extracting watermarks without the original audio signals. Finally, experimental results illustrate that the method is significantly robust against common attacks, supporting the copyright protection of digital audio.


6 Computational Intelligence in Content-Based Multimedia Indexing and Retrieval

There is a growing number of applications which extensively use visual media. A key requirement in those applications is efficient access to the stored multimedia information for the purposes of indexing, fast retrieval, and scene analysis. The amount of multimedia content available to the public and to researchers has been growing rapidly in the last decades and is expected to increase exponentially in the years to come. This development puts a great emphasis on automated content-based retrieval methods, which retrieve and index multimedia based on its content. Such methods, however, suffer from a serious problem: the semantic gap, i.e., the wide gulf between the low-level features used by computer systems and the high-level concepts understood by human beings.

Mats et al. [46] proposed a method for content-based multimedia retrieval of objects with visual, aural, and textual properties. In their method, training examples of objects belonging to a specific semantic class are associated with their low-level visual descriptors (such as MPEG-7) and textual features such as frequencies of significant keywords. A fuzzy mapping from a semantic class in the training set to a class of similar objects in the test set is created by using Self-Organizing Maps (SOMs) trained from automatically extracted low-level descriptors. The authors performed several experiments with different textual features to evaluate the potential of their approach in bridging the gap from visual features to semantic concepts by the use of textual presentations. Their initial results show a promising increase in retrieval performance. The PicSOM [45] content-based information retrieval (CBIR) system was used with video data and semantic classes from the NIST TRECVID 2005 evaluation set. The TRECVID set contains TV broadcasts in different languages and textual data acquired by using automatic speech recognition software and machine translation where appropriate. Both the training and evaluation sets were accompanied by verified semantic ground truth sets, such as videos depicting explosions or fire.

In the PicSOM system, the videos and the parts extracted from them were arranged as hierarchical trees, as shown in Fig. 8, with the main video as the parent object and the different extracted media types as child objects. In this way, relevance assessments can be transferred between related objects in the PicSOM algorithm. From each media type different features were extracted, and Self-Organizing Maps were trained from these, as shown with some examples in Fig. 8.

Fig. 8. The hierarchy of videos and examples of multi-modal SOMs [46]
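A minimal sketch of the classic SOM training rule underlying such feature maps; the grid size, decay schedules, and Gaussian neighborhood below are generic textbook choices, not PicSOM's settings:

import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0):
    # For each input, find the best-matching unit (BMU) and pull it and
    # its grid neighbors toward the input; rate and radius shrink over time.
    h, w = grid
    rng = np.random.default_rng(0)
    W = rng.random((h, w, data.shape[1]))       # map of prototype vectors
    ys, xs = np.mgrid[0:h, 0:w]
    for t in range(epochs):
        lr = lr0 * (1.0 - t / epochs)
        sigma = sigma0 * (1.0 - t / epochs) + 0.5
        for v in data:
            d = np.linalg.norm(W - v, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)   # BMU
            g = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
            W += lr * g[..., None] * (v - W)    # neighborhood update
    return W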

Ming Li and Tong Wang [22] presented a new image retrieval technique based on concept lattices, named Concept Lattices-Based Image Retrieval, in which lattice browsing allows one to reach a group of images via a single path. Because constructing concept lattices with a general method can produce many redundant attributes, the authors also proposed a method of attribute reduction for concept lattices based on the discernibility matrix and Boolean calculation to reduce the context of the concept lattices. The scale of the problem is reduced by this method, and at the same time the efficiency of image retrieval is improved, which is reflected in their experiments.

Fuzzy set methods have already been applied to the representation of flexible queries and to the modeling of uncertain pieces of information in database systems, as well as in information retrieval. This methodology seems even more promising in multimedia databases, which have a complex structure and from which documents have to be retrieved and selected not only by their contents, but also by the idea the user has of their appearance, through queries specified in terms of the user's criteria. Dubois et al. [14] provided a preliminary investigation of the potential applications of fuzzy logic in multimedia databases. The problem of comparing semi-structured documents is first discussed, and querying issues are then particularly emphasized. They distinguish two types of request: those which can be handled within some extended version of an SQL-like language, and those for which one has to elicit the user's preferences through examples.

Hassanien and Jafar [8] presented an application of rough sets to feature reduction, classification, and retrieval for image databases in the framework of content-based image retrieval systems. The presented description of rough set theory emphasizes the role of reducts in statistical feature selection, data reduction, and rule generation in image databases. A key feature of the introduced approach is that segmentation and detailed object representation are not required. In order to obtain better retrieval results, the image texture features can be combined with the color features to form a powerful discriminating feature vector for each image. Texture features from the co-occurrence matrix are extracted, represented, and normalized in an attribute vector; the rough set dependency rules are then generated directly from the real-valued attribute vector. Next, the rough set reduction technique is applied to find all reducts of the data, which contain the minimal subsets of attributes associated with a class label for classification. A new similarity distance measure based on rough sets was also presented. The classification and retrieval performance are measured using the recall-precision measure, as is standard in all content-based image retrieval systems. Figure 9 illustrates the image classification and retrieval scheme based on the rough set theory framework (see also [114]).

Fig. 9. CBIR in rough sets frameworks [8]
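A minimal sketch of the standard recall-precision computation used in such evaluations:

def precision_recall(retrieved, relevant):
    # precision: fraction of retrieved images that are relevant;
    # recall: fraction of all relevant images that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: if 6 of 10 retrieved images are relevant and the database
# contains 20 relevant images in all, precision = 0.6 and recall = 0.3.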

Chen and Wang [113] proposed a fuzzy logic approach, UFM (Unified Feature Matching), for region-based image retrieval. In their retrieval system, an image is represented by a set of segmented regions, each of which is characterized by a fuzzy feature (fuzzy set) reflecting color, texture, and shape properties. As a result, an image is associated with a family of fuzzy features corresponding to regions. Fuzzy features naturally characterize the gradual transition between regions (blurry boundaries) within an image and incorporate the segmentation-related uncertainties into the retrieval algorithm. The resemblance of two images is then defined as the overall similarity between two families of fuzzy features, quantified by a similarity measure, the UFM measure, which integrates properties of all the regions in the images. Compared with similarity measures based on individual regions and on all regions with crisp-valued feature representations, the UFM measure greatly reduces the influence of inaccurate segmentation and provides a very intuitive quantification. UFM has been implemented as part of the authors' experimental image retrieval system. The performance of the system was illustrated using examples from an image database of about 60,000 general-purpose images.
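A loose sketch of region-based fuzzy matching in this spirit: each region's fuzzy feature is modeled with a Cauchy-like membership function, and image resemblance aggregates the best region matches in both directions. The membership form, the width parameter, and the averaging are illustrative assumptions and do not reproduce the exact UFM measure of [113].

import numpy as np

def membership(f, center, width=1.0):
    # Degree to which feature vector f belongs to a region's fuzzy
    # feature centered at `center` (Cauchy-like, in (0, 1]).
    return 1.0 / (1.0 + (np.linalg.norm(np.asarray(f) - center) / width) ** 2)

def image_similarity(regions_a, regions_b, weights_a, weights_b):
    # Weighted best-match similarity between two families of fuzzy
    # features; weights can encode region significance (e.g., area).
    s_ab = sum(wa * max(membership(ra, rb) for rb in regions_b)
               for ra, wa in zip(regions_a, weights_a))
    s_ba = sum(wb * max(membership(rb, ra) for ra in regions_a)
               for rb, wb in zip(regions_b, weights_b))
    return 0.5 * (s_ab + s_ba)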

As digital video databases become more and more pervasive, finding video in large databases becomes a major problem. Because of the nature of video (streamed objects), accessing the content of such databases is inherently a time-consuming operation. Kulkarni [23] proposed a neuro-fuzzy approach for retrieving a specific video clip from a video database. Fuzzy logic was used for expressing queries in terms of natural language, and a neural network was designed to learn the meaning of these queries. The queries were designed based on features such as the color and texture of shots, scenes, and objects in video clips. An error backpropagation algorithm was proposed to learn the meaning of queries in fuzzy terms such as very similar, similar, and somewhat similar. Preliminary experiments were conducted on a small video database with different combinations of queries using color and texture features along with a visual video clip, and very promising results were achieved.
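As a toy illustration of such fuzzy query terms, raw similarity scores can be mapped to graded memberships; the breakpoints below are arbitrary assumptions, whereas in [23] the meanings of the terms are learned by the network:

def ramp(x, a, b):
    # Increasing ramp membership: 0 below a, rising to 1 at b.
    return min(1.0, max(0.0, (x - a) / (b - a)))

def tri(x, a, b, c):
    # Triangular membership on [a, c] peaking at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def query_terms(similarity):
    # Degrees to which a similarity score matches each linguistic term.
    return {
        'somewhat similar': tri(similarity, 0.2, 0.45, 0.7),
        'similar': tri(similarity, 0.5, 0.7, 0.9),
        'very similar': ramp(similarity, 0.75, 0.95),
    }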

7 Conclusions, Challenges and Future Directions

During the last decades, multimedia processing has emerged as an important technology for generating content based on images, video, audio, graphics, and text. Furthermore, the recent developments represented by high-definition multimedia content and interactive television will generate a huge volume of data and important computing problems connected with the creation, processing, and management of multimedia content. Multimedia processing is a challenging domain for several reasons: it requires both high computation rates and memory bandwidth; it is a multirate computing problem; and it requires low-cost implementations for high-volume markets. The past years have witnessed a large number of interesting applications of various computational intelligence techniques, such as neural networks, fuzzy logic, evolutionary computation, swarm intelligence, reinforcement learning, rough sets, and a generalization of rough sets called near sets, to intelligent multimedia processing. Multimedia computing and communication is therefore a challenging and fruitful area for CI to play crucial roles in resolving problems and providing solutions for multimedia image/audio/video processing that understand, represent, and process the media, their segments, indexing, and retrieval.

Another challenge is to develop near set-based methods. Near sets offer a generalization of traditional rough set theory and an approach to classifying perceptual objects by means of features; they could lead to new and useful results in object recognition, particularly in solving multimedia problems such as classification and segmentation, as well as to an application of the near set approach in 2D and 3D interactive gaming with a vision system that learns and serves as the backbone of an adaptive telerehabilitation system for patients with finger, hand, arm, and balance disabilities. Each remote node in the telerehabilitation system includes a vision system that learns to track the behavior of a patient. Images deemed to be 'interesting' (e.g., images representing erratic behavior) are stored as well as forwarded to a rehabilitation center for follow-up. In such a system, there is a need to identify images that are in some sense near images representing some standard or norm. This research has led to a study of methods of automating image segmentation as a first step in near set-based image processing.

In recent years, the rapidly increasing demand for advanced interactive multimedia applications, such as video telephony, video games, and TV broadcasting, has resulted in spectacular strides in the progress of wireless communication systems. However, these applications are always stringently constrained by current wireless system architectures because of the high data rates required for video transmission. To better serve this need, 4G broadband mobile systems are being developed and are expected to increase mobile data transmission rates and bring higher spectral efficiency, lower cost per transmitted bit, and increased flexibility of mobile terminals and networks. The new technology strives to eliminate the distinction between video over wireless and video over wireline networks. In the meantime, great opportunities are provided for proposing novel wireless video protocols and applications, and for developing advanced video coding and communication systems and algorithms for next-generation video applications that can take maximum advantage of 4G wireless systems. New video applications over 4G wireless systems are a challenge for CI researchers.

The current third generation (3G) wireless systems and the next generation (4G) wireless systems under development support higher bit rates. However, the high error rates and stringent delay constraints in wireless systems are still significant obstacles for these applications and services. On the other hand, the development of more advanced wireless systems provides opportunities for proposing novel wireless multimedia protocols and new applications and services that can take maximum advantage of those systems.

In mobile ad hoc networks, specific intrusion detection systems are needed to safeguard them, since traditional intrusion prevention techniques are not sufficient to protect mobile ad hoc networks [1]. Intrusion detection is therefore another challenging and fruitful area for CI to play crucial roles in resolving problems, providing solutions for intrusion detection systems, and authenticating the maps produced by the application of intelligent techniques using watermarking, biometric, and cryptology technologies.

Combining different kinds of computational intelligence techniques in the application area of multimedia processing has become one of the most important directions of research in intelligent information processing. Neural networks have shown a strong ability to solve complex problems in many multimedia processing tasks. From the perspective of the specific rough set approaches that need to be applied, explorations into possible applications of hybridizing rough sets with other intelligent systems like neural networks [96], genetic algorithms, fuzzy approaches, etc., to multimedia processing and pattern recognition, in particular multimedia computing problems, could lead to new and interesting avenues of research and remain a challenge for CI researchers.

In conclusion, many successful algorithms applied in multimedia processing have been reported in the literature, and the applications of rough sets in multimedia processing have to be analyzed individually. Rough sets offer a new way to deal with issues that cannot be addressed by traditional image processing algorithms or by other classification techniques. By introducing rough sets, algorithms developed for multimedia processing and pattern recognition often become more intelligent and robust, providing human-interpretable, low-cost, sufficiently exact solutions compared to other intelligence techniques.

Finally, the main purpose of this article is to present to the CI and multimedia research communities the state of the art in CI applications to multimedia computing, and to inspire further research and development on new applications and new concepts in new trend-setting directions and in exploiting computational intelligence.

References

1. Abraham A., Jain R., Thomas J., and Han S.Y. (2007) D-SCIDS: Distributed soft computing intrusion detection systems. Journal of Network and Computer Applications, vol. 30, no. 1, pp. 81–98.
2. Bishop C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
3. Kohonen T. (1988) Self-Organization and Associative Memory. Springer, Berlin Heidelberg New York.
4. Carpenter G. and Grossberg S. (1995) Adaptive Resonance Theory (ART). In: Arbib M.A. (ed.), The Handbook of Brain Theory and Neural Networks. MIT, Cambridge, pp. 79–82.
5. Grossberg S. (1976) Adaptive pattern classification and universal recoding: Parallel development and coding of neural feature detectors. Biological Cybernetics, vol. 23, pp. 121–134.


6. Abraham A. (2001) Neuro-fuzzy systems: State-of-the-art modeling techniques, connectionist models of neurons, learning processes, and artificial intelligence. In: Jose Mira and Alberto Prieto (eds.), Lecture Notes in Computer Science, vol. 2084, Springer, Berlin Heidelberg New York, pp. 269–276.
7. Nguyen H.T. and Walker E.A. (1999) A First Course in Fuzzy Logic. CRC, Boca Raton.
8. Hassanien A.E. and Jafar Ali (2003) Image classification and retrieval algorithm based on rough set theory. South African Computer Journal (SACJ), vol. 30, pp. 9–16.
9. Hassanien A.E. (2006) Hiding iris data for authentication of digital images using wavelet theory. International Journal of Pattern Recognition and Image Analysis, vol. 16, no. 4, pp. 637–643.
10. Hassanien A.E., Ali J.M., and Hajime N. (2004) Detection of spiculated masses in mammograms based on fuzzy image processing. In: 7th International Conference on Artificial Intelligence and Soft Computing, ICAISC 2004, Zakopane, Poland, 7–11 June. Lecture Notes in Artificial Intelligence, vol. 3070. Springer, Berlin Heidelberg New York, pp. 1002–1007.
11. Fogel L.J., Owens A.J., and Walsh M.J. (1967) Artificial Intelligence Through Simulated Evolution. Wiley, New York.
12. Fogel D.B. (1999) Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 2nd edition. IEEE, Piscataway, NJ.
13. Pearl J. (1997) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco.
14. Dubois D., Prade H., and Sedes F. (2001) Fuzzy logic techniques in multimedia database querying: A preliminary investigation of the potentials. IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 3, pp. 383–392.
15. Holland J. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
16. Goldberg D.E. (1989) Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading.
17. Zlokolica V., Piurica A., Philips W., Schulte S., and Kerre E. (2006) Fuzzy logic recursive motion detection and denoising of video sequences. Journal of Electronic Imaging, vol. 15, no. 2.
18. Koza J.R. (1992) Genetic Programming. MIT, Cambridge, MA.
19. Buscicchio C.A., Grecki P., and Caponetti L. (2006) Speech emotion recognition using spiking neural networks. In: Esposito F., Ras Z.W., Malerba D., and Semeraro G. (eds.), Foundations of Intelligent Systems, Lecture Notes in Computer Science, vol. 4203, Springer, Berlin Heidelberg New York, pp. 38–46.
20. Ford R.M. (2005) Fuzzy logic methods for video shot boundary detection and classification. In: Tan Y.-P., Yap K.H., and Wang L. (eds.), Intelligent Multimedia Processing with Soft Computing, Studies in Fuzziness and Soft Computing, vol. 168, Springer, Berlin Heidelberg New York, pp. 151–169.
21. Back T. (1996) Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, New York.
22. Ming Li and Tong Wang (2005) An approach to image retrieval based on concept lattices and rough set theory. Sixth International Conference on Parallel and Distributed Computing, Applications and Technologies, 5–8 Dec., pp. 845–849.


23. Kulkarni S. (2004) Neural-fuzzy approach for content-based retrieval of digital video. Canadian Conference on Electrical and Computer Engineering, vol. 4, 2–5 May, pp. 2235–2238.
24. Hui Fang, Jianmin Jiang, and Yue Feng (2006) A fuzzy logic approach for detection of video shot boundaries. Pattern Recognition, vol. 39, no. 11, pp. 2092–2100.
25. Selouani S.-A. and O'Shaughnessy D. (2003) On the use of evolutionary algorithms to improve the robustness of continuous speech recognition systems in adverse conditions. EURASIP Journal on Applied Signal Processing, vol. 8, pp. 814–823.
26. Lo C.-C. and Wang S.-J. (2001) Video segmentation using a histogram-based fuzzy C-means clustering algorithm. The 10th IEEE International Conference on Fuzzy Systems, vol. 2, 2–5 Dec., pp. 920–923.
27. Cao X. and Suganthan P.N. (2002) Neural network based temporal video segmentation. International Journal of Neural Systems, vol. 12, no. 3–4, pp. 263–629.
28. Chang C.-H., Ye Z., and Zhang M. (2005) Fuzzy-ART based adaptive digital watermarking scheme. IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 65–81.
29. Diego Sal Diaz and Manuel Grana Romay (2005) Introducing a watermarking with a multi-objective genetic algorithm. Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, Washington DC, USA, pp. 2219–2220.
30. Lou D.-C. and Yin T.-L. (2001) Digital watermarking using fuzzy clustering technique. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (Japan), vol. E84-A, no. 8, pp. 2052–2060.
31. Maher El-arbi, Ben Amar C., and Nicolas H. (2006) Video watermarking based on neural networks. IEEE International Conference on Multimedia and Expo, Toronto, Canada, pp. 1577–1580.
32. Der-Chyuan Lou, Jieh-Ming Shieh, and Hao-Kuan Tso (2005) Copyright protection scheme based on chaos and secret sharing techniques. Optical Engineering, vol. 44, no. 11, pp. 117004–117010.
33. Wei Lu, Hongtao Lu, and FuLai Chung (2005) Subsampling-based robust watermarking using neural network detector. Advances in Neural Networks, ISNN 2005, Lecture Notes in Computer Science, vol. 3497, pp. 801–806.
34. Cheng-Ri Piao, Sehyeong Cho, and Seung-Soo Han (2006) Color image watermarking algorithm using BPN neural networks. Neural Information Processing, Lecture Notes in Computer Science, vol. 4234, pp. 234–242.
35. Zheng Liu, Xue Li, and Dong Z. (2004) Multimedia authentication with sensor-based watermarking. Proc. of the International Workshop on Multimedia and Security, Magdeburg, Germany, pp. 155–159.
36. Hung-Hsu Tsai, Ji-Shiung Cheng, and Pao-Ta Yu (2003) Audio watermarking based on HAS and neural networks in DCT domain. EURASIP Journal on Applied Signal Processing, vol. 2003, no. 3, pp. 252–263.
37. Cao L., Wang X., Wang Z., and Bai S. (2005) Neural network based audio watermarking algorithm. In: ICMIT 2005: Information Systems and Signal Processing, Wei Y., Chong K.T., Takahashi T. (eds.), Proceedings of the SPIE, vol. 6041, pp. 175–179.
38. Alessandro Bugatti, Alessandra Flammini, and Pierangelo Migliorati (2002) Audio classification in speech and music: A comparison between a statistical and a neural approach. EURASIP Journal on Applied Signal Processing, vol. 2002, no. 4, pp. 372–378.

39. Lim Ee Hui, Seng K.P., and Tse K.M. (2004) RBF neural network mouth tracking for audio–visual speech recognition system. IEEE Region 10 Conference TENCON 2004, 21–24 Nov., pp. 84–87.
40. Jian Zhou, Guoyin Wang, Yong Yang, and Peijun Chen (2006) Speech emotion recognition based on rough set and SVM. 5th IEEE International Conference on Cognitive Informatics ICCI 2006, 17–19 July, vol. 1, pp. 53–61.
41. Faraj M.-I. and Bigun J. (2007) Audio–visual person authentication using lip-motion from orientation maps. Pattern Recognition Letters, vol. 28, no. 11, pp. 1368–1382.
42. Halavati R., Shouraki S.B., Eshraghi M., Alemzadeh M., and Ziaie P. (2004) A novel fuzzy approach to speech recognition. Fourth International Conference on Hybrid Intelligent Systems, 5–8 Dec., pp. 340–345.
43. Eugene I. Bovbel and Dzmitry V. Tsishkou (2000) Belarussian speech recognition using genetic algorithms. Third International Workshop on Text, Speech and Dialogue, Brno, Czech Republic, pp. 185–204.
44. Fellenz W.A., Taylor J.G., Cowie R., Douglas-Cowie E., Piat F., Kollias S., Orovas C., and Apolloni B. (2000) On emotion recognition of faces and of speech using neural networks, fuzzy logic and the ASSESS system. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 2, IJCNN 2000, pp. 93–98.
45. Laaksonen J., Koskela M., and Oja E. (2002) PicSOM: Self-organizing image retrieval with MPEG-7 content descriptions. IEEE Transactions on Neural Networks, Special Issue on Intelligent Multimedia Processing, vol. 13, no. 4, pp. 841–853.
46. Mats S., Jorma L., Matti P., and Timo H. (2006) Retrieval of multimedia objects by combining semantic information from visual and textual descriptors. Proceedings of 16th International Conference on Artificial Neural Networks (ICANN 2006), pp. 75–83, Athens, Greece, September 2006.
47. Kostek B. and Andrzej C. (2001) Employing fuzzy logic and noisy speech for automatic fitting of hearing aid. 142nd Meeting of the Acoustical Society of America, no. 5, vol. 110, p. 2680, Fort Lauderdale, USA.
48. Liu J., Wang Z., and Xiao X. (2007) A hybrid SVM/DDBHMM decision fusion modeling for robust continuous digital speech recognition. Pattern Recognition Letters, vol. 28, no. 8, pp. 912–920.
49. Ing-Jr Ding (2007) Incremental MLLR speaker adaptation by fuzzy logic control. Pattern Recognition, vol. 40, no. 11, pp. 3110–3119.
50. Andrzej C. (2003) Automatic identification of sound source position employing neural networks and rough sets. Pattern Recognition Letters, vol. 24, pp. 921–933.
51. Andrzej C., Kostek B., and Henryk S. (2002) Diagnostic system for speech articulation and speech understanding. 144th Meeting of the Acoustical Society of America (First Pan-American/Iberian Meeting on Acoustics), Journal of the Acoustical Society of America, vol. 112, no. 5, Cancun, Mexico.
52. Andrzej C., Andrzej K., and Kostek B. (2003) Intelligent processing of stuttered speech. Journal of Intelligent Information Systems, vol. 21, no. 2, pp. 143–171.


53. Pawel Zwan, Piotr Szczuko, Bozena Kostek, and Andrzej Czyzewski (2007) Automatic singing voice recognition employing neural networks and rough sets. RSEISP 2007, pp. 793–802.
54. Andrzej C. and Marek S. (2002) Pitch estimation enhancement employing neural network-based music prediction. Proc. IASTED International Conference, Artificial Intelligence and Soft Computing, pp. 413–418, Banff, Canada.
55. Hendessi F., Ghayoori A., and Gulliver T.A. (2005) A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM. ACM Transactions on Asian Language Information Processing (TALIP), vol. 4, no. 1, pp. 38–52.
56. Orhan Karaali, Gerald Corrigan, and Ira Gerson (1996) Speech synthesis with neural networks. World Congress on Neural Networks, San Diego, Sept. 1996, pp. 45–50.
57. Corrigan G., Massey N., and Schnurr O. (2000) Transition-based speech synthesis using neural networks. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 945–948.
58. Shan Meng and Youwei Zhang (2003) A method of visual speech feature area localization. Proceedings of the International Conference on Neural Networks and Signal Processing, 2003, vol. 2, 14–17 Dec., pp. 1173–1176.
59. Sun-Yuan Kung and Jenq-Neng Hwang (1998) Neural networks for intelligent multimedia processing. Proceedings of the IEEE, vol. 86, no. 6, pp. 1244–1272.
60. Frankel J., Richmond K., King S., and Taylor P. (2000) An automatic speech recognition system using neural networks and linear dynamic models to recover and model articulatory traces. Proc. ICSLP, 2000.
61. Schuller B., Reiter S., and Rigoll G. (2006) Evolutionary feature generation in speech emotion recognition. IEEE International Conference on Multimedia, pp. 5–8.
62. Lewis T.W. and Powers D.M.W. Audio–visual speech recognition using red exclusion and neural networks. Proceedings of the Twenty-Fifth Australasian Conference on Computer Science, vol. 4, Melbourne, Victoria, Australia, pp. 149–156.
63. Nakamura S. (2002) Statistical multimodal integration for audio–visual speech processing. IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 854–866.
64. Guido R.C., Pereira J.C., and Slaets J.F.W. (2007) Advances on pattern recognition for speech and audio processing. Pattern Recognition Letters, vol. 28, no. 11, pp. 1283–1284.
65. Vahideh Sadat Sadeghi and Khashayar Yaghmaie (2006) Vowel recognition using neural networks. International Journal of Computer Science and Network Security (IJCSNS), vol. 6, no. 12, pp. 154–158.
66. Hartigan J.A. and Wong M.A. (1979) Algorithm AS136: A K-means clustering algorithm. Applied Statistics, vol. 28, pp. 100–108.
67. Henry C. and Peters J.F. (2007) Image pattern recognition using approximation spaces and near sets. In: Proceedings of Eleventh International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), Joint Rough Set Symposium (JRS 2007), Lecture Notes in Artificial Intelligence, vol. 4482, pp. 475–482.
68. Kerre E. and Nachtegael M. (2000) Fuzzy techniques in image processing: Techniques and applications. Studies in Fuzziness and Soft Computing, vol. 52, Physica, Heidelberg.


69. Lingras P. and West C. (2004) Interval set clustering of web users with rough K-means. Journal of Intelligent Information Systems, vol. 23, no. 1, pp. 5–16.
70. Lingras P. (2007) Applications of rough set based K-means, Kohonen, GA clustering. Transactions on Rough Sets, VII, pp. 120–139.
71. Mitra Sushmita (2004) An evolutionary rough partitive clustering. Pattern Recognition Letters, vol. 25, pp. 1439–1449.
72. Ng H.P., Ong S.H., Foong K.W.C., Goh P.S., and Nowinski W.L. (2006) Medical image segmentation using K-means clustering and improved watershed algorithm. IEEE Southwest Symposium on Image Analysis and Interpretation, pp. 61–65.
73. Nachtegael M., Van-Der-Weken M., Van-De-Ville D., Kerre D., Philips W., and Lemahieu I. (2001) An overview of classical and fuzzy-classical filters for noise reduction. 10th International IEEE Conference on Fuzzy Systems FUZZ-IEEE 2001, Melbourne, Australia, pp. 3–6.
74. Ning S., Ziarko W., Hamilton J., and Cercone N. (1995) Using rough sets as tools for knowledge discovery. In: Fayyad U.M. and Uthurusamy R. (eds.), First International Conference on Knowledge Discovery and Data Mining KDD'95, Montreal, Canada, AAAI, pp. 263–268.
75. Pawlak Z. (1991) Rough Sets – Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht.
76. Pawlak Z., Grzymala-Busse J., Slowinski R., and Ziarko W. (1995) Rough sets. Communications of the ACM, vol. 38, no. 11, pp. 88–95.
77. Polkowski L. (2003) Rough Sets: Mathematical Foundations. Physica, Heidelberg.
78. Peters J.F. (2007) Near sets: Special theory about nearness of objects. Fundamenta Informaticae, vol. 75, no. 1–4, pp. 407–433.
79. Peters J.F. (2007) Near sets: General theory about nearness of objects. Applied Mathematical Sciences, vol. 1, no. 53, pp. 2609–2629.
80. Peters J.F., Skowron A., and Stepaniuk J. (2007) Nearness of objects: Extension of approximation space model. Fundamenta Informaticae, vol. 79, pp. 1–16.
81. Peters J.F. (2007) Near sets: Toward approximation space-based object recognition. In: Yao Y., Lingras P., Wu W.-Z., Szczuka M., Cercone N., Slezak D. (eds.), Proc. of the Second Int. Conf. on Rough Sets and Knowledge Technology (RSKT07), Joint Rough Set Symposium (JRS07), Lecture Notes in Artificial Intelligence, vol. 4481, Springer, Berlin Heidelberg New York, pp. 22–33.
82. Peters J.F. and Ramanna S. (2007) Feature selection: Near set approach. In: Ras Z.W., Tsumoto S., and Zighed D.A. (eds.), 3rd Int. Workshop on Mining Complex Data (MCD'07), ECML/PKDD-2007, Lecture Notes in Artificial Intelligence, Springer, Berlin Heidelberg New York, in press.
83. Peters J.F., Skowron A., and Stepaniuk J. (2006) Nearness in approximation spaces. In: Lindemann G., Schlilngloff H. et al. (eds.), Proc. Concurrency, Specification & Programming (CS&P'2006), Informatik-Berichte Nr. 206, Humboldt-Universitat zu Berlin, pp. 434–445.
84. Orłowska E. (1982) Semantics of vague concepts. Applications of rough sets. Institute for Computer Science, Polish Academy of Sciences, Report 469, 1982. See also: Orłowska E., Semantics of vague concepts. In: Dorn G. and Weingartner P. (eds.), Foundations of Logic and Linguistics. Problems and Solutions, Plenum, London, 1985, pp. 465–482.


85. Orłowska E. (1990) Verisimilitude based on concept analysis. Studia Logica, vol. 49, no. 3, pp. 307–320.
86. Pawlak Z. (1981) Classification of objects by means of attributes. Institute for Computer Science, Polish Academy of Sciences, Report 429, 1981.
87. Pawlak Z. (1982) Rough sets. International Journal of Computing and Information Sciences, vol. 11, pp. 341–356.
88. Pawlak Z. and Skowron A. (2007) Rudiments of rough sets. Information Sciences, vol. 177, pp. 3–27.
89. Peters J.F. (2008) Classification of perceptual objects by means of features. International Journal of Information Technology and Intelligent Computing, vol. 3, no. 2, pp. 1–35.
90. Lockery D. and Peters J.F. (2007) Robotic target tracking with approximation space-based feedback during reinforcement learning. In: Proceedings of Eleventh International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007), Joint Rough Set Symposium (JRS 2007), Lecture Notes in Artificial Intelligence, vol. 4482, pp. 483–490.
91. Peters J.F., Borkowski M., Henry C., and Lockery D. (2006) Monocular vision system that learns with approximation spaces. In: Ella A., Lingras P., Slezak D., and Suraj Z. (eds.), Rough Set Computing: Toward Perception Based Computing, Idea Group Publishing, Hershey, PA, pp. 1–22.
92. Peters J.F., Borkowski M., Henry C., Lockery D., Gunderson D., and Ramanna S. (2006) Line-crawling bots that inspect electric power transmission line equipment. Proc. 3rd Int. Conf. on Autonomous Robots and Agents 2006 (ICARA 2006), Palmerston North, NZ, 2006, pp. 39–44.
93. Peters J.F. (2008) Approximation and perception in ethology-based reinforcement learning. In: Pedrycz W., Skowron A., and Kreinovich V. (eds.), Handbook on Granular Computing, Wiley, New York, Ch. 30, pp. 1–41.
94. Peters J.F. and Borkowski M. (2004) K-means indiscernibility relation over pixels. Proc. 4th Int. Conf. on Rough Sets and Current Trends in Computing (RSCTC 2004), Uppsala, Sweden, 1–5 June, pp. 580–585.
95. Peters J.F. and Pedrycz W. (2007) Computational intelligence. In: EEE Encyclopedia. Wiley, New York, in press.
96. Peters J.F., Liting H., and Ramanna S. (2001) Rough neural computing in signal analysis. Computational Intelligence, vol. 17, no. 3, pp. 493–513.
97. Peters J.F., Skowron A., Suraj Z., Rzasa W., and Borkowski M. (2002) Clustering: A rough set approach to constructing information granules. Soft Computing and Distributed Processing, Proceedings of 6th International Conference, SCDP 2002, pp. 57–61.
98. Petrosino A. and Salvi G. (2006) Rough fuzzy set based scale space transforms and their use in image analysis. International Journal of Approximate Reasoning, vol. 41, no. 2, pp. 212–228.
99. Shankar B.U. (2007) Novel classification and segmentation techniques with application to remotely sensed images. Transactions on Rough Sets, vol. VII, LNCS 4400, pp. 295–380.
100. Otto C.W. (2007) Motivating rehabilitation exercise using instrumented objects to play video games via a configurable universal translation peripheral. M.Sc. Thesis, Supervisors: Peters J.F. and Szturm T., Department of Electrical and Computer Engineering, University of Manitoba, 2007.


101. Szturm T., Peters J.F., Otto C., Kapadia N., and Desai A. (2008) Task-specific rehabilitation of finger-hand function using interactive computer gaming. Archives for Physical Medicine and Rehabilitation, submitted.
102. Sandeep Chandana and Rene V. Mayorga (2006) RANFIS: Rough adaptive neuro-fuzzy inference system. International Journal of Computational Intelligence, vol. 3, no. 4, pp. 289–295.
103. Swagatam Das, Ajith Abraham, and Subir Kumar Sarkar (2006) A hybrid rough set – particle swarm algorithm for image pixel classification. Proceedings of the Sixth International Conference on Hybrid Intelligent Systems, 13–15 Dec., pp. 26–32.
104. Bezdek J.C., Ehrlich R., and Full W. (1984) FCM: The fuzzy C-means clustering algorithm. Computers and Geosciences, vol. 10, pp. 191–203.
105. Cetin O., Kantor A., King S., Bartels C., Magimai-Doss M., Frankel J., and Livescu K. (2007) An articulatory feature-based tandem approach and factored observation modeling. IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, Honolulu, HI, vol. 4, pp. IV-645–IV-648.
106. Raducanu B., Grana M., and Sussner P. (2001) Morphological neural networks for vision based self-localization. IEEE International Conference on Robotics and Automation, ICRA 2001, vol. 2, pp. 2059–2064.
107. Ahmed M.N., Yamany S.M., Nevin M., and Farag A.A. (2003) A modified fuzzy C-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging, vol. 21, no. 3, pp. 193–199.
108. Yan M.X.H. and Karp J.S. (1994) Segmentation of 3D brain MR using an adaptive K-means clustering algorithm. IEEE Conference on Nuclear Science Symposium and Medical Imaging, vol. 4, pp. 1529–1533.
109. Voges K.E., Pope N.K.L.I., and Brown M.R. (2002) Cluster analysis of marketing data: A comparison of K-means, rough set, and rough genetic approaches. In: Abbas H.A., Sarker R.A., and Newton C.S. (eds.), Heuristics and Optimization for Knowledge Discovery, Idea Group Publishing, pp. 208–216.
110. Chen C.W., Luo J.B., and Parker K.J. (1998) Image segmentation via adaptive K-mean clustering and knowledge-based morphological operations with biomedical applications. IEEE Transactions on Image Processing, vol. 7, no. 12, pp. 1673–1683.
111. Davis K.J. and Najarian K. (2001) Maximizing strength of digital watermarks using neural networks. International Joint Conference on Neural Networks, IJCNN 2001, vol. 4, pp. 2893–2898.
112. Sankar K. Pal (2001) Fuzzy image processing and recognition: Uncertainties handling and applications. International Journal of Image and Graphics, vol. 1, no. 2, pp. 169–195.
113. Yixin Chen and James Z. Wang (2002) A region-based fuzzy feature matching approach to content-based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1252–1267.
114. Yu Wang, Mingyue Ding, Chengping Zhou, and Ying Hu (2006) Interactive relevance feedback mechanism for image retrieval using rough set. Knowledge-Based Systems, vol. 19, no. 8, pp. 696–703.
115. Zadeh L.A. (1965) Fuzzy sets. Information and Control, vol. 8, pp. 338–353.
116. Zbigniew W. (1987) Rough approximation of shapes in pattern recognition. Computer Vision, Graphics, and Image Processing, vol. 40, no. 2, pp. 228–249.


Computational Intelligence in Multimedia Networking and Communications: Trends and Future Directions

Parthasarathy Guturu

Electrical Engineering Department, University of North Texas, Denton, TX 76207-7102, [email protected]

This paper presents a review of the current literature on computational intelligence based approaches to various problems in multimedia networking and communications, such as call admission control, management of resources and traffic, routing, multicasting, media composition, encoding, media streaming and synchronization, and on-demand servers and services. Challenges to be addressed and future directions of research are also presented.

1 Introduction

We currently live in an age of information revolution. With high-impact applications launched every day in various fields such as e-commerce, entertainment, education, medicine, defense, and homeland security, there has been an explosive growth in the demand for exchange of various forms of information (text, graphics, audio, video, etc.), collectively termed multimedia. The colossal amounts of multimedia data that need to be transmitted over the Internet, in turn, necessitate smart multimedia communication methods with capabilities to manage resources effectively, reason under uncertainty, and handle imprecise or incomplete information. To this end, many multimedia researchers in recent times have developed computational intelligence (CI) based methods for various aspects of multimedia communications. The objective of this book chapter is to present to the multimedia research community the state of the art in these CI applications to multimedia communications and networking, and to motivate research in new trend-setting directions. Hence, we review in the following sections some representative CI methods for quality of service (QoS) provisioning by call/connection admission control, adaptive allocation of resources, and traffic management. Some important contributions to multicast routing, multimedia composition, streaming and media synchronization, and multimedia services/servers are also surveyed. Most of the methods available in the current literature are either fuzzy or neural network based, though some papers adopt a hybrid approach using neuro-fuzzy controllers. A few papers present genetic/evolutionary methods for problems in multimedia communications. From these applications, it appears that the various computational intelligence frameworks are not competitive, but rather complementary. For the sake of completeness, we present a brief review of the computational intelligence paradigm in the following subsection.

1.1 Computational Intelligence Paradigm

According to Wikipedia, the free online encyclopedia, computational intelligence (CI) is a branch of artificial intelligence (AI) that combines elements of learning, adaptation, evolution, and fuzzy logic (as well as rough sets) to create programs equipped with the intelligence to solve problems effectively. It uses meta-heuristic algorithms and strategies such as statistical learning machines, fuzzy systems, neural networks, evolutionary computation, swarm intelligence, and artificial immune systems. In contrast, traditional AI (or GOFAI, i.e., good old-fashioned artificial intelligence, as per the term coined by John Haugeland, professor of philosophy at the University of Chicago) relies on symbolic approaches. In this subsection, we present an overview of only those CI techniques that have been used in the multimedia communication and network research documents cited in the present survey.

Neural Networks

An artificial neural network (ANN), or simply neural network (NN), is an interconnected set of simple nonlinear processing elements called neurons because of their role similar to that of neurons in a biological system. The neurons in an ANN take inputs from either the external environment or other neurons in the system. The neuronal outputs may similarly be transmitted to either other neurons (through interconnection weights) or the external environment. The neurons that take inputs from and send outputs to exclusively other neurons are called hidden neurons. These hidden neurons have been found to be pivotal to the learning of complex input–output mappings. The methods for adaptation of inter-neuron weights based on the observed outputs to obtain desired outputs are called NN training or learning methods. The NN interconnection patterns are called topologies. The most popular NN topology and the associated learning algorithm are the feed-forward neural network (FFNN) and the back-propagation learning (BPL) algorithm, respectively. The FFNN is also known as the multi-layer perceptron (MLP). In an FFNN, neurons are arranged into multiple layers consisting of an input, an output, and one or more hidden layers, with unidirectional inter-layer neuronal connections (weights) from the input through to the output layer, as shown in Fig. 1. Determination of inter-layer connection topologies, and the number of hidden layers as well as the number of neurons in each of them, based on the problem being solved, are open research issues.


Fig. 1. A typical four-layer feed-forward neural network

Still, simple three-layer FFNNs with total interconnectivity between neurons in consecutive layers, as shown in the figure, have been successfully applied to multimedia and other applications where system adaptability and the capability to learn complex functional dependencies of outputs on inputs are of paramount importance. The standard BPL algorithm used for training the FFNN interconnection weights is a supervised learning algorithm (i.e., one with a training set of input–output pairs). In this algorithm, the errors are first computed at the output layer as the differences between the desired and observed outputs for training sample inputs, and then the inter-neuronal connection weights from the neurons in the layer preceding the output layer to those in the output layer are updated (using mathematical formulae) to produce the desired outputs. The errors in the outputs of the previous-stage neurons are also similarly computed, and the process of computing the weights and the neuron outputs is repeated for the different layers of the FFNN, proceeding in the backward direction till the input layer is reached. A detailed discussion of the FFNNs and the BPL may be found in [1].
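To make the update mechanics concrete, the following is a minimal sketch of one BPL training step for a three-layer FFNN with sigmoid units and squared error; the layer sizes, learning rate, and weight initialization are illustrative assumptions, not taken from [1].

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, (4, 3))   # input (3 units) -> hidden (4 units)
W2 = rng.normal(0.0, 0.5, (1, 4))   # hidden (4 units) -> output (1 unit)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpl_step(x, t, lr=0.1):
    """One supervised training step for input x and desired output t."""
    global W1, W2
    # Forward pass through the two weight layers
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # Output-layer error first, then error propagated back to the hidden layer
    delta_out = (y - t) * y * (1.0 - y)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    # Gradient-descent weight updates, proceeding in the backward direction
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return float(0.5 * np.sum((y - t) ** 2))   # squared error before the update
```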

A recurrent neural network is a generalized neural network in which bidirectional asymmetric inter-neuronal connections are possible; it does not need to have a layered organization of neurons. A recurrent NN training algorithm, which is similar to the BPL (because of almost the same mathematical formulae for updating inter-neuronal weights) and hence known as the recurrent back-propagation (RBP) algorithm, has been proposed independently by Almeida [2] and Pineda [3]. A special form of recurrent NN is the Hopfield neural net (HNN) [4], which uses binary threshold gates as processing elements (neurons), a totally connected network topology, and symmetric inter-neuronal connection weights. An HNN may be configured to find the local optima (minima) of criterion functions in some problems if those functions can be cast in the form of the following energy function, related to the Ising model [5] in physics:


E = -\frac{1}{2} \sum_{i} \sum_{j \neq i} W_{ij} S_i S_j + \sum_{i} \theta_i S_i \qquad (1)

where W_{ij} is the interconnection weight between the neurons i and j, S_i is the binary (0 or 1) state (output) of the ith neuron, and θ_i is the threshold used to compute the output of the ith neuron from the sum of its input excitations.
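As a concrete (illustrative, not from the chapter) rendering of (1), the sketch below evaluates the energy of a 0/1-state HNN and runs one asynchronous update sweep; a symmetric W with zero diagonal is assumed.

```python
import numpy as np

def hopfield_energy(W, theta, S):
    """Energy of equation (1); W symmetric with zero diagonal, S a 0/1 vector."""
    return -0.5 * S @ W @ S + theta @ S

def async_sweep(W, theta, S):
    """One asynchronous sweep: each binary threshold unit updates in random
    order; each accepted update can only lower (or keep) the energy above."""
    for i in np.random.permutation(len(S)):
        S[i] = 1 if W[i] @ S > theta[i] else 0
    return S
```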

Gelenbe [6] proposed a novel neural network model called the random neural network (RNN) and applied it to various problems, including those related to multimedia communications. The RNN is a set of neurons in which each neuron has a potential (an integer random variable). A neuron is said to be excited if its potential is strictly positive, and, in that state, it randomly sends signals (to other neurons or to the environment) according to a Poisson process with a specific rate. The potential of the neuron sending the signal is always decreased by 1, irrespective of whether the signal sent is positive or negative. The potential of a neuron receiving a signal from another neuron or the environment is increased or decreased by 1 depending upon whether the signal received is positive or negative. In [6], Gelenbe establishes a connection between the RNN and queuing networks. The weights of an RNN (actually, the probabilities for sending positive and negative signals from one neuron to another, and the probabilities for sending an excitation to, or receiving one from, the environment) can be trained using an algorithm resembling the back-propagation algorithm of classical neural networks.
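The potential dynamics just described can be mimicked with a small event-driven simulation; everything below (the function name, parameter layout, and restriction to external positive arrivals) is an illustrative assumption rather than Gelenbe's analysis, which instead solves for the stationary excitation probabilities analytically.

```python
import random

def simulate_rnn(steps, fire_rate, p_pos, p_neg, ext_pos, potentials):
    """fire_rate[i]: Poisson firing rate of neuron i when excited;
    p_pos[i][j]/p_neg[i][j]: probabilities that a firing of i sends a
    positive/negative signal to j (row sums may be < 1, the remainder being
    departure to the environment); ext_pos[i]: rate of external positive
    signals at i; potentials[i]: integer potential of neuron i."""
    n = len(potentials)
    for _ in range(steps):
        events, rates = [], []
        for i in range(n):
            if potentials[i] > 0:                  # excited neuron may fire
                events.append(("fire", i)); rates.append(fire_rate[i])
            if ext_pos[i] > 0:                     # external positive arrivals
                events.append(("ext", i)); rates.append(ext_pos[i])
        if not events:
            break
        kind, i = random.choices(events, weights=rates)[0]
        if kind == "ext":
            potentials[i] += 1                     # arrival excites neuron i
            continue
        potentials[i] -= 1                         # the sender always loses 1
        u, acc = random.random(), 0.0
        for j in range(n):                         # route the emitted signal
            acc += p_pos[i][j]
            if u < acc:
                potentials[j] += 1; break          # positive signal: +1
            acc += p_neg[i][j]
            if u < acc:
                potentials[j] = max(potentials[j] - 1, 0); break  # negative: -1
        # otherwise the signal departs to the environment
    return potentials
```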

While the above-discussed neural networks employ the supervised learning (learning with a teacher, i.e., a training set of pattern vectors with their classification labels) paradigm, the self-organized feature map (SOFM) proposed by Kohonen [7] is a neural network with the capability to learn its own weights. A SOFM (sometimes called SOM) is modeled after the human brain, in which different types of sensory information are processed by different parts. In a SOFM, each input is connected through a synaptic weight to all the output neurons arranged in a two- or three-dimensional grid. Thus every neuron in the system has an input weight vector with the same dimensionality as the input pattern vectors. At the beginning of the learning process, these weights are initialized to small random values. Then, as and when a pattern vector is presented at the input of the network, the neuronal weight vector which is the closest (according to the Euclidean distance measure) to the input pattern vector is determined. The corresponding neuron is called the BMU (best matching unit). The weight vectors of the BMU and its neighbors on the grid are adjusted towards the input pattern. By repeated training with input patterns this way, the SOFM learns to produce neuronal activations in different locations of the network depending upon the input patterns. Once the network is completely trained this way, a new pattern may be input to the network and classified based on the location where the neuronal activity is produced. It may be noted here that the input patterns used in the training phase are unlabeled samples, and hence the SOFM may be categorized as an unsupervised learning method.
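A minimal training loop capturing the BMU search and neighborhood update might look as follows; the grid size, decay schedules, and Gaussian neighborhood are common illustrative choices rather than Kohonen's exact prescriptions.

```python
import numpy as np

def train_sofm(patterns, grid=(8, 8), epochs=20):
    """patterns: array of shape (n_samples, dim) of unlabeled input vectors."""
    rng = np.random.default_rng(1)
    W = rng.random((grid[0], grid[1], patterns.shape[1]))  # small random weights
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = 0.5 * (1.0 - epoch / epochs)                  # decaying learning rate
        radius = max(1.0, (grid[0] / 2.0) * (1.0 - epoch / epochs))
        for x in patterns:
            # Best matching unit: the grid cell whose weight vector is nearest to x
            bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=-1)),
                                   grid)
            # Gaussian neighborhood: units near the BMU also move toward x
            g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1)
                       / (2.0 * radius ** 2))
            W += lr * g[..., None] * (x - W)
    return W
```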


Reinforcement Learning

Mathematically, the reinforcement learning (RL) [8] system model is a triplet M_RL = {S, A, R}, where S is the set of states of a problem environment, A is the set of actions that can be taken by an agent seeking to solve the problem, and R is the set of scalar rewards associated with an action and the current state of the system. In the RL paradigm, an agent perceives, at each time instant t, the current state s_t (∈ S) of the environment and the set of actions A(s_t) ⊆ A that can be taken in that state, and chooses an action a ∈ A(s_t). The chosen action results in a reward r ∈ R and drives the environment into a new state s_{t+1}. An RL formulation seeks to determine the optimal policy (or the series of actions the agent needs to take) to maximize the total reward. In this formulation, there is no concept of supervised learning of optimal system parameters by means of corrective actions based on the errors in the observed system outputs for a given training set of input–output pairs. Instead, the choice of actions is aided by a finite-state Markov decision process (MDP) [9] model of the environment. Even though it is not necessary to make use of ANNs for the implementation of an RL formulation, they are usually included as part of the solution.
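A common concrete instance of this loop is tabular Q-learning, sketched below; `env` is an assumed toy interface with reset(), actions(s), and step(a) methods, not a real library API, and the hyper-parameter values are illustrative.

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning over the state/action/reward triplet above."""
    Q = defaultdict(float)                 # Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)          # admissible actions A(s_t)
            if random.random() < eps:      # epsilon-greedy exploration
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda a2: Q[(s, a2)])
            s2, r, done = env.step(a)      # reward r and next state s_{t+1}
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            # Temporal-difference update toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```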

Fuzzy Logic Based Intelligent Control

In any control system, the actions to control some aspect of system performance (e.g., congestion control or the maximum end-to-end delay in the network) are based on the system inputs (e.g., message packet loss, link delay). However, for robust control, the system needs to be capable of managing the uncertainties in the system environment and the imprecisions in the system input measurements. A mathematical formalism useful for the design of such robust systems was pioneered by Zadeh [10] and was first applied by Mamdani [11] to control system design. It is variously known as fuzzy set theory or fuzzy logic. The logic variables (e.g., congestion) in fuzzy set theory do not take crisp binary (false or true) values, but take continuous values (called membership values) in the range [0,1]. Another class of fuzzy logic variables that capture our vague, ambiguous, qualitative, or subjective view of the world are termed linguistic variables. They may be loosely defined as variables that take graded membership or simply linguistic values (e.g., high, medium, low, rough, smooth). Modern fuzzy control systems use a set of fuzzy linguistic rules of the form given below to derive inferences about the output variables from the input variables, and use the output estimates so obtained for control:

If packet loss is low and network delay is high, then network congestion is medium.

Since such rules are gathered from experts in the field, such control systems come under the class of fuzzy expert systems. Each rule (proposition) in the rule base of the fuzzy control system is associated with a membership value (degree of truth) in the range [0,1]. The membership values of the various applicable propositions are aggregated by a properly designed inference engine in a fuzzy logic system, and the system outputs are estimated. Figure 2 depicts the block diagram of a typical fuzzy logic based system for obtaining the control parameters with the problem state vector as its input.

Fig. 2. Block diagram of a fuzzy logic system

The fuzzifier module in the system converts the crisp input values into linguistic values such as high and medium so that the inference engine can generate the fuzzy values for the output parameters using rules, such as the one indicated above, from the rule base. When more than one rule is applicable, the membership values of the different rules are aggregated to obtain a consistent estimate. The defuzzifier then converts the fuzzy values of the output variables into crisp values. In this article, a few neuro-fuzzy applications to multimedia communications are also presented. These methods typically use neural networks for learning both the rules and the membership functions associated with the rules.
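To illustrate the fuzzifier/inference/defuzzifier pipeline of Fig. 2 on the example rule above, here is a single-rule Mamdani-style sketch; the membership functions, universes of discourse, and numeric ranges are invented for exposition.

```python
def shoulder_down(x, b, c):
    """Membership 1 up to b, falling linearly to 0 at c (a 'low' shape)."""
    if x <= b: return 1.0
    if x >= c: return 0.0
    return (c - x) / (c - b)

def shoulder_up(x, a, b):
    """Membership 0 up to a, rising linearly to 1 at b (a 'high' shape)."""
    if x <= a: return 0.0
    if x >= b: return 1.0
    return (x - a) / (b - a)

def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c: return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def infer_congestion(packet_loss, delay_ms):
    # Fuzzification of the crisp inputs
    loss_low = shoulder_down(packet_loss, 0.01, 0.05)
    delay_high = shoulder_up(delay_ms, 100.0, 300.0)
    # Rule firing strength: fuzzy AND taken as min
    w = min(loss_low, delay_high)
    # Clip the 'medium' output set and defuzzify by the centroid method
    xs = [i / 100.0 for i in range(101)]
    clipped = [min(w, tri(x, 0.2, 0.5, 0.8)) for x in xs]
    area = sum(clipped)
    return sum(x * m for x, m in zip(xs, clipped)) / area if area > 0 else 0.0
```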

Rough Sets

Rough set theory (RST) is another approximate reasoning formalism developed for handling imprecision. Here, the values of a set of attributes are represented by two sets: one for the lower and the other for the upper approximation of the original crisp set. Even though the upper and lower sets in the original formalism of Pawlak [12] are crisp sets, they could as well be fuzzy sets. A rough set inference mechanism similar to fuzzy inferencing could be used for estimation of system parameters. The RST uses an information system framework (ISF) I, which may be defined as a tuple (O, A), where O is a non-empty set of objects and A is a set of their attributes. In an ISF, an information table maps the value tuples (of the attributes) onto the objects. Two objects are defined to be discernible if they can be distinguished based on their value tuples. This discernibility relationship between objects induces an equivalence partition among the objects. In case each partition so obtained is a singleton, every object of the system can be distinguished using the given set of attributes. Now, when we consider a subset of the attribute set A, a target set T (⊆ O) of objects may not be expressible exactly, because some subsets of objects in T could be indiscernible. Hence a rough set involving an upper approximation set of objects possibly in the target and a lower approximation set of objects positively in the target may be used to represent the target. From the rough sets so constructed, it is possible to obtain reduct subsets of attributes, that is, subsets of attributes, each of which induces the same equivalence partition on O as the original set A. The reduct subset is not unique, because different subsets of A could induce the same equivalence partition. The intersection of such reduct subsets gives the core (or indispensable) set of attributes of the information system I. Similarly, when the union of all reduct sets is removed from A, we get the set of superfluous attributes. Thus the rough set is a useful tool for capturing the knowledge represented in the information system with a smaller number of attributes.
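The lower and upper approximations can be computed directly from the information table; the sketch below does so for a made-up example, with all names being illustrative assumptions.

```python
from collections import defaultdict

def approximations(table, attrs, target):
    """table: {object: {attribute: value}}; attrs: the attribute subset used;
    target: the target set T of objects. Returns (lower, upper)."""
    # Equivalence classes of the indiscernibility relation induced by attrs
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    lower, upper = set(), set()
    for eq in classes.values():
        if eq <= target:      # class entirely inside T: positively in the target
            lower |= eq
        if eq & target:       # class intersects T: possibly in the target
            upper |= eq
    return lower, upper

# Example (invented): the two attributes cannot discern x2 from x3
table = {"x1": {"rate": "high", "loss": "low"},
         "x2": {"rate": "low",  "loss": "low"},
         "x3": {"rate": "low",  "loss": "low"}}
print(approximations(table, ["rate", "loss"], {"x1", "x2"}))
# -> ({'x1'}, {'x1', 'x2', 'x3'})
```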

Dubois [13] extended the formalism of RST by introducing rough fuzzy sets and fuzzy rough sets. Among the applications of RST, the RST-based approaches proposed by Stefanowski [14] for the induction of decision rules, and Ziarko's [15] rough set methodology for data mining, are worth mentioning.

Evolutionary Computation

Evolutionary computation (EC) is the generic name for a number of allied biology-inspired technologies such as evolutionary programming (EP) [16], genetic algorithms (GAs) [17], evolution strategies (ES) [18, 19], and genetic programming (GP) [20]. The goal of an EC algorithm is to find a quasi-optimal solution to a problem by mimicking genetic evolutionary processes. Particularly when the optimality measure on a large set of variables characterizing the problem solution turns out to be a non-convex multi-modal function, an exhaustive search for an optimal solution is ruled out because of exponential search complexity. In this situation, the EC approach is an effective strategy for intelligent exploration of the search space to find near-optimal solutions. In this approach, the search starts with an initial population of candidate solutions, each represented by a vector of randomly chosen values for the problem solution variables. Now, based on the analogy between the process of obtaining an optimal solution and genetic evolution, the solution vector may be considered the equivalent of a chromosome, with the individual components of the vector representing the genes. The optimality measures (or equivalently, the fitness functions) of the individual candidate solutions in any generation, including the initial one, may be computed using the functional form of the measure, and the solutions may be ordered based on their fitness values. The candidate solutions for the next generation (offspring) may then be obtained by applying a crossover operation to parent solutions usually chosen from the population of the current generation using the so-called elitist strategy, that is, the strategy of choosing the participants of the crossover randomly from a selected few members (those with the highest fitness values) of the population of the current generation. The traditional crossover is based on the concept of exchange of genetic material between parents (without many constraints), but one can also design a new crossover mechanism based on the particular optimization problem being dealt with. Crossover points are also chosen randomly. Another very important genetic operator, next to the above-discussed crossover, used in EAs is mutation. It simulates genetic mutation by replacing the value of a randomly chosen component of a solution vector with a new value from the set of values admitted for that component. By producing successive generations of new populations through selective replacement of the members of an old generation by the fittest among the new members (offspring) produced with the help of the two operators discussed above, genetic evolution may be continued for a number of generations to obtain solutions closer and closer to optimality. Problem representation, design of crossover and mutation operators, strategies for replacement of the members of the old population with new members, and the optimal choice of EC parameters such as population size, number of generations for evolution, etc., are open research issues in this area.
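The generational loop described above reduces to a few lines of code; the sketch below is a generic illustration with an elitist parent pool, one-point crossover, and per-gene mutation, all parameter values being illustrative.

```python
import random

def genetic_search(fitness, n_genes, pop_size=50, generations=100,
                   elite=5, p_mut=0.02):
    """Maximize fitness(chromosome) over real-valued gene vectors in [0, 1)."""
    pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # fittest candidates first
        parents = pop[:elite]                    # elitist parent pool
        children = []
        while len(children) < pop_size - elite:
            p1, p2 = random.sample(parents, 2)   # pick two elite parents
            cut = random.randrange(1, n_genes)   # random one-point crossover
            child = p1[:cut] + p2[cut:]
            for g in range(n_genes):             # per-gene mutation
                if random.random() < p_mut:
                    child[g] = random.random()
            children.append(child)
        pop = parents + children                 # replace the old generation
    return max(pop, key=fitness)
```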

2 Call/Connection Admission Control

Call admission control (CAC) is a mechanism to determine whether the resources requested by an incoming multimedia call can be reserved without adversely affecting the QoS requirements of the ongoing calls. In ATM multimedia networks, this is tantamount to connection admission control (with the same abbreviation, CAC), which is a decision-making process to accept or reject a request for a new virtual path (VP) or virtual channel (VC) based on the anticipated traffic characteristics of the connection, the requested QoS, and the current state of the network. Traditional CAC schemes make use of various criteria for call/connection admission, such as the equivalent capacity or bandwidth requirements of various links, the maximum allowable cell loss probability, and the network traffic load. To address some deficiencies of these methods, such as failure to meet QoS requirements in heavy traffic conditions and the computational intensiveness of the call parameter estimation methods, a number of CI-based CAC methods have been proposed in the literature. In the sequel, we discuss a few representative methods.

One of the earliest CI-based approaches to multimedia CAC in ATM networks is due to Hiramatsu [21]. In this approach, he uses a three-layer feed-forward neural network (FFNN) with the standard back-propagation learning algorithm to obtain predicted service quality parameters and call acceptance/rejection decision values (such as predicted values for the call arrival rate, cell loss rate, and call rejection rate), with observed multiplexer status parameters such as cell arrival rate, cell loss rate (CLR), call generation rate, trunk utilization rate, and number of connected calls as the FFNN inputs. Simulation results indicate the adaptability of the proposed method in learning complex admission control decision policies. In [22], he addresses the problem of training neural networks with exponentially wide-ranged QoS parameters (e.g., CLR ranging from 10^{-12} to 1) using two methods: (i) training with a relative target, where the neural network is assumed to memorize the logarithm of the average of the K most recently monitored QoS values, and the new target is an updated average derived from a weighted summation of a new sample and the memorized QoS; and (ii) a virtual output buffer method, wherein a neural network is trained to accurately estimate the QoS for the actual buffer by incremental extrapolations using the data from smaller-capacity virtual buffers (a set of counters simulating an imaginary cell buffering process).

Youssef, Habib, and Saadawi [23] propose a neurocomputing approach to CAC and bandwidth allocation in ATM networks. The algorithm proposed by them employs a hierarchical structure of a bank of small-sized parallel neural network (NN) units to calculate efficiently the bandwidth required to support multimedia traffic with multiple QoS requirements. Each NN unit is an FFNN trained using the standard back-propagation algorithm to learn the complex nonlinear function relating different traffic patterns and QoS with the corresponding received capacity. The NN controller also calculates the gain obtained from multiplexing multiple streams of traffic supported on separate virtual paths (i.e., class multiplexing) so as to enhance the statistical multiplexing gain. The authors use simulation results to show that their NN approach is more accurate in bandwidth calculations, and consequently in CAC decision making, compared to two conventional approaches that use the stationary state approximation of the equivalent capacity method [24] and the class related rule [25], respectively.

References [26] and [27] independently propose fuzzy approaches for estimation of the CLR, which is an important CAC parameter. In [26], Uehara and Hirota estimate the possibility distribution of the CLR as a function of the number of calls in different transmission-rate classes. Starting with an initial fuzzy rule base for CLR estimation, successive generations of fuzzy inference rules are generated by incremental updates based on the CLR data observed from time to time. A back-propagation algorithm (unrelated to the neural network algorithm of the same name) is used for tuning the rule base with new data. Using fuzzy α-cut theory [28], self-compensation of CLR estimation errors is achieved; then, by applying the latest rule base, an upper bound on the CLR estimate is obtained and used for CAC decision making. Bensaou et al. [27] propose a robust fuzzy-based algorithm to predict the CLR in ATM multiplexers, and use the CLR estimate so obtained for call admission control. Unlike many traditional approaches, their method does not presume any input traffic model or parameters, but employs the knowledge of a set of CLR values for small values of an independent variable of the CLR function (e.g., multiplexer buffer size or service capacity), in conjunction with the knowledge of the asymptotic behavior of the function for larger values of the variable.

In [29], Ren and Ramamurthy propose a dynamic CAC scheme for ATM multimedia networks that employs fuzzy logic to combine a CAC based on the UPC (usage parameter control) model with one based on measured online traffic statistics for determining the dynamic equivalent bandwidth used in CAC decision making. Simulation results indicate that substantially improved system utilization can be achieved with the dynamic CAC compared to a model-based or a measurement-based CAC.

Liang, Karnik, and Mendel [30] propose an interesting connection admission control algorithm for ATM networks that uses a type-2 fuzzy logic rule base incorporating the knowledge obtained from 30 network experts. In type-2 fuzzy logic, the membership value of each element in the fuzzy set is itself fuzzy. The type-2 fuzzy logic used in their system, in contrast to type-1 fuzzy logic, provides soft decision boundaries, and thereby permits a tradeoff between CLR and bandwidth utilization.

Cheng, Chang, and their coworkers were among the earliest to adopt a hybrid CI approach to call admission control. An IEEE journal article [31] and a US patent document [32] together present the details of their neural fuzzy CAC (NFCAC). The NFCAC takes in three linguistic inputs (available capacity, a congestion indicator, and the cell loss ratio) and outputs a decision signal to accept or reject a new call request. The fuzzy estimates of the available capacity and the congestion indicator are, in turn, produced by a fuzzy bandwidth estimator and a fuzzy congestion controller proposed in their earlier work [33]. The NFCAC is a five-layer feed-forward neural network with a two-phase hybrid learning algorithm. Construction of the fuzzy rules and location of the membership functions are done by a self-organized learning scheme in phase I, whereas optimal adjustment of the membership functions for the desired outputs is done by a supervised learning scheme in phase II. The authors show by means of simulation results that their NFCAC, despite the simplicity of its design, can satisfy the QoS requirements and still achieve higher system utilization and learning speed compared to a traditional effective-bandwidth-based CAC [34], and the fuzzy-logic-based [33] and neural-net-based [35] CACs proposed by them earlier.

In [36], Chatovich, Oktug, and Dundar propose a hierarchical neural-fuzzy connection admission controller for ATM multimedia networks. This CAC is based on Berenji and Khedkar's GARIC (Generalized Approximate Reasoning-based Intelligent Controller) architecture [37], which includes two cooperating neural networks: one called the ASN (Action Selection Network) for implementing fuzzy inference rules initially acquired from an expert, and the other called the AEN (Action Evaluation Network) for performance evaluation and fine tuning of the former by the reinforcement learning approach. The ASN is organized as a hierarchical structure that combines three sub-controllers, one for each of the three system variables (CLR, queue size, and link utilization), and comes up with the final decision by weighted aggregation of the decisions of the three sub-controllers.

In [38], Shen et al. address the problem of bursty wireless multimedia traffic with unpredictable statistical fluctuations in wideband CDMA (code division multiple access) cellular systems, and propose an intelligent CAC (ICAC) that makes call admission decisions based on QoS parameters such as the handoff call drop probability, the outage probabilities of various service types, existing-call interference estimates, the link gain, and the estimate of the equivalent interference of the call request. Estimation of the existing-call interference in ICAC is done by a pipeline recurrent neural net (PRNN), which predicts the mean value of the system interference for the next period as a function of p measured interference powers and q previously predicted interference powers, where p and q are the fuzzy estimator subsystem parameters. For equivalent interference estimation, ICAC uses a fuzzy estimator that takes as input four parameters of the new call: the peak and mean traffic rates, the peak traffic duration, and the outage probability requirement. The fuzzy call admission processor of ICAC uses the two interference estimates provided by the fuzzy estimator and the PRNN, in conjunction with other QoS information, to make a four-fold decision: {Strong Accept, Weak Accept, Weak Reject, Strong Reject}. Simulation results comparing ICAC with two traditional CAC methods, namely PSIR-CAC (predicted signal-to-interference-ratio CAC) and MCAC (multimedia CAC), indicate that ICAC achieves higher system capacity than PSIR-CAC and MCAC by more than 10% in traffic ranges where QoS requirements are guaranteed. The ICAC has been found to fulfill the multiple QoS requirements under all traffic load conditions, whereas conventional CAC schemes fail under heavy traffic load conditions.

Ahn and Ramakrishna [39] propose an interesting Hopfield neural network (HNN) based CAC algorithm for QoS provisioning in wireless multimedia networks. The QoS provisioning problem is formulated as a multi-objective optimization problem that seeks to maximize the twin objectives of resource utilization and fair distribution of resources (among different connections), subject to the constraint that the total allocated bandwidth cannot exceed the available capacity. The authors show that the overall objective function can be cast into the form of the HNN energy function given in (1), so that an HNN with n × m neurons (for an n-connection, m-QoS-level problem) can be set up to minimize the energy function and produce stable and feasible QoS vector values.

In [40], Senouci, Beylot, and Pujolle formulate call admission control as a semi-Markov decision problem (SMDP) and develop a reinforcement learning (neuro-dynamic programming) based algorithm for the construction of a dynamic call admission policy. The algorithm is implemented using both table-lookup and feed-forward neural network approaches for determination of the Q-values (state–action tuples) of their system, based on the number of current calls in two traffic classes and the characteristics of the new call (e.g., handoff or new, class 1 or class 2). The call admission decision (accept or reject) is made using the action value obtained with this approach, and the system is trained using the reward associated with the success of accepted calls. Their neural network based CAC is naturally more memory efficient than the table-lookup implementation. The proposed method yields an optimal solution at much higher speed compared to traditional approaches, which are also difficult to optimize.

For reverse-link transmission in wideband code division multiple access (WCDMA) cellular systems, Ye, Shen, and Mark propose a CAC scheme using fuzzy logic [41]. In their scheme, a fuzzy call admission processor uses estimates of the effective bandwidth and network resources, along with the QoS parameters, as inputs to output a call acceptance or rejection decision. The effective bandwidth, in turn, is estimated by a fuzzy estimator using call request parameters and pilot signal strength information as inputs. The pilot signal strength is also used by a fuzzy estimator to produce a mobility estimate, which is used, in conjunction with the effective bandwidth and the bit-energy to noise-plus-interference-density ratio of the traffic class under consideration, by a fuzzy network resource estimator to produce the network resource estimate required by the fuzzy call admission processor. The authors provide simulation results to compare their approach with two previously proposed traditional CAC schemes, received-power-based call admission control (RPCAC) [42] and non-predictive call admission control (NPCAC) [43], and demonstrate its effectiveness in terms of new and handoff call blocking probabilities, outage probability, and resource utilization.

3 Adaptive Allocation and Management of Resources

Allocation of resources is intimately related to call admission control and QoS management. Hence, in the case of multimedia applications requiring high throughput, it turns out to be a problem of paramount significance that needs to be handled intelligently. Considering the need for continual revision of the bandwidth allocations to different calls in high-traffic situations, Sherif et al. [44] propose a genetic algorithmic approach to adaptive allocation of resources and call admission control in wireless ATM networks. In their scheme, the QoS requirements for each of the video, audio, and data sub-streams of a multimedia call can be specified from a 4-tuple {High, Medium, Low, Stream Dropped}, with the possibility of a total of 64 (4^3) Q-values for the call as a whole. Assuming that the maximum-to-minimum Q-value range for each call is available from the user data profile, they formulate the problem of adaptive allocation (in contrast to the traditional static allocation) of bandwidth for existing calls as an optimization problem to minimize the spare capacity (after call allocation) in the cell without either overshooting the cell capacity or going below the minimum Q-value of any call in the cell. This, being a non-convex optimization problem, is solved using the genetic approach. Simulation results indicate the adaptability of the algorithm to high-traffic situations, graceful degradation of individual user QoS levels with load, and effective and fair utilization of the available bandwidth with an increased number of admitted calls.

Yuang and Tien [45] propose an intelligent multiple access control system (IMACS) with a facility for dynamic bandwidth allocation in wireless ATM multimedia networks. The IMACS consists of a multiple access controller (MACER), a traffic estimator and predictor (TEP), and an intelligent bandwidth allocator (IBA). MACER employs a hybrid-mode TDMA (time division multiple access) scheme with advantageous features of CDMA (code division multiple access) and contention access based on a novel dynamic tree-splitting collision resolution algorithm parameterized by an optimal splitting depth (SD). TEP periodically estimates the key Hurst parameter of the available bit rate (ABR) self-similar traffic by wavelet analysis, and then predicts the mean and variance of subsequent frames using a six-layer neural fuzzy online traffic predictor (NFTP). Based on these predicted values, IBA performs efficient bandwidth allocation by determining the optimal SD, achieving satisfactory SCR (signaling control) blocking probability and ABR throughput requirements while retaining maximal aggregate throughput. The NFTP algorithm achieves speed by learning in parallel the structure of the fuzzy if-then rules as well as the parameters for tuning the coefficients of the rules to the input traffic dynamics.

For hierarchical cellular systems supporting multimedia services, Lo, Chang, and Shung [46] propose a neuro-fuzzy radio resource manager, which essentially contains a neural fuzzy channel allocation processor (NFCAP). The two-layer architecture of the NFCAP includes a fuzzy cell selector (FCS) in the first layer and a neural fuzzy call-admission and rate controller (NFCRC) in the second layer. Using the user mobility, the resource availabilities in both micro and macro cells, and the handoff failure probabilities as input linguistic variables, the FCS comes up with a cell selection decision. The NFCRC then comes up with CAC and rate control decisions using the handoff probability and the resource availability of the selected cell as input variables. The authors establish through simulations that their method enhances system utilization by 31.1% with a marginal 2% increase in handoff rate compared to the overflow channel allocation scheme [47]. Compared to the combined channel allocation [48] and fuzzy channel allocation control [49] schemes proposed by them earlier, the NFCAP is shown to provide 6.3 and 1.4% better system utilization, respectively, and still achieve handoff rate reductions of 14.9 and 6.8%.

For third-generation wireless multimedia networks demanding high throughput and QoS guarantees, Moustafa, Habib, and Naghshineh [50] propose an evolutionary computational model based wireless radio resource manager (RRM) that cooperatively controls both the transmission power and the bit rate of the mobile devices. Adaptive control of these parameters is achieved by the RRM on a continual basis by solving an optimization problem that seeks to maximize a multi-modal objective function expressed as the sum of the total number of users satisfying minimum signal quality requirements (assessed from their bit error rates) and the total reward for better utilization of resources, where the relative reward values for the corresponding users reflect bandwidth utilization beyond their guaranteed minimum levels and frugal use of power. Experimental results indicate that their algorithm helps to reduce infrastructure costs by requiring fewer base stations, because each base station implementing the algorithm provides, on average, 70% more coverage. Other benefits include a significant decrease (40%) in the call blocking rate, longer battery life of the mobile units because of frugal use of power, and minimal interference among the users.


Motivated by the need to address the scarcity of, and large fluctuations in, the availability of link bandwidth through the development of adaptive multimedia services, Fei, Wong, and Leung [51] propose a reinforcement learning approach for QoS provisioning by dynamic adjustment of the bandwidth allocations of individual ongoing calls. In this paper, the CAC and dynamic bandwidth allocation problems are formulated as Markov decision processes (MDPs) and solved using a form of real-time reinforcement learning called Q-learning. In their formulation, whenever an event such as a new or handoff call arrival, a call termination, or a call handoff to a neighboring cell occurs in a cell, an optimal policy, or equivalently an appropriate set of actions (e.g., acceptance/rejection of a new/handoff call, or bandwidth upgrading/downgrading of an ongoing call), is determined by maximization of the expected reward function subject to two QoS constraints: the handoff dropping probability and the average allocated bandwidth for a cell. The Q-learning approach permits efficient handling of the large state space (i.e., the configuration of ongoing calls of different types at a point in time) of the MDP problem without any prior knowledge of the state transition probabilities. Simulation results indicate that this algorithm outperforms some traditional approaches [52, 53] in bandwidth utilization and call drop reduction.
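As a hedged illustration of how such a formulation might be encoded (the state components, action set, and reward values below are assumptions for exposition, not the exact model of [51]), the tabular Q-learning sketch given in Sect. 1.1 could be driven by an environment along these lines:

```python
# State: (ongoing class-1 calls, ongoing class-2 calls, pending event).
# Every name and numeric value here is an illustrative assumption.

def admissible_actions(state):
    _, _, event = state
    if event in ("new_arrival", "handoff_arrival"):
        return ["accept", "reject"]
    return ["upgrade_bw", "downgrade_bw", "no_change"]   # for ongoing calls

def reward(state, action, qos_satisfied):
    _, _, event = state
    if action == "accept":
        if not qos_satisfied:
            return -10.0          # penalize QoS-constraint violations
        return 5.0 if event == "handoff_arrival" else 1.0  # favor handoffs
    return 0.0
```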

4 Multimedia Traffic Management and Congestion Control

Effective management of multimedia traffic is essential for guaranteed QoS. Traffic management entails a number of operations: (i) call admission control (CAC), (ii) traffic policing, (iii) traffic characterization and prediction, (iv) rate/flow control, (v) routing and link capacity assignment, and (vi) congestion control. Since CAC is a topic well addressed in the literature on CI methods for multimedia communications, we devoted a separate section to it at the beginning. For similar reasons, we deal with the traffic routing problem separately, with particular emphasis on multicast routing, in the following section. The remaining topics related to traffic management are considered in this section. For a more comprehensive review of ATM traffic management, one may refer to the survey papers of Douligeris and Develekos [54], and Sekercioglu, Pitsillides, and Vasilakos [55].

Call admission and resource allocation must necessarily be followed by policing of the multimedia network usage and enforcement of proper usage so as to avert congestion and network delays. This process is also called usage parameter control (UPC), because it involves continuous monitoring of the sources for operation within the limitations of their respective CAC parameters negotiated during the call setup phase. Next, traffic characterization and prediction are necessary for both CAC and flow control functions. Proper routing of traffic is essential for link capacity management, and hence congestion control. Finally, congestion control may also be done by rate control using a feedback mechanism. Thus, all the traffic management functions are closely interlinked.

As in the case of CAC, Hiramatsu [56] was also one of the earliest researchers to employ neural networks for integrated ATM traffic control. He proposes three levels of NNs: (i) cell transfer level NNs for call transmission pattern recognition, service class suggestion, and window size control; (ii) call level NNs for admission control of bursty calls and multi-level quality control; and (iii) network level NNs for optimal routing, link capacity assignment, and congestion control. For system efficiency, he proposes a two-phase training for the distributed system of NNs, where separate training of individual NNs in the first phase is followed by the training of the whole system in the second phase. Addressing the link capacity assignment problem, he first estimates the CLR values for various links using a bank of three-layer FFNNs with the call generation rates and logical link capacities for the corresponding links as the FFNN inputs. A neural network in the next higher level of the hierarchy uses these estimated CLR values along with the logical and physical link capacities as inputs, and performs multi-objective optimization by seeking to minimize the maximum CLR value and maximize link utilization. The BP algorithm is used to train the neural network for estimation of the CLR. For objective function optimization by the higher level network, Hiramatsu uses Matyas' random optimization method [57].

In [58], Tarraf, Habib, and Saadawi show how a comprehensive NN solution to the traffic management problem in ATM multimedia networks can be worked out by integrating some of the earlier-proposed NN-based methods for different aspects of traffic management. They divide the traffic management functional module into three submodules that operate at the cell, call, and network levels. The states of the buffers that maintain various types of multimedia information, together with the output of the bandwidth assignment module at the UNIs (user-network interfaces), i.e., the access nodes to the network, provide the information required for processing at the three control modules. The call level control function is implemented as a two-level hierarchy of feed-forward networks, where two first-level neural networks separately compute the service quality parameters (such as call arrival rate, CLR, and call rejection rate) from the declared call parameters and the history of the past observed status, and the statistical parameters of the aggregate link traffic from traffic measurements. The NN CAC at the second level aggregates the information from the first-level neural networks and comes up with a decision to accept or reject a call. The authors propose that the traffic characterization and prediction at the cell level may be done using any of the earlier NN-based methods (e.g., [59, 60]) using the states of the UNI buffers as the NN inputs. The predicted traffic outputs from the NN are then used by another NN for policing, as proposed in [61]. Finally, for the network level traffic control, the authors suggest implementing either of the two earlier-proposed neural network congestion control mechanisms [62, 63].


Pitsillides, Sekercioglu, and Ramamurthy [64] use the peak, minimum, and initial cell rates obtained by monitoring ABR queue lengths, the additive increase rate, and the control interval as inputs to their fuzzy congestion control system for estimation of the flow rates, which are provided as feedback to the traffic sources. Results of simulation experiments comparing their method with the EPRCA (enhanced proportional rate control algorithm) [65] indicate that their algorithm fares better with respect to end-to-end delay, speed of transient response, and network utilization.

Lin et al. [66] propose a genetic algorithm based neural-fuzzy inference network for extraction of the features characterizing the traffic at each node of a binary decision tree that is used in mixed scheduling of ATM traffic, a hybrid of the rate-monotonic and deadline-driven approaches. The authors use simulations to show the effectiveness of the proposed GA-based neural fuzzy network in learning optimal solutions compared to similar networks trained with the BP algorithm.

In [67], Chen et al. present an approach to traffic control that uses a fuzzy ARMAX (autoregressive moving-average model with auxiliary input) process for effective modeling of the nonlinear, time-varying, and time-delayed properties of multimedia traffic. In this model, traffic from controllable sources, such as ABR traffic, represents the fuzzy ARMA component, and the uncontrollable traffic, such as CBR (constant bit rate) and VBR (variable bit rate) traffic, is considered external disturbance. Simulation results indicate that their method for fuzzy adaptive control of traffic flow, using traffic prediction based on this model, is superior to other competitive approaches with respect to cell loss rate and network utilization.

5 Routing and Multicast Routing

As indicated in the previous section, routing is a traffic management issue with impact on QoS at the network level. Multicast routing is a special case of routing of multimedia streams from a source to a number of destinations; it is pivotal to applications such as video conferencing, tele-classrooms, and video on demand. Needless to say, effective multicasting methods are also essential for QoS control.

Park and Lee [68] employ feedback neural networks for multimedia traffic routing. They solve the problem of maximizing the number of connected cross-points in a crossbar switch for a given traffic matrix, subject to the constraint that only one cross-point is permitted to be connected in each row or column of the switch, by casting the criterion function as an energy function that can be minimized by a Hopfield neural network.

Zhang and Leung [69] propose a novel genetic algorithm (GA), called the orthogonal GA, for multimedia multicast routing under delay constraints. In their GA formulation, a chromosome is a multicast tree represented as a binary string of size equal to the cardinality of the network link set. A value of 1 (or 0) in the string indicates the presence (or absence) of the corresponding network link in the multicast tree. As a measure of the quality of a multicast tree, the authors propose a fitness vector with two components: (i) the cumulative path delays in excess of a configured threshold, and (ii) the overall cost of the multicast tree. By lexicographic ordering of the vectors based on their component values, the multicast trees can be arranged in descending order of merit. An important aspect of the GA is an orthogonal crossover and mutation operation to generate j offspring from n parents (the so-called n-j crossover and mutation algorithm). Since the offspring so generated may not necessarily be multicast trees (with connections from the source to the destination nodes), the authors also propose a check-and-repair operation. Simulation results indicate that their orthogonal GA can find near-optimal solutions for practical problem sizes.
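Since Python tuples compare lexicographically, the two-component fitness vector and its ordering can be expressed directly; the helper names below are illustrative, and the delay and cost computations are assumed given.

```python
def fitness_vector(tree_bits, path_delays, delay_bound, link_costs):
    """tree_bits: 0/1 list over network links encoding the multicast tree;
    path_delays: source-to-destination delays along the tree (assumed given)."""
    excess_delay = sum(max(0.0, d - delay_bound) for d in path_delays)
    tree_cost = sum(c for bit, c in zip(tree_bits, link_costs) if bit)
    return (excess_delay, tree_cost)   # smaller is better, delay term first

# Sorting a population by this vector ranks the trees exactly as described:
# population.sort(key=lambda t: fitness_vector(t, delays_of(t), bound, costs))
# (delays_of is a hypothetical helper that extracts path delays from a tree.)
```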

In [70], Mao et al. present a genetic algorithmic approach to multi-path routing of multiple-description (actually, double-description) video in wireless ad hoc networks. In the multiple-description multimedia encoding scheme that has been gaining popularity of late, multiple streams corresponding to multiple equivalent descriptions of the multimedia content generated from a source are transmitted to a destination, which can use any subset of the source streams received to reconstruct the original multimedia content with a quality commensurate with the cardinality of the subset used. The authors show that the multi-path routing problem is a cross-layer optimization problem where the average video distortion, i.e., an application-layer performance metric, may be expressed as a function of network-layer performance metrics such as bandwidth, loss, and path correlation. Their final formulation turns out to be a problem of constrained minimization of the average distortion of the received video, expressed as a function of the individual and joint probabilities of receiving the multiple descriptions and the computable distortions of media reconstruction using the received streams. The constraints for their optimization are loop-free paths and stable links. Due to the exponentially complex nature of the problem, the authors resort to a genetic approach for the solution, considering each candidate path constructed by random traversal from the source to the destination as a chromosome, and the nodes (designated by their numbers and positioned in the same order as on the path) as genes. For the double-description video problem, they use chromosome pairs as candidate solutions, because two paths are required for transmitting the two descriptions. For crossover, two such path pairs are considered, one string from each pair is randomly chosen, and crossover is performed using the first common node in the chosen strings as the crossover point. Mutation on a chromosome (path) pair is similarly done by choosing one of the strings and a node in the string randomly, and reconstructing the partial path from that node to the destination node by using any constructive approach without repeating any of the nodes in the partial path from the start node up to (but not including) the chosen node. The authors provide simulation results to demonstrate the superior performance of their approach (in terms of the average peak signal-to-noise ratio of the reconstructed video) over several other approaches, including the 2-shortest-path [71] and disjoint path-set selection protocol (DPSP) [72] algorithms.
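The first-common-node crossover described for [70] can be sketched as follows; the path representation and the omission of the subsequent check-and-repair pass are simplifying assumptions.

```python
def crossover_paths(path_a, path_b):
    """path_a, path_b: node-number lists from source to destination.
    Swap tails at the first common intermediate node; the offspring may
    still need the loop check-and-repair pass described in the text."""
    index_in_b = {node: j for j, node in enumerate(path_b)}
    for i, node in enumerate(path_a[1:-1], start=1):   # skip source and sink
        if node in index_in_b:
            j = index_in_b[node]
            return path_a[:i] + path_b[j:], path_b[:j] + path_a[i:]
    return path_a, path_b                              # no common node found
```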

In [73], Wang and Pavlou formulate group multicast content delivery in multimedia networks as an integer programming problem to compute a set of bandwidth-constrained Steiner trees [74] with minimum overall bandwidth consumption. The authors propose to represent the set of explicit Steiner trees with shortest-path trees (from the root node of the group to any receiver) through intelligent configuration of a unified set of weights for the network links. Accordingly, in their genetic algorithmic (GA) formulation, each chromosome is represented by a link weight vector with size equal to the number of network links. The fitness of a chromosome is computed as a value inversely proportional to the sum of the overall network load and the excessive bandwidth allocated to overloaded links. A fixed population size of 100 is chosen, and the GA evolution starts with an initial population of random vectors of link weights in the range 1–64. In the crossover operation, the offspring are generated by taking one chromosome from each of the top and bottom fifty (sorted with respect to the fitness values) and then choosing individual genes for the offspring from either parent with a predefined crossover probability. To escape from local minima, two types of mutation (changing the weight of a link to a random value) are used: (i) global mutation of every link with a low mutation probability, and (ii) mutation of congested links. Results of evaluation indicate that the proposed GA approach provides higher service availability with significantly less bandwidth compared to conventional IP (Internet Protocol) approaches.

Neural network solutions to the allied problem of obtaining Steiner (multicast) trees from the network graph may be found in the literature. Gelenbe, Ghanwani, and Srinivasan [75] demonstrate the use of random neural networks for substantial improvement in the quality of the Steiner trees that may be obtained by using the best available heuristics, such as the minimum spanning tree and the average distance heuristics. In [76], Xia, Li, and Yen propose a modified version of the SOFM (self-organizing feature map) [7] for the construction of balanced multicast trees.

6 Multimedia Composition, Encoding, Streaming, and Synchronization

Until now, our focus has been on control issues related to multimedia networking. In this section, we survey the sparse literature on CI methods that address pure communication issues related to multimedia data, such as media streaming, synchronization, and encoding. In the following section, we discuss a few CI-based multimedia services.

A cost-effective (network-bandwidth-efficient) solution for multimedia content delivery in media browsing applications is low-resolution content delivery during navigation. The idea here is to permit the users to easily and quickly preview media sequences at various resolutions and zoom in on the segments of their interest. Doulamis and Doulamis [77] propose an optimal content-based media decomposition (composition) scheme for such interactive navigation. Though their proposal is for video navigation, the scheme can be extended to the generic multimedia case. In their scheme, a set of representative shots is initially extracted from the sequence to form a coarse description of the visual content. The remaining shots are classified into one or the other of the representative shot classes. The content of each shot is similarly decomposed into representative frames (frame classes), characterized by global descriptors such as color, texture, and motion parameters, and object (region) descriptors such as size and location. The other objects in the shot are classified into one or the other of the frame classes. The video decomposition problem is then posed as a problem of optimally selecting the representative shots (frames) so as to minimize the aggregate correlation measure among the shots (frames) of the same class. In view of the exponential complexity of the search for the optimal solution, the authors use a genetic search with a binary string representation denoting the presence or absence of the shots (frames) of the sequence (shot) in the representative classes. The scheme is shown to offer a significant reduction (85 to 1) in the transmitted information compared to traditional sequential video scanning.

In [78], Su and Wah propose an NN-based approach to the compensation of compression losses in multi-description video streaming. To facilitate real-time playback, a three-layer FFNN in their system is trained in advance by the BPL algorithm, using pixels from deinterleaved and decompressed frames as the FFNN inputs and those taken from the original frames (before compression) as the desired outputs.

Bojkovic and Milovanovic [79] propose a motion-compensated discrete cosine transform (MC-DCT) based multimedia coding scheme that optimally allocates more bits to regions of interest (ROIs) compared to non-ROI image areas. Identification of the ROIs is done by a two-layer neural network, with the FFNN in the first layer generating the segmentation mask using the features extracted from each image block, and the FFNN in the second layer improving the obtained segmentation by exploiting object continuity in the segmentation mask provided by the first network and additional features. The authors indicate that their approach achieves better visual quality of images along with signal-to-noise ratio improvements compared to the standard MPEG (Moving Picture Experts Group) MC-DCT encoders.

Automatic quantitative assessment of the quality of video streams over packet networks is an important problem in multimedia engineering. Packet video quality assessment is a difficult problem that depends on a number of parameters such as the source bit rate, the encoded frame type, the frame rate at the source, the packet loss rate in the network, etc. A method for such an assessment, however, facilitates the development of control mechanisms to deliver the best possible video quality given the current network situation. In [80], Mohamed and Rubino propose the use of Gelenbe's random neural network (RNN) model [6], and show that the results obtained using RNNs correlate well with the results of subjective analysis by the human perceptual system. Cramer, Gelenbe, and Gelenbe use an RNN-based scheme for video compression [81] and indicate that it is several times faster than H.261 and MPEG based compression schemes.

Addressing the problem of integrating user preferences with network QoS parameters for streaming of multimedia content, Ghinea and Magoulas [82] suggest the use of an adaptable protocol stack that can be dynamically configured with micro-protocols for the various micro-operations involved in streaming, such as sequence/flow control, acknowledgement, and error checking/correction coding. They then formulate the protocol configuration as a multi-criteria decision making problem (to determine streaming priorities) that is solved by a fuzzy programming approach to resolve any inconsistencies between the user and the network considerations.

Synchronized presentation of multimedia is an important aspect of nearly all multimedia applications. In [83], Zhou and Murata adopt a CI approach to the media synchronization problem, and propose a fuzzy timing Petri net model (FTPNM) for handling uncertain or fuzzy temporal requirements, such as imprecisely known or unspecified durations. The model facilitates both intra-stream and inter-stream synchronization with guaranteed satisfaction of QoS requirements, such as maximum tolerable jitters (temporal deviations) of individual streams and maximum tolerable skew between media, by optimal placement of synchronization points using the possibility distributions of the QoS parameters obtained from the model.

Considering the need for lightweight synchronization protocols that can readily adapt to the non-stationary workload of the browsing process and changing network configurations, Ali, Ghafoor, and Lee [84] propose a neuro-fuzzy framework for media synchronization on the web. In their scheme, each multimedia object (i.e., video, voice, etc.) is segmented into atomic units of synchronization (AUS). With this, media synchronization turns out to be a problem of appropriately scheduling the AUS dispatches by web servers. The authors observe that this, in turn, is a multi-criteria scheduling problem with objectives to: (i) minimize tardy (deadline missing) AUSs, (ii) complete the transmission of bigger AUSs as close to their deadlines as possible, and (iii) minimize dropping of AUSs in the event of severe resource constraints. This problem being NP-hard, the solution is approached through a neuro-fuzzy scheduler (NFS) that makes an intelligent compromise among the multiple objectives by properly combining a number of heuristic scheduling algorithms proposed by the authors. The NFS comes up with scheduling decisions taking the workload parameters and system state parameters as inputs. A two-phase learning scheme is used in the NFS, with self-organized learning in phase I to construct the presence of rules and locate initial membership functions for the rules, and supervised learning in phase II to optimally adjust the membership functions for the desired outputs. Simulation studies for a comparative assessment of the proposed method against several known heuristic methods and a branch and bound algorithm demonstrate superior adaptability of the method under varying workload conditions.

One of the rare applications of rough set theory (RST) to multimedia is due to Jeon, Kim, and Jeong [85]. The authors propose a novel method (with attribute reduction by application of RST) for video deinterlacing. In their method, they estimate the missing pixels by employing, on a pixel-by-pixel basis, one of the following four earlier proposed deinterlacing methods: BOB [86], WEAVE [86], STELA [87], and FDOI [87]. Their deinterlacing approach uses four parameters: TD, SD, TMDW and SMDW. The first two parameters refer to the temporal and spatial differences, respectively, between two pixels across the missing pixel, and the remaining two refer to the temporal and spatial entropy parameters described in [87]. Using six video sequences as the training data, the authors first construct a fuzzy decision table that maps each set of the fuzzy values (small, medium, and high) of the attributes derived from the above parameters onto a decision on the choice of an algorithm from the four mentioned above. RST is then used for finding the core attributes and eliminating superfluous attributes by constructing the upper and lower approximation sets of the target algorithms for the subsets of the attributes. With experimentation on a different set of six standard video sequences, the authors establish the superior performance of their method over a number of methods presented in the literature.
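To illustrate the mechanism of such a fuzzy decision table, consider the following minimal Python sketch, which maps fuzzified attribute values onto the choice of a deinterlacer. The thresholds and table entries here are invented placeholders; the actual rules in [85] are learned from the training sequences and then reduced with RST.

```python
def fuzzify(value, low, high):
    """Map a raw attribute value onto the fuzzy labels small/medium/high
    (the thresholds are illustrative placeholders)."""
    if value < low:
        return "small"
    return "medium" if value < high else "high"

# Hypothetical excerpt of a decision table mapping fuzzified
# (TD, SD, TMDW, SMDW) tuples to one of the four deinterlacers;
# the rules actually learned in [85] differ.
DECISION_TABLE = {
    ("small", "small", "small", "small"): "WEAVE",  # static region
    ("high",  "small", "high",  "small"): "BOB",    # strong temporal change
    ("small", "high",  "small", "high"):  "STELA",
}

def choose_method(td, sd, tmdw, smdw, thresholds):
    """Select a deinterlacing method for one missing pixel."""
    key = tuple(fuzzify(v, *thresholds[name])
                for name, v in (("TD", td), ("SD", sd),
                                ("TMDW", tmdw), ("SMDW", smdw)))
    return DECISION_TABLE.get(key, "FDOI")  # fall-back method

print(choose_method(8.0, 1.0, 9.0, 1.5,
                    thresholds={k: (2.0, 6.0) for k in
                                ("TD", "SD", "TMDW", "SMDW")}))
```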

7 Multimedia Services and Servers

Prediction of the user mobility profile in wireless multimedia communication systems is an essential support service for effective resource utilization under QoS constraints. Shen, Mark, and Ye [88] propose an adaptive fuzzy inference approach to user mobility prediction in a wireless network operating in Frequency Division Duplex (FDD) mode with the DS/CDMA (direct-sequence CDMA) protocol. The essential components of their system are a fuzzy estimator (FE) and a recursive least squares (RLS) predictor. The FE takes as input the strength of the pilot signal from the mobile user, and predicts the probability of the user being in a particular cell using a rule base that takes into account imprecision in measurements and shadow effects. The RLS predictor then improves upon the estimate obtained from the FE using the previous few values of the mobile position.
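As a rough illustration of the prediction stage, the Python sketch below implements a generic recursive least squares update of the kind described. It is our own minimal formulation, not the predictor of [88]; the class name, filter order, and forgetting factor are placeholder choices.

```python
import numpy as np

class RLSPredictor:
    """Generic RLS predictor: estimates the next scalar position
    from the previous `order` observed values."""
    def __init__(self, order=4, lam=0.98):
        self.lam = lam                    # forgetting factor
        self.theta = np.zeros(order)      # filter coefficients
        self.P = np.eye(order) * 1e3      # inverse correlation matrix

    def update(self, history, actual):
        phi = np.asarray(history, dtype=float)              # last positions
        k = self.P @ phi / (self.lam + phi @ self.P @ phi)  # gain vector
        err = actual - self.theta @ phi                     # a-priori error
        self.theta += k * err                               # coefficient update
        self.P = (self.P - np.outer(k, phi @ self.P)) / self.lam

    def predict(self, history):
        return self.theta @ np.asarray(history, dtype=float)

# Hypothetical usage: predict the next 1D position from the last four
rls = RLSPredictor(order=4)
track = [0.0, 1.0, 2.1, 3.1, 4.2, 5.1, 6.3]
for i in range(4, len(track)):
    rls.update(track[i - 4:i], track[i])
print(rls.predict(track[-4:]))
```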

Addressing the problem of placing multiple copies of videos on different servers in a multi-server video on demand system, Tang et al. [89] propose a hybrid genetic algorithm to determine the optimal video placement that minimizes batching interval and server capacity usage while satisfying pre-specified requirements on blocking probability and individual server capacities. A chromosome in their GA formulation is an integer string with length equal to the number of videos, where each integer represents the number of copies of a particular video. They use a fitness vector with two components: (i) the blocking probability, computed as $\sum_j q_j B_{q_j}$, where $q_j$ is the portion of the effective traffic allocated to server $j$, and $B_{q_j}$ is the blocking probability for server $j$, computable using the Erlang B formula [90] given the number of multicast streams $j$ is handling; and (ii) the total capacity usage. For ranking the chromosomes in a population, a multi-objective Pareto ranking scheme [91] is used. Offspring in the GA are generated by multi-point crossovers and mutation. The exact size of the population used in their experimentation is not explicitly stated in the paper. The experimental results indicate that the proposed algorithm converges to the best value on blocking probability at around 1,000 generations, and the best value on server capacity usage in less than 4,000 generations.
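To make the first fitness component concrete, the following Python sketch (our own illustration, not code from [89]; the function names and the three-server configuration are hypothetical) evaluates $\sum_j q_j B_{q_j}$ using the standard recursive form of the Erlang B formula.

```python
def erlang_b(traffic: float, servers: int) -> float:
    """Erlang B blocking probability via the standard recursion
    B(E, 0) = 1,  B(E, k) = E*B(E, k-1) / (k + E*B(E, k-1))."""
    b = 1.0
    for k in range(1, servers + 1):
        b = traffic * b / (k + traffic * b)
    return b

def aggregate_blocking(q, traffic, streams):
    """First fitness component: sum_j q_j * B_{q_j}, where q[j] is the
    share of effective traffic on server j and B is evaluated for the
    traffic and multicast stream count of server j."""
    return sum(q[j] * erlang_b(traffic[j], streams[j])
               for j in range(len(q)))

# Hypothetical three-server configuration
print(aggregate_blocking(q=[0.5, 0.3, 0.2],
                         traffic=[20.0, 12.0, 8.0],
                         streams=[25, 15, 10]))
```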

In [92], Ali, Lee, and Ghafoor propose a design of a multimedia web server using a neuro-fuzzy framework. The crux of their design is a neuro-fuzzy scheduler (NFS) for synchronized delivery of multimedia documents. In the previous section, an overview of this scheduler has already been presented in the context of the media synchronization problem, based on a more comprehensive journal publication by the same authors [84].

8 Challenges and Future Directions

Even though many CI-based approaches are being proposed for various applications in multimedia networking and communications, their impact is mostly confined to academic circles. These methods are yet to find wide acceptance in industrial circles (possibly except in Japan) and to be incorporated in industrial products. This trend is also evident from the very small number of industrial patents in this direction. Hence, the main challenge for CI researchers is to provide industry leaders with a convincing demonstration of the superiority of their approaches over the traditional methods. Another challenge is to develop methods compatible with existing standards, and new standards that facilitate CI-based implementations. Furthermore, since the success of fuzzy methods depends upon the compilation of a good knowledge base, gathering rules of inference from experts remains a challenge for fuzzy system designers. Similarly, the development of new types of neural networks, their training algorithms, novel GAs, preferably with parameters (e.g., population size, crossover and mutation rates) self-configurable by means of problem heuristics, and hybrid CI methods well suited to the application problem at hand is a standing challenge for CI researchers.

Most of the current literature on CI based methods for multimedia communications addresses the ATM network issues. A few papers deal with wireless multimedia. With the current trend towards IP based multimedia communications in both wired-line and 3GB (third generation and beyond) mobile wireless networks, there is a need to develop CI-based methods for IP-based network communications. Further, as is obvious from the relatively much smaller coverage on multimedia communication aspects compared to that on network control aspects in the current article, pure communication issues in multimedia and mobile multimedia are not that well addressed by the CI methods in the current literature. The same applies to multimedia services and on-demand services. New on-demand services may be designed by employing either new or existing CI based methods. Hence, exploration of CI methods for new services and communication methods will be a fruitful direction of research in the future. Specific problems that have already been identified by the editors of this volume in this context are: (i) multimedia semantic characteristics in wireless, mobile, and ubiquitous environments, (ii) extraction and usage of semantic information in wireless and mobile environments, (iii) multimedia retrieval in wireless and mobile networks, (iv) P2P multimedia streaming in wireless and mobile networks, and (v) performance evaluation of mobile multimedia services. Finally, from the perspective of the specific CI approaches that need to be applied, explorations into possible applications of rough sets, and hybrids of neural, rough set, and fuzzy approaches to multimedia could lead to new and interesting avenues of research.

References

1. Rumelhart D E, McClelland J L (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press, Massachusetts
2. Almeida L B (1987) Proceedings of the IEEE First International Conference on Neural Networks 11:609–618
3. Pineda F J (1987) Phys Rev Lett 19:2229–2232
4. Hopfield J J (1982) Proceedings of the National Academy of Sciences of the USA 79(8):2554–2558
5. Binder K (2001) Ising Model. SpringerLink Encyclopaedia of Mathematics, Springer
6. Gelenbe E (1989) Neural Computing 1:502–511
7. Kohonen T (1997) Self-Organizing Maps, 2nd Edition. Springer Verlag, Berlin Heidelberg New York
8. Sutton R S, Barto A G (1998) Reinforcement Learning: An Introduction. MIT Press, Massachusetts
9. Puterman M L (1994) Markov Decision Processes. Wiley, New York
10. Zadeh L A (1965) Information and Control 8:338–353
11. Mamdani E H (1974) Proceedings of the Institution of Electrical Engineers 121(12):1585–1588
12. Pawlak Z (1991) Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer, Dordrecht
13. Dubois D (1990) International Journal of General Systems 17:191–209
14. Stefanowski J (1998) On rough set based approaches to induction of decision rules. In: Polkowski L, Skowron A (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications: 500–529, Physica-Verlag, Heidelberg
15. Ziarko W (1998) Rough sets as a methodology for data mining. In: Polkowski L, Skowron A (eds.) Rough Sets in Knowledge Discovery 1: Methodology and Applications: 554–576, Physica-Verlag, Heidelberg
16. Fogel L J, Owens A J, Walsh M J (1966) Artificial Intelligence Through Simulated Evolution. Wiley, New York
17. Holland J H (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor
18. Rechenberg I (1973) Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog Verlag, Stuttgart
19. Schwefel H -P (1981) Numerical Optimization of Computer Models. John Wiley and Sons, New York
20. Koza J R (1992) Genetic Programming: On the Programming of Computers by Means of Natural Evolution. MIT Press, Massachusetts
21. Hiramatsu A (1990) IEEE Transactions on Neural Networks 1(1):122–130
22. Hiramatsu A (1995) IEEE Communications Magazine 33(10):58, 63–67
23. Youssef S A, Habib I W, Saadawi T N (1997) IEEE Journal on Selected Areas in Communication (Special Issue on Computational and Intelligent Communication) 15(2):191–199
24. Guerin R, Ahmadi H, Naghshineh M (1991) IEEE Journal on Selected Areas in Communication 9(7):968–981
25. Vakil F (1993) Proceedings of the IEEE GLOBECOM Conference 1993(1):406–416
26. Uehara K, Hirota K (1997) IEEE Journal on Selected Areas in Communication (Special Issue on Computational and Intelligent Communication) 15(2):179–190
27. Bensaou B, Lam S T C, Chu H -W, Tsang D H K (1997) IEEE/ACM Transactions on Networking 5(4):572–584
28. Klir G J, Yuan B (1995) Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, New York
29. Ren Q, Ramamurthy G (2000) IEEE Journal on Selected Areas in Communication 18(2):184–196
30. Liang Q, Karnik N N, Mendel J M (2000) IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews 30(3):329–339
31. Cheng R -G, Chang C -J, Lin L -F (1999) IEEE/ACM Transactions on Networking 7(1):111–121
32. Chang C -J, Cheng R -G, Lu K -R, Lee H -Y (2000) Neural Fuzzy Connection Admission Controller and Method in a Node of an Asynchronous Transfer Mode (ATM) Network. US Patent# 6067287
33. Cheng R -G, Chang C -J (1996) IEEE/ACM Transactions on Networking 4(3):460–469
34. Kesidis G, Walrand J, Chang C -S (1993) IEEE/ACM Transactions on Networking 1(4):424–428
35. Cheng R -G, Chang C -J (1997) Proceedings of IEE Communications 144(2):93–98
36. Chatovich A, Oktug S, Dundar G (2001) Computer Communications 24:1031–1044
37. Berenji H R, Khedkar P (1992) IEEE Transactions on Neural Networks 3(5):724–740
38. Shen S, Chung-Ju C, ChingYao H, Qi B (2004) IEEE Transactions on Wireless Communications 3(5):1810–1821
39. Ahn C W, Ramakrishna R S (2004) IEEE Transactions on Vehicular Technology 53(1):106–117
40. Senouci S -M, Beylot A -L, Pujolle G (2004) International Journal on Network Management 14:89–103
41. Ye J, Shen X(S), Mark J W (2005) IEEE Transactions on Mobile Computing 4(2):129–141
42. Huang C Y, Yates R D (1996) Proceedings of the IEEE Vehicular Technology Conference '96:1665–1669
43. Sun S, Krzymien W A (1998) Proceedings of the IEEE Vehicular Technology Conference '98:218–223
44. Sherif M R, Habib I W, Nagshineh M, Kermani P (2000) IEEE Journal on Selected Areas in Communications 18(2):268–282
45. Yuang M C, Tien P L (2000) IEEE Journal on Selected Areas in Communications 18(9):1658–1669
46. Lo K -R, Chang C -J, Shung C B (2003) IEEE Transactions on Vehicular Technology 52(5):1196–1206
47. Rappaport S S, Hu L R (1994) Proceedings of the IEEE 82(9):1383–1397
48. Lo K -R, Chang C -J, Chang C, Shung C B (1998) Computer Communications 21(13):1143–1152
49. Lo K -R, Chang C -J, Chang C, Shung C B (2000) IEEE Transactions on Vehicular Technology 49(5):1588–1598
50. Moustafa M, Habib I, Naghshineh M N (2004) IEEE Transactions on Wireless Communications 3(6):2385–2395
51. Fei Y, Wong V W S, Leung V C M (2006) Mobile Networks and Applications 11:101–110
52. Hong D, Rappaport S S (1986) IEEE Transactions on Vehicular Technology 35(3):77–92
53. Talukdar A K, Badrinath B R, Acharya A (1998) Proceedings of ACM/IEEE MobiCom '98:169–180
54. Douligeris C, Develekos G (1997) IEEE Communications Magazine 35(5):154–162
55. Sekercioglu A, Pitsillides A, Vasilakos A (2001) Soft Computing Journal 5(4):257–263
56. Hiramatsu A (1991) IEEE Journal on Selected Areas in Communications 9(7):1131–1138
57. Matyas J (1965) Automation and Remote Control 26:246–253
58. Tarraf A A, Habib I W, Saadawi T N (1995) IEEE Communications Magazine 33(10):76–82
59. Tarraf A A, Habib I W, Saadawi T N (1993) Proceedings of the IEEE GLOBECOM '93(2):996–1000
60. Neves J E, de Almeida L B, Leitao M J (1994) Proceedings of the IEEE ICC '94(2):769–773
61. Tarraf A A, Habib I W, Saadawi T N (1994) IEEE Journal on Selected Areas in Communications 12(6):1088–1096
62. Tarraf A A, Habib I W, Saadawi T N (1995) Proceedings of the IEEE ICC '95(1):206–210
63. Liu Y, Douligeris C (1995) Proceedings of the IEEE GLOBECOM '95(1):291–295
64. Pitsillides A, Sekercioglu Y A, Ramamurthy G (1997) IEEE Journal on Selected Areas in Communications 15(2):209–225
65. Roberts L (1994) Enhanced PRCA (proportional rate-control algorithm). Technical Report AF-TM 94-0735R1
66. Lin C -T, Chung I -F, Pu H -C, Lee T -H, Chang J -Y (2003) IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 32(6):832–845
67. Chen B -S, Yang Y -S, Lee B -K, Lee T -H (2003) IEEE Transactions on Fuzzy Systems 11(4):568–581
68. Park Y -K, Lee G (1995) IEEE Communications Magazine 33(10):68–74
69. Zhang Q, Leung Y -W (1999) IEEE Transactions on Evolutionary Computation 3(1):53–62
70. Mao S, Hou Y T, Cheng X, Sherali H D, Midkiff S F, Zhang Y -Q (2006) IEEE Transactions on Multimedia 8(5):1063–1074
71. Eppstein D (1999) SIAM Journal of Computing 28(2):652–673
72. Papadimitratos P, Haas Z, Sirer E (2002) Proceedings of ACM Mobihoc:1–11
73. Wang N, Pavlou G (2007) IEEE Transactions on Multimedia 9(3):619–628
74. Hwang F K, Richards D S, Winter P (1992) The Steiner Tree Problem. Elsevier, North-Holland
75. Gelenbe E, Ghanwani A, Srinivasan V (1997) IEEE Journal on Selected Areas in Communications 15(2):147–155
76. Xia Z, Li P, Yen I -L (2004) Proceedings of the 18th International Parallel and Distributed Processing Symposium:54–63
77. Doulamis A D, Doulamis N D (2004) IEEE Transactions on Circuits and Systems for Video Technology 14(6):757–775
78. Su X, Wah B W (2001) IEEE Transactions on Multimedia 3(1):123–131
79. Bojkovic Z, Milovanovic D (2004) Seventh Seminar on Neural Network Applications in Electrical Engineering:67–71
80. Mohamed S, Rubino G (2002) IEEE Transactions on Circuits and Systems for Video Technology 12(12):1071–1083
81. Cramer C, Gelenbe E, Gelenbe P (1998) IEEE Potentials 17(1):29–33
82. Ghinea G, Magoulas G D (2005) IEEE Transactions on Multimedia 7(6):1047–1053
83. Zhou Y, Murata T (1998) IEEE International Conference on Systems, Man, and Cybernetics '98(1):244–249
84. Ali Z, Ghafoor A, Lee C S G (2000) IEEE Journal on Selected Areas in Communications 18(2):168–183
85. Jeon G, Kim D, Jeong J (2006) IEEE Transactions on Consumer Electronics 52(4):1348–1355
86. Haan G D, Bellers E B (1998) Proceedings of the IEEE 86(9):1839–1857
87. Jeon G, Jeong J (2006) IEEE Transactions on Consumer Electronics 52(3):1013–1020
88. Shen X, Mark J W, Ye J (2000) Wireless Networks 6:363–374
89. Tang W K S, Wong E W M, Chan S, Ko K -T (2004) IEEE Transactions on Broadcasting 50(1):16–25
90. Bertsekas D, Gallager R (1992) Data Networks. Prentice-Hall, New York, p. 179
91. Fonseca C M, Fleming P J (1998) IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans 28(1):26–37
92. Ali Z, Lee C S G, Ghafoor A (2000) IEEE International Conference on Fuzzy Systems 9(1):510–515


Part II

Computational Intelligence in 3D Multimedia Virtual Environment and Video Games


A Synthetic 3D Multimedia Environment

Ronald Genswaider1, Helmut Berger1, Michael Dittenbach1, Andreas Pesenhofer1, Dieter Merkl2, Andreas Rauber1,2, and Thomas Lidy2

1 E-Commerce Competence Center – EC3, iSpaces Research Group, Donau-City-Strasse 1, A-1220 Wien, Austria
[email protected], [email protected], [email protected], [email protected]

2 Department of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstrasse 9-11/188, A-1040 Vienna, Austria
[email protected], [email protected], [email protected]

Summary. In this chapter we present The MediaSquare, a synthetic 3D multimedia environment we are currently developing. The MediaSquare enables users, impersonated as avatars, to browse and experience multimedia content by literally walking through it. Users may engage in conversations with other users, exchange experiences, and collectively explore and enjoy the featured content. The combination of algorithms from the area of artificial intelligence with state-of-the-art 3D virtual environments creates an intuitive interface that provides access to manually as well as automatically structured multimedia data while allowing users to take advantage of spatial metaphors.

1 Introduction

Millions of users interact, collaborate, socialize and form relationships with each other through avatars in online environments such as Massively Multi-User Online Role-Playing Games (MMORPGs) [4, 39, 40]. While the predominant motivation to participate in MMORPGs is still "playing", an increasing number of users is spending a significant amount of time in 3D virtual worlds without following a predefined quest. Generating, publishing and, most importantly, experiencing content in 3D virtual spaces is an emerging trend on the Internet, with Second Life1 being the most prominent representative at the time of writing. On the one hand, such 3D virtual worlds address the aspect of social interaction by providing instruments to interact and to exchange experiences with other users that go beyond the possibilities of conventional text-based chat rooms. Especially one's inherent presence in space and the awareness of others facilitate the initiation of social contacts. On the other hand, using 3D virtual worlds has the advantage of communicating via commonly accepted spatial metaphors [13]. Similarity of objects can be expressed by spatial relations, i.e. the more similar two objects are, the closer they are placed together. Furthermore, users can interpret each other's interests by how close they are to one another and to the objects in space. Having a common point of reference and orientation within the virtual space, as well as being aware that other users can see one's actions and objects in the same way, are important features regarding communication between users about particular locations. Consequently, users are supported in building a mental model of the information space, in understanding its characteristics, and in grasping which information is present and how the respective items relate to each other.

1 http://secondlife.com.

Fig. 1. The MediaSquare implements the following scenarios: S1, 3D Music Showroom; S2, 3D Image and Video Showroom; S3, 3D Scientific Library

The MediaSquare, a synthetic 3D multimedia environment, takes advantage of these spatial metaphors and allows users to explore multimedia information that is structured and organized within space (cf. Fig. 1). The information is either organized based on the actual content or by transforming a branch of a directory into architectural structures. Currently, The MediaSquare implements the following scenarios. The first scenario, S1, is a 3D Music Showroom that enables users to browse and listen to songs within the virtual environment. To this end, acoustic characteristics are extracted from music tracks by applying methods from digital signal processing and psycho-acoustics. The features describe the stylistic facets of the music, e.g. beat, presence of voice, timbre, etc., and are used as features for the training of a self-organizing map that arranges similar music tracks in spatially adjacent regions. More precisely, the self-organizing map is an unsupervised neural network model that provides a topology-preserving mapping from a high-dimensional input space onto a 2D output space [18]. A second scenario, S2, aims at the implementation of a 3D Video and Image Showroom that allows users to experience content such as images or videos. To this end, characteristic features are extracted from the respective images or videos. The training of a self-organizing map is based on these features and, in analogy to the first scenario, the resulting 2D map identifies the actual position of each particular image or video source within the 3D Video and Image Showroom. This particular scenario will be fully integrated in the final version of The MediaSquare. In the third scenario, S3, a 3D Scientific Library has been implemented. This library enables users to explore scientific documents such as posters or papers in this immersive 3D environment. On the one hand, a directory structure is used to create a room layout in which the content is presented. On the other hand, characteristic text features are extracted from documents and are used for the training of a self-organizing map. Again, the resulting 2D map defines the actual position of each document in the 3D representation.

In a nutshell, the main contribution of The MediaSquare is the realization of an impressive showcase for combining state-of-the-art multimedia feature extraction approaches and unsupervised neural networks, assembled in an immersive 3D multimedia content presentation environment.

The remainder of this chapter is organized as follows. In Sect. 2, an overview of document clustering and digital libraries as well as music information retrieval approaches and applications of 3D virtual environments is given. Self-organizing maps and the required feature extraction techniques are outlined in Sect. 3. The system architecture of The MediaSquare is detailed in Sect. 4, followed by a description of the actual showcase in Sect. 5. Finally, in Sect. 6, some conclusions are given and an outlook on further research and development activities is provided.

2 Related Work

The design of user interfaces allowing the user to understand the contents of a document archive as well as the results of a query plays a key role in many digital library projects and has produced a number of different approaches [1–3, 14]. However, most designs rely on the existence of a descriptive title of a document to allow the user to understand the contents of the library, or use manual assignment of keywords to describe the topics of the collection as used in the WEBSOM project [15], where units were labeled with the newsgroup that a majority of articles on a specific node came from. The LabelSOM method allows to automatically label the various areas of the library map with keywords describing the topical sections based on the training results. This provides the user with a clear overview of the contents of a SOM library map, similar to the maps provided at the entrance of conventional libraries [26].


The necessity to visualize information and the results of searches in digital libraries has gained interest. A number of visualization techniques for information retrieval and information representation purposes was developed at Xerox PARC as part of the Information Visualization Project [33]. Information is presented in a 3D space with the focus laid on the amount of information being visible at one time and an easily understandable way of moving through large information spaces. As one of the first examples of metaphor graphics for digital library visualization, the Bookhouse project [29] may be mentioned, where the concept of a digital library is visualized using the representation of a library building with several rooms containing various sub-collections and icons representing a variety of search strategies. At the CNAM library, a virtual reality system was designed for the visualization of the antiquarian Sartiaux Collection [8, 9], where the binding of each book is being scanned and mapped into a virtual 3D library to allow the user to experience the library as realistically as possible. The Intelligent Digital Library [7] integrates a web-based visual environment for improving user-library interaction. Another graphical, web-based tool for document classification visualization is presented in [21]. While these methods address one or the other aspect of document, library and information space visualization, none of these provides the wealth of information presented by a physical object in a library, be it a hardcover book, a paperback or a video tape, with all the information that can be intuitively told from its very looks. Furthermore, many of the approaches described above require special purpose hardware, limiting their applicability as interfaces to digital libraries. The libViewer provides a flexible way of visualizing information on the documents in a digital library by representing metadata in an intuitive way [31].

In the context of digital libraries, only a few projects report on the application of collaborative multiuser 3D environments. Christoffel and Schmitt developed a virtual representation of the University Library Karlsruhe employing the game engine used for the realization of Quake II [6]. In order to provide users with a familiar environment, the 3D representation of the library was modeled very similarly to the real world counterpart. Especially young people were attracted by this environment since they seemed to follow their "play-instinct".

When shifting focus towards music, we witness the establishment of large music libraries, supported by the emergence of powerful compression algorithms along with huge storage capabilities. These large music libraries require sophisticated search functionality that goes beyond simple metadata matching on, for example, artist, title, and genre. The query-by-humming approach introduced in the mid 1990s allows users to query songs by singing or humming melodies [12, 25]. Today, this technique has reached a mature state and has been implemented in the commercial online archive midomi.com.2

2 http://www.midomi.com.


Other algorithms addressing melodic structures are regular expression-style queries [10] or query-by-example techniques to find cover versions of a music track [37].

Other applications allow users to explore areas of related music instead of querying titles they already know. Torrens proposed three different visual representations for private music collections, using genres to create sub-sections and the date of the tracks for sorting them [36]. Other works analyze the sound data for characteristic features and use a SOM to represent acoustically similar tracks. The PlaySOM and the PocketSOMPlayer [28], designed for small devices such as palmtops and mobile phones, allow users to generate playlists by marking areas on a map of music. Knees transformed the landscape into a 3D view and enriched the units of the SOM by images related to the music found on the Internet [16]. Besides SonicSOM, which follows the former principle, Lubbers proposed SonicRadar, a graphical interface comparable to a radar screen. The center of this screen is the actual view-point of the listener [22]. By turning around, users can hear multiple neighboring music titles; panning and loudness of the sounds describe their position relative to the user. Tzanetakis and Cook introduced Marsyas3D, an audio browser and editor for collaborative work on large sound collections [38]. A large-scale multiuser screen offers several 2D as well as 3D interfaces to browse for sound files which are grouped by different sound characteristics. The MUSICtable provides a collaborative interface on a tabletop display that invites all participants to select music tracks in a playful manner [35].

3 Spatial Content Organization

3.1 Self-Organizing Map

The self-organizing map (SOM) is a general unsupervised tool for the ordering of high-dimensional data in such a way that similar instances are grouped spatially close to one another [17, 18]. The model consists of a number of neural processing elements, i.e. units. These units are arranged according to some topology, where the most common choice is a 2D grid. Each of the units $i$ is assigned an $n$-dimensional weight vector $m_i$, $m_i \in \mathbb{R}^n$. It is important to note that the weight vectors have the same dimensionality as the instances.

The training process of self-organizing maps may be described in terms of instance presentation and weight vector adaptation. Each training iteration $t$ starts with the random selection of one instance $x$, $x \in X$ and $X \subseteq \mathbb{R}^n$. This instance is presented to the self-organizing map and each unit determines its activation. Usually, the Euclidean distance between the weight vector and the instance is used to calculate a unit's activation. In this particular case, the unit with the lowest activation is referred to as the winner, $c$. Finally, the weight vector of the winner as well as the weight vectors of selected units in the vicinity of the winner are adapted. This adaptation is implemented as a gradual reduction of the difference between corresponding components of the instance and the weight vector, as shown in (1). Note that we use a discrete-time notation with $t$ denoting the current training iteration.

$$m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot \left[ x(t) - m_i(t) \right]. \tag{1}$$

The weight vectors of the adapted units are moved slightly towards the instance. The amount of weight vector movement is guided by the learning rate, $\alpha$, which decreases over time. The number of units that are affected by adaptation, as well as the strength of adaptation depending on a unit's distance from the winner, is determined by the neighborhood function, $h_{ci}$. This number of units also decreases over time such that towards the end of the training process only the winner is adapted. The neighborhood function is unimodal, symmetric and monotonically decreasing with increasing distance to the winner, e.g. Gaussian.

The movement of weight vectors has the consequence that the Euclidean distance between instances and weight vectors decreases. So, the weight vectors become more similar to the instance. Hence, the respective unit is more likely to win at future presentations of this instance. The consequence of adapting not only the winner but also a number of units in the neighborhood of the winner leads to a spatial clustering of similar instances in neighboring parts of the self-organizing map. Existing similarities between instances in the $n$-dimensional input space are reflected within the 2D output space of the self-organizing map. In other words, the training process of the self-organizing map describes a topology-preserving mapping from a high-dimensional input space onto a 2D output space. Such a mapping ensures that instances which are similar in terms of the input space are represented in spatially adjacent regions of the output space.
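The complete training loop fits in a few lines. The following Python sketch is a didactic implementation of the update rule (1) with a Gaussian neighborhood and linearly decreasing learning rate; the grid size, decay schedules, and random initialization are illustrative choices, not those used in The MediaSquare.

```python
import numpy as np

def train_som(X, rows=10, cols=10, iters=10000, alpha0=0.5, sigma0=5.0):
    """Minimal SOM training loop implementing update rule (1)."""
    n = X.shape[1]
    W = np.random.rand(rows * cols, n)                # weight vectors m_i
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for t in range(iters):
        x = X[np.random.randint(len(X))]              # random instance x(t)
        c = np.argmin(np.linalg.norm(W - x, axis=1))  # winner: lowest activation
        alpha = alpha0 * (1 - t / iters)              # decreasing learning rate
        sigma = sigma0 * (1 - t / iters) + 1e-3       # shrinking neighborhood
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)    # squared grid distances
        h = np.exp(-d2 / (2 * sigma ** 2))            # Gaussian h_ci(t)
        W += alpha * h[:, None] * (x - W)             # adaptation rule (1)
    return W.reshape(rows, cols, n)

# e.g. W = train_som(np.random.rand(500, 16), rows=8, cols=8, iters=5000)
```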

3.2 Text Feature Extraction

In order to use the SOM for organizing documents based on their topics, a vector-based description of the content of the documents needs to be created. While manually or semi-automatically extracted content descriptors may be used, research results have shown that a rather simple word frequency based description is sufficient to provide the necessary information in a very stable way [5, 19, 27, 31]. For this word frequency based representation, a vector structure is created consisting of all words appearing in the document collection. Stop words, i.e. words that do not contribute to content representation and topic discrimination between documents, are usually removed from this list of words. Again, while manually crafted stop word lists may be used, simple statistics allow the removal of most stop words in a very convenient and language- and subject-independent way. On the one hand, words appearing in too many documents, say, in more than half of all documents, can be removed without the risk of losing content information, as the content conveyed by these words is too general. On the other hand, words appearing in only a small number of documents can be omitted for content-based classification, as the resulting sub-topic granularity would be too small to form a topical cluster in its own right. Note that the situation is different in the information retrieval domain, where rather specific terms need to be indexed to facilitate retrieval of a very specific subset of documents. In this respect, content-based organization and browsing of documents constitutes a conceptually different approach to accessing document archives and interacting with them by browsing topical hierarchies. This obviously has to be supplemented by various searching facilities, including information retrieval capabilities as they are currently realized in many systems.

The documents are described by the words they are made up of within the resulting feature space, usually consisting of thousands of dimensions, i.e. distinct terms. While basic binary indexing may be used to describe the content of a document by simply stating whether or not a word appears in the document, more sophisticated schemes such as $tf \times idf$, i.e. term frequency times inverse document frequency [34], provide a better content representation. This weighting scheme assigns higher values to terms that appear frequently within a document, i.e. have a high term frequency, yet rarely within the complete collection, i.e. have a low document frequency. Usually, the document vectors are normalized to unit length to make up for length differences of the various documents.
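A minimal sketch of this indexing pipeline is given below in Python; the document-frequency thresholds are arbitrary examples. It drops overly frequent and overly rare terms, weights the remainder by $tf \times idf$, and normalizes each document vector to unit length.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=2, max_df_ratio=0.5):
    """Build unit-length tf-idf vectors over a pruned vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    # drop terms in too few (< min_df) or too many (> half) documents
    vocab = sorted(t for t, f in df.items()
                   if min_df <= f <= max_df_ratio * len(docs))
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        v = [0.0] * len(vocab)
        for t, f in tf.items():
            if t in index:
                v[index[t]] = f * math.log(len(docs) / df[t])  # tf * idf
        norm = math.sqrt(sum(w * w for w in v)) or 1.0
        vectors.append([w / norm for w in v])                  # unit length
    return vocab, vectors
```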

3.3 Audio Feature Extraction

Content-based access to audio files, particularly music, requires the development of feature extraction techniques that capture the acoustic characteristics of the signal and allow the computation of a similarity between pieces of music that resembles the acoustic similarity perceived by a listener. A feature set suitable for describing acoustic characteristics in music are Rhythm Patterns. The algorithm was first introduced in [30] and improved later by the inclusion of psycho-acoustic transformations in [32]. The feature set has proven to be applicable to both the classification of music into genres [20] and the automatic clustering of music archives according to the perceived sound similarity [23]. A Rhythm Pattern describes fluctuations on critical frequency bands of the human auditory range and thus reflects the rhythmical structure of a piece of music. The algorithm for extracting a Rhythm Pattern is a two-stage process: First, from the spectral data, the specific loudness sensation in Sone is computed for 24 critical frequency bands. Second, this Sonogram is transformed into a time-invariant domain, resulting in a representation of modulation amplitudes per modulation frequency.

In more detail, in the first part, a Short Time Fourier Transform (STFT) is applied to compute a Spectrogram, whose frequency bands are then grouped according to the Bark scale into 24 psycho-acoustically motivated critical bands. Successively, the Bark-scale Spectrogram is transformed into the decibel, Phon and Sone scales, resulting in a power spectrum that reflects human loudness sensation. In the second part, by applying a Fourier Transform (FFT), the spectrum is transformed into a time-invariant representation showing the magnitude of amplitude modulations for different modulation frequencies on the 24 critical bands. These amplitude modulations have different effects on the human hearing sensation depending on their frequency. The most significant is referred to as fluctuation strength, which is most intense at 4 Hz and decreases towards 15 Hz. Consequently, a fluctuation strength weighting curve is applied, followed by a gradient filter and Gaussian smoothing, to improve the resemblance between two Rhythm Patterns.

A Rhythm Pattern is typically computed for every third segment of 6 s length in a song, and the feature set for a song is computed by taking the median of the multiple Rhythm Patterns. A Rhythm Pattern constitutes a comparable representation of a song, which may be used to compute the similarity between two songs using a distance measure such as the Euclidean distance or any other metric. Thus, the RP may be used as input for a self-organizing map in order to automatically compute a similarity-based organization of a music collection.
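The following Python sketch conveys the two-stage structure of the algorithm in strongly simplified form: it replaces the Bark-scale grouping and the Phon/Sone loudness transforms of [30, 32] with a crude linear band grouping and a plain decibel spectrum, and omits the fluctuation strength weighting, so it is an outline of the idea rather than a faithful reimplementation. All parameter values are arbitrary placeholders.

```python
import numpy as np

def rhythm_pattern(signal, sr=11025, n_fft=512, n_bands=24, max_mod_hz=10.0):
    """Simplified outline: spectrogram -> band grouping -> dB loudness
    -> FFT over time per band (modulation amplitudes)."""
    hop = n_fft // 2
    window = np.hanning(n_fft)
    frames = np.array([signal[i:i + n_fft] * window
                       for i in range(0, len(signal) - n_fft, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))            # (frames, bins)
    # crude linear grouping into n_bands (the original uses Bark bands)
    bands = np.stack([b.mean(axis=1)
                      for b in np.array_split(spec, n_bands, axis=1)])
    loud = 10.0 * np.log10(bands + 1e-10)                 # dB loudness proxy
    mod = np.abs(np.fft.rfft(loud, axis=1))               # modulation spectrum
    mod_freqs = np.fft.rfftfreq(loud.shape[1], d=hop / sr)
    return mod[:, mod_freqs <= max_mod_hz]                # bands x mod. freqs

# Hypothetical usage on six seconds of audio samples
pattern = rhythm_pattern(np.random.randn(6 * 11025))
```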

4 The System Architecture

The goal of The MediaSquare is to provide a 3D virtual environment allowing multiple users to explore large multimedia repositories such as music or text collections as well as video or image galleries. The underlying system architecture is depicted in Fig. 2, and the technological building blocks are described in the following section. The core of the system is the Torque3 game engine, which is designed according to a strict client-server architecture. When the Torque game engine is executed on a single computer, both client and server are handled by the same machine. Communication between client and server is enabled by means of a very robust networking protocol which allows accurate update rates even over low bandwidth Internet connections. The Torque server is responsible for the execution of the virtual environment. This includes the instantiation of the environment on startup and the coordination of objects and users. The Torque client is mainly responsible for the user interface and the audiovisual representation of the virtual environment. The "game" logic is written in TorqueScript, which is compiled into byte-code before processing. Additionally, the source code of the engine can be adapted and extended to realize more complex and time-critical tasks. 3D objects, textures and sound files are stored in a special folder that resides on both the server and the client. Whenever a client connects to the server, the version of each file is checked and, if necessary, updated files are automatically downloaded to the client.

3 http://www.garagegames.com.


Fig. 2. The system architecture

The Torque game engine runs on all major operating systems. It provides a comprehensive set of design and development tools, including a World Editor, a GUI Editor and a Terrain Editor, which assist perfectly during the creation of arbitrary games. Moreover, it offers multi-player network code, seamless indoor/outdoor rendering engines, state of the art skeletal animation, drag and drop GUI creation, and a C-like scripting language. For a smooth execution, Torque requires an Intel Pentium 4 processor, 128 MB RAM with an OpenGL compatible 3D graphics accelerator card. In addition to that, and unlike most commercial game engines, the source code of the engine is distributed as part of the low cost royalty-free licensing policy, which facilitates the creation of the 3D multimedia environment The MediaSquare.

Fig. 3. Directory-based mapping of a slide show

The system is designed to enable access to various media types, namely audio, image and video, as well as text. To this end, the system offers directory-based mapping and automatic content-based organization of multimedia data. In order to integrate the respective data items into the environment, a number of preprocessing steps need to be applied. In the case of directory-based mapping, the data is manually organized according to a predefined directory structure, as depicted in Fig. 3. The first-level elements in the directory structure are folders grouping related data. On the second level in the directory structure, i.e. in sub-folders, the actual data is stored. Consider, for example, three slide shows consisting of five, six and four slides, respectively (cf. Fig. 3). In this case, the slides of each particular slide show are converted into a set of separate image files, which are, in turn, used as textures of presentation screens in the virtual environment. Every folder, regardless of its hierarchy level, contains a metadata information file describing the content of the folder in terms of title, author, date and the number of sub-elements.
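A hypothetical Python sketch of this preprocessing step is shown below. The metadata file name, its JSON format, and the image file extension are assumptions for illustration; the chapter does not specify the actual file formats.

```python
import json
from pathlib import Path

def scan_repository(root):
    """Walk a two-level directory hierarchy and collect each folder's
    metadata file (assumed here to be 'info.json') plus the media files
    of every sub-folder."""
    entries = []
    for group in sorted(Path(root).iterdir()):
        if not group.is_dir():
            continue
        meta = json.loads((group / "info.json").read_text())
        items = [{"meta": json.loads((sub / "info.json").read_text()),
                  "files": sorted(str(f) for f in sub.glob("*.jpg"))}
                 for sub in sorted(group.iterdir()) if sub.is_dir()]
        entries.append({"group": group.name, "meta": meta, "items": items})
    return entries
```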

In the case of content-based organization, self-organizing maps are employed to automatically structure the media data. Depending on the media type, the corresponding feature extraction technique is selected, i.e. when text documents are processed, the term-based feature extraction approach is used, and in the case of music files, the audio feature extraction approach is employed. These features are used as input for the self-organizing map algorithm. The resulting 2D map is described in terms of a Unit file, which determines the final position of each data item along with the dimensions of the map.

Fig. 4. Segmentation of a SOM by means of Marker areas

The Torque Mission file is used to specify the characteristics of the virtual environment. This includes properties such as, for instance, the topology of the terrain, the positions of static objects such as buildings as well as interiors, and environmental entities such as the sun or the sky. In order to enable access to the media within the virtual world, Marker areas describing designated places in the virtual world, which specify the position of the media, are created. These Marker areas are rectangular-shaped objects that are invisible during runtime. Additionally, these objects contain properties that specify the underlying SOM and which specific parts thereof are selected for a particular area in the environment. On the one hand, this allows for the representation of more than one SOM in the virtual environment and, on the other hand, it enables the segmentation of a single SOM. Figure 4 depicts the segmentation process of a SOM by means of Marker areas. In particular, a 4 × 4 SOM is divided into two segments consisting of two 2 × 4 SOMs. Each segment is mapped onto one room in the virtual environment, i.e. Segment 1 is mapped onto Room 1 and Segment 2 is mapped onto Room 2. In the context of directory-based mapping, Marker areas indicate the position where the automatic layout process of corresponding objects starts off. The properties of the Marker area object specify the URI of the associated directory and the algorithm to lay out the graphical representations of the directory. Three different algorithms to lay out the directory structure are provided. The linear layout algorithm aligns objects, such as buildings, along a straight path. The generated layout is comparable to a residential area with detached houses. When using the circular layout algorithm, objects are arranged along a circle with radius r. In the case of the matrix-style layout, buildings are arranged similar to a checker board, whereas the distance between the buildings can be freely defined.
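As an illustration of the circular variant, the following Python sketch places n object representations at equal angular spacing on a circle of radius r, rotating each to face the center; the coordinate and rotation conventions are our own assumptions, not Torque specifics.

```python
import math

def circular_layout(n_objects, radius, center=(0.0, 0.0)):
    """Place n objects evenly on a circle of the given radius,
    returning (x, y, yaw_degrees) tuples with each object facing
    the center of the circle."""
    cx, cy = center
    positions = []
    for k in range(n_objects):
        angle = 2 * math.pi * k / n_objects
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        yaw = math.degrees(angle) + 180.0   # rotate to face inward
        positions.append((x, y, yaw))
    return positions

print(circular_layout(6, radius=50.0))
```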

The Repository contains templates of objects that can be used to visually represent the media in the virtual environment. More precisely, a template for a SOM unit consists of interiors, a label describing the media and an object that graphically represents the media. In the case of directory-based mapping, the template is a container for multiple media objects, including labels for the container and the media itself. These templates are created with the Torque World Editor and stored in the Repository.

A Wrapper application processes the Unit files generated by the SOM as well as the manually compiled directory. The Wrapper scans the Mission file for Marker areas and imports the associated templates from the Repository. Then it loads the SOM-generated Unit files as well as the manually generated directory hierarchies into its internal object structure. It calculates the position and rotation of each media object. Subsequently, the Wrapper writes the information about all representative objects into an Objects file and creates the Playlists for the Icecast4 multimedia streaming server. Since Torque does not provide network based streaming of audio files, it was necessary to adapt the audio emitter class of the game engine. The new implementation takes advantage of FMOD,5 a very flexible sound API that offers native support for HTTP-streaming. Objects instantiating this audio emitter class may be positioned in the virtual environment and enable broadcasting of MP3 audio streams as spatial sound. On the client side, an internal audio manager continuously determines the distances between the user's avatar and all audio sources in its vicinity. Only those audio sources that are closer than a certain distance to the user are actually streaming music.
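The distance test performed by the client-side audio manager might look like the following Python sketch; the distance threshold, data layout, and field names are assumptions for illustration only.

```python
import math

def active_sources(avatar_pos, sources, max_dist=30.0):
    """Enable streaming only for audio sources within max_dist of the
    avatar; returns the (url, distance) pairs currently playing."""
    ax, ay, az = avatar_pos
    playing = []
    for src in sources:
        sx, sy, sz = src["pos"]
        dist = math.sqrt((ax - sx) ** 2 + (ay - sy) ** 2 + (az - sz) ** 2)
        src["streaming"] = dist <= max_dist   # start/stop the MP3 stream
        if src["streaming"]:
            playing.append((src["url"], dist))
    return playing

# Hypothetical usage with two coffee-table loudspeakers
tables = [{"pos": (10.0, 5.0, 0.0), "url": "http://host:8000/a.mp3"},
          {"pos": (90.0, 5.0, 0.0), "url": "http://host:8000/b.mp3"}]
print(active_sources((12.0, 4.0, 0.0), tables))
```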

Automatically generated textures are used to label the media. In the case of directory-based mapping, the information files are used to generate labels describing the contents of each directory. For the content-based organization, the information stored in the SOM Unit files is used to determine the labels' descriptions. When the featured media is music, each corresponding audio emitter object is labeled with a playlist that is automatically created by extracting the ID3-tags from the music files.

When the Torque server starts up, it loads the Mission file and creates the 3D environment. Then it processes the Objects file and dynamically adds all media objects. In parallel, the Icecast server is started, enabling audio streaming in the virtual environment. After that, the system is up and running and ready to accept connections from Torque clients.

5 The MediaSquare

The current implementation of The MediaSquare covers two scenarios of multimedia content presentation. First, a 3D Music Showroom providing access to a music collection which was automatically organized based on the audio content was implemented and, second, a 3D Scientific Library was realized. This library enables users to explore scientific documents such as posters or papers in this immersive 3D environment. On the one hand, a directory structure was used to create a room structure in which the content is presented. On the other hand, characteristic text features were extracted from documents and were used for the training of a self-organizing map.

4 http://www.icecast.org.
5 http://www.fmod.de.

In order to participate in The MediaSquare, the client application6 needs to be downloaded and installed. On the start screen, users can either change the display settings or proceed to the virtual world. When clicking on the start-button, the user can select her favorite avatar and enter a name. After that, the avatar is placed right in the center of The MediaSquare. The avatar is navigated by means of the keyboard and its viewpoint is controlled via the mouse. On pressing F4, a little chat window appears. When pressing "c", a conversation with other users can be started. Everything that is said will be heard by others in the vicinity.

In the 3D Music Showroom, users can listen to streamed music originating from loudspeakers on coffee tables (cf. Fig. 5). The music collection used in The MediaSquare is from Magnatune,7 which is distributed under the creative commons license for non-commercial use. This particular collection contains about 1,500 MP3 files featuring the genres classical, electronic, jazz, blues, pop, rock, metal, punk and world music. Since the Icecast multimedia streaming server broadcasts music like a radio station, it is ensured that all users are listening to the same music track when at the same location. Depending on the user's position relative to the audio sources, one or more spatialized audio streams are audible. When the user's avatar is close to an audio source, a head-up display (HUD) shows the currently playing track as well as the corresponding playlist. This HUD can be toggled with the key F6.

Fig. 5. 3D Music Showroom with head-up display

6 Available for download at http://mediasquare.ec3.at.
7 http://magnatune.com.


On the lower left of the screen, detailed information about the currently playing track is displayed. When clicking the left mouse button, the audio stream of the respective source skips to the next track and all users close to this particular source will notice the change.

When reflecting on the above, alternative implementations of the same concept can be considered, for example, a collection of music that is automatically organized by means of sound similarity and visualized as a music store. Replacing coffee tables with shelves and playlist menus with CDs results in a visual representation similar to its real-life counterpart. So, the same principles of content organization can be employed, whereas the resulting visualization, user interaction and means for product consumption differ completely from the original implementation.

The 3D Scientific Library of The MediaSquare provides access to the scien-tific results of the EU FP6 Network of Excellence on Multimedia Understand-ing through Semantics, Computation, and Learning (MUSCLE8). In this case,a directory structure is used to create a layout of rooms in which the contentis presented. The presentations are grouped according to different scientificmeetings and each directory contains several presentations that have beengiven there. The information file associated with each meeting describes thelocation and the date it was held. In case of a presentation the information filecontains the title and the authors’ names. This directory structure is mappedinto the 3D virtual environment according to the circular lay out algorithm.As a result, the environment contains several buildings whereof each repre-sents a particular meeting as shown in Fig. 6. Labels describing the meetings’

Fig. 6. Circular mapping of a directory structure in the 3D Scientific Library

8 http://www.muscle-noe.org.


Fig. 7. Floor plan of the building (left) and enlargement of the main hall (right) with SOM unit positions and topic labels

locations and dates are placed at the corresponding entrances. These buildings feature presentation screens that are attached to the walls and are used for data visualization. Labels containing the metadata are attached exactly opposite each presentation screen. The textures of the screens change in predefined time intervals.

In another building, automatically organized scientific posters are presented. In particular, the poster contributions to the WWW conference in 2006 are on display. Figure 7 shows the floor plan of the building. It consists of a main hall and a gallery that can be reached via stairs. The poster contributions are arranged according to the mapping of a rectangular SOM consisting of 4 × 3 units (main hall) and a U-shaped MnemonicSOM [24] that fits the gallery. The enlargement of the floor plan shows the layout of the poster topics. As an example, posters dealing with ontologies, the Semantic Web and corresponding technologies and standards are located in the top row. In the middle row, we find well-separated clusters dealing with Semantic Web applications for museums, the social Internet, as well as more technical contributions in the area of information retrieval and Web crawling. In the bottom row, the topics of association rule mining, clustering, news feed analysis and link analysis are located. A screenshot depicting a slightly elevated view from the bottom left corner of the room is presented in Fig. 8. The different sizes of the poster stands are automatically determined and depend on the number of posters assigned to the respective SOM units.
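
To make the SOM-based arrangement step concrete, the following minimal sketch (in Python, using numpy) maps a set of poster feature vectors onto a 4 × 3 rectangular map, so that posters assigned to the same unit would share a poster stand. The training loop, parameter values and the random feature vectors are illustrative assumptions, not the implementation used in The MediaSquare.

    import numpy as np

    def train_som(data, rows=4, cols=3, iters=2000, lr0=0.5, sigma0=1.5, seed=0):
        """Train a small rectangular SOM; returns the (rows*cols, dim) codebook."""
        rng = np.random.default_rng(seed)
        weights = rng.random((rows * cols, data.shape[1]))
        # Grid coordinates of every unit, used by the neighbourhood function.
        grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
        for t in range(iters):
            x = data[rng.integers(len(data))]       # pick a random input vector
            frac = t / iters
            lr = lr0 * (1.0 - frac)                 # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.01    # decaying neighbourhood width
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)         # squared grid distance
            h = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian neighbourhood
            weights += lr * h[:, None] * (x - weights)
        return weights

    # Hypothetical feature vectors for 20 posters (e.g. 50-dim tf-idf features).
    posters = np.random.default_rng(1).random((20, 50))
    codebook = train_som(posters)
    # Assign every poster to its best-matching unit on the 4 x 3 map.
    units = [int(np.argmin(((codebook - p) ** 2).sum(axis=1))) for p in posters]
    print(units)

Posters mapped to the same unit would then be rendered on the same stand, with the stand size growing with the number of assigned posters, in the spirit of the arrangement described above.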


Fig. 8. View from the bottom left corner of the main hall of the 3D Scientific Library

6 Conclusions

In this chapter, we have described The MediaSquare, a synthetic 3D multimedia environment that allows multiple users to collectively explore multimedia data and interact with each other. The data is organized within the 3D virtual world either based on content similarity, or by mapping a given structure (e.g. a branch of a file system hierarchy) into a room structure. With this system it is possible to take advantage of spatial metaphors such as relations between items in space, proximity and action, common reference and orientation, as well as reciprocity. In this context it is essential to refer to Friedman [11]. In his seminal book on the phenomenon of the flattening of the globe, he emphasizes that the world is, in effect, becoming smaller. This "shrinking" is caused by the lightning-swift advances in technology and communications which put people all over the globe in touch as never before. Environments such as The MediaSquare support this trend by allowing geographically separated individuals to immerse themselves in a collaborative virtual environment, interact with each other and collectively experience the featured content. In a nutshell, The MediaSquare presents an impressive showcase for combining state-of-the-art multimedia feature extraction approaches and unsupervised neural networks, assembled in an immersive 3D multimedia content presentation environment.

Current approaches (cf. Sect. 2) for the visualization of and interaction with large data collections mainly focus on single media or document types and their arrangement within space. The MediaSquare, however, provides access to several media types in one integrated environment. As of now, the showcase comprises audio, images, and text documents arranged by means


of an automatic content-based clustering algorithm. Additionally, The MediaSquare transcends other approaches by providing an environment that fosters social interaction among users while they collaboratively experience the content on display.

Future work includes improved user interface capabilities, a tighter integration of the single components of the system, as well as the integration of additional feature extraction modules for other media types. Moreover, the second scenario, S2, aiming at the realization of a 3D Video and Image Showroom, will be implemented. To this end, methods for extracting characteristic features from the respective images or videos need to be included. The training of a self-organizing map will be based on these features and, in analogy to the first scenario, the resulting 2D map will identify the actual position of each particular image or video source within the 3D Video and Image Showroom.

Acknowledgments

This work was partially funded by the Austrian Federal Ministry of Economics and Labour under the Kind research program and the MUSCLE Network of Excellence (project reference: 507752).

References

1. H. Ahonen, O. Heinonen, M. Klemettinen, and A.I. Verkamo. Applying data mining techniques for descriptive phrase extraction in digital documents. In Proceedings of the Advances in Digital Libraries Conference (ADL98), page 2, Santa Barbara, CA, 1998. IEEE Computer Society.

2. R.B. Allen, P. Obry, and M. Littman. An interface for navigating clustered document sets returned by queries. In Proceedings of the Conference on Organizational Computing Systems (COCS93), pages 166–171, Milpitas, CA, 1993. ACM.

3. M.Q. Baldonado and T. Winograd. SenseMaker: An information-exploration interface supporting the contextual evolution of a user's interests. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 11–18, Atlanta, GA, 1997. ACM.

4. E. Castronova. Synthetic Worlds: The Business and Culture of Online Games. University of Chicago Press, Chicago, IL, 2005.

5. H. Chen, C. Schuffels, and R. Orwig. Internet categorization and search: A self-organizing approach. Journal of Visual Communication and Image Representation, 7(1):88–102, 1996.

6. M. Christoffel and B. Schmitt. Accessing libraries as easy as a game. In JCDL 2002 Workshop: Visual Interfaces to Digital Libraries, pages 25–38, London, UK, 2002. Springer.

7. M.F. Costabile, F. Esposito, G. Semeraro, N. Fanizzi, and S. Ferilli. Interacting with IDL: The adaptive visual interface. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 515–534, Heraklion, Greece, 1998. Springer.


8. P. Cubaud, J. Dupire, and A. Topol. Fluid interaction for the document in context. In Proceedings of the 2007 Conference on Digital Libraries (JCDL'07), page 504, Vancouver, Canada, 2007. ACM.

9. P. Cubaud, C. Thiria, and A. Topol. Experimenting a 3D interface for the access to a digital library. In Proceedings of the Third ACM Conference on Digital Libraries (DL'98), pages 281–382, Pittsburgh, PA, 1998. ACM.

10. M.J. Dovey. A technique for "regular expression" style searching in polyphonic music. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2001), 2001.

11. T.L. Friedman. The World is Flat: A Brief History of the Twenty-First Century. Farrar, Straus and Giroux, New York, 2005.

12. A. Ghias, J. Logan, D. Chamberlin, and B.C. Smith. Query by humming: Musical information retrieval in an audio database. In Proceedings of the Third ACM International Conference on Multimedia, pages 231–236, New York, NY, USA, 1995. ACM.

13. S. Greenberg and M. Roseman. Using a room metaphor to ease transitions in groupware. In M. Ackerman, V. Pipek, and V. Wulf, editors, Sharing Expertise: Beyond Knowledge Management, pages 203–256. MIT, Cambridge, MA, January 2003.

14. M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. SIGIR Forum, 31(SI):246–255, 1997.

15. T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WEBSOM – Self-organizing maps of document collections. In Proceedings of the Workshop on Self-Organizing Maps (WSOM97), Espoo, Finland, 1997.

16. P. Knees, M. Schedl, T. Pohle, and G. Widmer. An innovative three-dimensional user interface for exploring music collections enriched with meta-information from the web. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 17–24. ACM, 2006.

17. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69, 1982.

18. T. Kohonen. Self-Organizing Maps. Springer, Berlin Heidelberg New York, 1995.

19. T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela. Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000.

20. T. Lidy and A. Rauber. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 34–41, London, UK, September 11–15, 2005.

21. Y. Liu, P. Dantzig, M. Sachs, J. Corey, M. Hinnebusch, T. Sullivan, M. Damashek, and J. Cohen. Visualizing document classification: A search aid for the digital library. In Proceedings of the European Conference on Digital Libraries, Heraklion, Greece, 1998.

22. D. Lubbers. SoniXplorer: Combining visualization and auralization for content-based exploration of music collections. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 590–593, 2005.

23. R. Mayer, T. Lidy, and A. Rauber. The Map of Mozart. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Victoria, Canada, October 8–12, 2006.


24. R. Mayer, D. Merkl, and A. Rauber. Mnemonic SOMs: Recognizable shapes for self-organizing maps. In M. Cottrell, editor, Proceedings of the Fifth Workshop on Self-Organizing Maps (WSOM'05), pages 131–138, Paris, France, September 5–8, 2005.

25. R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, and S.J. Cunningham. Towards the digital music library: Tune retrieval from acoustic input. In Proceedings of the First ACM International Conference on Digital Libraries (DL'96), pages 11–18, New York, NY, USA, 1996. ACM.

26. D. Merkl and A. Rauber. Automatic labeling of self-organizing maps for information retrieval. In Proceedings of the International Conference on Neural Information Processing (ICONIP'99), Perth, WA, 1999.

27. D. Merkl and A. Rauber. Document classification with unsupervised neural networks. In F. Crestani and G. Pasi, editors, Soft Computing in Information Retrieval, pages 102–121. Physica, 2000.

28. R. Neumayer, M. Dittenbach, and A. Rauber. PlaySOM and PocketSOMPlayer: Alternative interfaces to large music collections. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 618–623, 2005.

29. A. Pejtersen. A library system for information retrieval based on cognitive task analysis and supported by an icon-based interface. In Proceedings of the Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'89), 1989.

30. A. Rauber and M. Fruhwirth. Automatically analyzing and organizing music archives. In Proceedings of the 5th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2001), Springer Lecture Notes in Computer Science, Darmstadt, Germany, September 4–8, 2001. Springer.

31. A. Rauber and D. Merkl. Text mining in the SOMLib digital library system: The representation of topics and genres. Applied Intelligence, 18(3):271–293, 2003.

32. A. Rauber, E. Pampalk, and D. Merkl. Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In Proceedings of the 3rd International Symposium on Music Information Retrieval (ISMIR 2002), pages 71–80, Paris, France, October 13–17, 2002.

33. G. Robertson, S. Card, and J. Mackinlay. Information visualization using 3D interactive animation. Communications of the ACM, 36(4):57–71, 1993.

34. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

35. I. Stavness, J. Gluck, L. Vilhan, and S. Fels. The MUSICtable: A map-based ubiquitous system for social interaction with a digital music collection. In Proceedings of the 4th International Conference on Entertainment Computing (ICEC 2005), 2005.

36. M. Torrens, P. Hertzog, and J.-L. Arcos. Visualizing and exploring personal music libraries. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), 2004.

37. W.-H. Tsai. A query-by-example technique for retrieving cover versions of popular songs with similar melodies. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 183–190, 2005.

38. G. Tzanetakis and P. Cook. Marsyas3D: A prototype audio browser-editor using a large scale immersive visual and audio display. In Proceedings of the International Conference on Auditory Display, 2001.


39. B.S. Woodcock. An analysis of MMOG subscription growth. http://www.mmogchart.com/.

40. N. Yee. The psychology of massively multi-user online role-playing games: Emotional investment, motivations, relationship formation, and problematic usage. In R. Schroeder and A. Axelsson, editors, Avatars at Work and Play: Collaboration and Interaction in Shared Virtual Environments, volume 34 of Computer Supported Cooperative Work. Springer, Heidelberg, Germany, 2005.


Robotics and Virtual Reality: A Marriage of Two Diverse Streams of Science

Tauseef Gulrez1, Manolya Kavakli1, and Alessandro Tognetti2

1 Virtual Interactive Simulations of Reality (VISOR) Research Group, Department of Computing, Division of Information and Communication Sciences, Macquarie University, Sydney, NSW 2109, Australia. Corresponding author: tgulrez@ics.mq.edu.au

2 Interdepartmental Research Center "E. Piaggio", Faculty of Engineering, University of Pisa, Italy

Summary. In an immersive computationally intelligent virtual reality (VR) environment, humans can interact with a virtual 3D scene and navigate a robotic device. The non-destructive nature of VR makes it an ideal testbed for many applications and a prime candidate for use in rehabilitation robotics simulation and patient training. We have developed a testbed for robot mediated neurorehabilitation therapy that combines the use of robotics, computationally intelligent virtual reality and haptic interfaces. We have employed the theories of neuroscience and rehabilitation to develop methods for the treatment of neurological injuries such as stroke, spinal cord injury, and traumatic brain injury. As sensor input we have used two state-of-the-art technologies, representing two different approaches to solving the mobility loss problem. In our first experiment we used a shirt laden with 52 piezoresistive sensors as an input device to capture the residual signals arising from the patient's body. In our second experiment, we used a precision position tracking (PPT) system to capture the same signals from the patient's upper body movement. The key challenge in both of these experiments was to accurately localise the movement of the object in reality and map its corresponding position in 3D VR. In this book chapter, we describe the basic theory of the development phase and of the operation of the complete system. We also present some preliminary results obtained from subjects using upper body postures to control the simulated wheelchair.

1 Introduction

Vision is considered the most effective sensor of the human body. In school we learnt that 70% of human body sensing relies solely upon human vision (the eyes). In our view, human perception is heavily correlated with the human experiences built over time. It would not be wrong to say that humans build a repository of their life experiences over time, similar to a computer program's database. This repository of information is also known as the Internal Models of


the Human Mind [16, 21, 32, 40, 41, 44, 58, 63]. Life experiences are temporarily stored as information in the hippocampus area of the human brain and later shifted to the neocortex area of the brain. Obviously not all experience information is shifted to the neocortex, but the most significant information from our experiences is stored as a function file. Later, whenever human beings want to evaluate a new experience, they recall those repository functions and estimate their predictions based upon the new experiences. Upon getting the results they correct their estimates with the innovation gains, just like Bayes Rule [8], i.e.

Posterior ∝ Likelihood × Internal Models. (1)

This idea of relating Bayes Rule to the human mind has been shown by many researchers [7, 23, 29, 32, 61, 63].
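
As a purely illustrative aside, the update in Eq. (1) can be written in a few lines of Python for a discrete set of hypotheses; the prior and likelihood numbers below are hypothetical and serve only to show the normalised posterior computation.

    import numpy as np

    # Discrete illustration of Eq. (1): posterior ∝ likelihood × internal model (prior).
    prior = np.array([0.7, 0.3])        # internal model: belief over two contexts
    likelihood = np.array([0.2, 0.9])   # P(observation | context)

    posterior = likelihood * prior
    posterior /= posterior.sum()        # normalise so the beliefs sum to one
    print(posterior)                    # -> [0.3415 0.6585] (approximately)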

The question that arises from the above discussion is whether or not it is possible to deceive the human central nervous system [46] by showing artificial surrounding (virtual immersive) scenes, generated through sophisticated computerised projectors, which can show a context similar to the human mind's repository information (internal models). The idea is to make the human mind feel that the computerised projection of the virtual environment is exactly the same as what is experienced every day, e.g. a simulated office environment or house that human beings are familiar with. Although 70% of human sensing comes from vision, the remaining 30% is composed of other sensing organs such as touch, smell, feel and, most importantly, the human vestibular system. The human vestibular system enables us to orientate ourselves, and it also gives us a feel for acceleration and velocity whilst sitting in a moving car or train, for example. With virtual reality technology, we can make a human being feel that he is immersed in an artificial reality environment, but it is difficult to deceive the other 30% of sensory receptors in the same environment. For example, in a virtually projected garden with pathways and flower beds, the lack of fragrance and a cool breeze will eliminate the illusion of being in a real garden, since human beings have never encountered a garden without the smell of flowers, etc. The human internal model has to be altered in order for humans to feel the sense of the fabricated virtual environment, in this case a garden. Presently, the only option available to recreate the virtual environment is through a fully immersive virtual reality environment; in addition, we can incorporate sensing devices like data gloves or motion tracking systems to make the user interactive in that environment.

The rest of the book chapter is organised as follows. In the next few paragraphs we describe the advances happening in the fields of computationally intelligent VR and robotics, followed by disability around the world and an overview of human brain anatomy. In the second section we describe the novel sensor shirt and the precision position tracking system interfaced with the computationally intelligent VR system. In the third section we describe the prototype testbed for stroke and spinal cord injured patients, followed by concluding remarks.


Advances in Virtual Reality

Virtual reality has been instrumental in bringing research into the real world. VR has been conceived as a tool to liberate consciousness, a digital mandala for the cyberian age. VR refers to computer-generated, interactive, 3D environments into which people are immersed. It provides a way for people to visualize, manipulate, and interact with simulated environments through the use of computers and extremely complex data. Although still a developing technology, VR has already been successfully integrated into several aspects of medicine and psychology. For example, VR is being used in the training of surgical procedures [57], the education of patients and medical students [38] and the treatment of psychological dysfunction including phobias [56], post-traumatic stress disorder [55], and eating and body image disorders [54]. Pain management methods integrating VR to distract patients' attention from uncomfortable procedures, such as dental work, chemotherapy, and burn wound care, have also produced encouraging results [11, 26, 45]. Additionally, a number of researchers have integrated VR into the assessment and rehabilitation of cognitive processes, such as visual perception and executive functions [53], and for training fundamental activities of daily life, such as using public transport [10] and meal preparation tasks [14]. The most important feature of VR is its non-destructive testing nature, which enables us to conduct complicated and dangerous experiments which in reality could lead to fatal human injuries. Similarly, in order to understand the functionality of the brain, we need a device that can help perform human motor tasks and build an environment which can help neuroscientists to find solutions to research problems. VR makes it possible to rapidly present various rehabilitation tasks with no setup and breakdown time, and provides many more important possibilities that are not available with real-world applications, i.e. "distortions of reality". The properties of objects can be changed in an instant, and this element of surprise is critical for studying how the sensorimotor system reacts and adapts to new situations.

Disability at a Glance

According to the World Health Organisation's (WHO) 2006 World Report on Disability and Rehabilitation [6]:

An estimated 10% of the world's population – approximately 600 million people, of which 200 million are children – experience some form of physical, mental, or intellectual disability. The disabled population is growing as a result of factors such as population growth, ageing and medical advances that are prolonging life. These trends are creating considerable demands for health and rehabilitation services and require environmental and attitudinal changes.

In Australia, the most recent National Survey of Disability, Ageing and Carers (SDAC) [1], conducted in 1998 by the Australian Bureau of Statistics (ABS),


estimated that 3.6 million Australians (19%) had some form of disability [22]. Of these, 2.8 million (78%) had a core activity restriction in self-care, mobility or communication caused by their disability [1].

The Australian Institute of Health and Welfare [1, 3] has recently presented projections of the numbers of people with either profound or severe core activity restrictions for the years 2000–2031, based on the prevalence of disability in the SDAC [2, 3]. Projections indicate [2] a 70% increase in the number of older people with profound disability over the next 30 years. The main conditions associated with profound or severe core activity restriction in older Australians are musculoskeletal, nervous system, circulatory and respiratory conditions and stroke. There is a clear need for a capable testbed for scientific study on upper-extremity motion. Robotic devices, designed to interface with humans, have already led to great advances in both fundamental and clinical research on the sensory motor system.

Robotics Technology for Rehabilitation

Advancement in technology has always brought a light of hope for the disabled population. In particular, new state-of-the-art robotic devices have always played a vital role in the development of more effective rehabilitation devices. Robotics is considered an "angel of survival" in the disabled community [46]. The degree of impairment differs for each disability, including stroke, cerebral palsy, tetraplegia, paraplegia, amputation, arthritis, osteoarthritis and heart disease. Devices are modified according to the disabled person's ability, allowing him or her to operate the device in accordance with their degree of movement. Sip-and-puff or chin controlled electrically powered wheelchairs are a good example of modified robotic rehabilitation devices. Operating a robotic rehabilitation device usually requires training. As long as the patient operates a rehabilitation device under the supervision of an occupational therapist or clinician, they have low to no risk of harming themselves if they mishandle the device. In the worst case, the lack of appropriate training to control a rehabilitation device can lead to a fatal injury, and the technology, as a life saviour, can become a killer. Spinal cord injured patients [52], who generally have poor control of their upper body, are at greater risk of encountering difficulties and accidents when operating a powered wheelchair.

The use of robots for providing physiotherapy is a relatively new discipline within the area of medical robotics. It emerged from the idea of using robots to assist people with disabilities. The adaption of robotic devices to assist in neurorehabilitation was first identified by Hogan at MIT [27]. There is currently a high rate of expansion in the field of neurorehabilitation. This rapid growth can be attributed to several factors, the first being the emergence of hardware for haptics and advanced robotics that could be made to operate safely within a human workspace. The dramatic drop in the cost of computing along with the emergence of software to support real-time control further reduces


the cost of producing research prototypes and commercial products. This technological shift has been coupled with better knowledge of the rehabilitation process and the social need to provide high-quality treatment for an ageing population.

Neuroplasticity: Learning in Humans and Animals

Human behaviour is not constant. It can change over time as a result of experience [33, 40, 42]. Similarly, synaptic transmission is not constant. Synaptic transmission can change over time as a result of activity and other events in the central nervous system (CNS). It is speculated that there are two different memory stores (neocortex and hippocampus) in the CNS that have complementary properties (Fig. 1). The neocortex is thought to be limitless. There is virtually no detectable memory storage limit and it appears to be permanent, in the way that childhood memories can last for a lifetime. However, it is understood that the neocortex is a slow learning system, meaning it takes many repetitions for the neocortical system to learn something. By contrast, the hippocampus is thought to be a very rapid learning system where things are learnt after a single learning trial. However, it is believed that the hippocampus has a smaller capacity than the neocortex. It also has a temporary role: memory stored in the hippocampus [9, 20, 31, 37] exists for no longer than a week or two. During that time, memories are consolidated in the hippocampus and the neocortex performs the rehearsal process. Once the neocortex completes the rehearsal, the hippocampus then forgets that memory. This suggests that the hippocampus plays a time dependent role in handling memory. Similar results come from animal studies. If an animal is taught a behavioural task that involves the hippocampus, a lesion will form in the hippocampus in direct response to that new memory. After one or two weeks the lesion will disappear and the memory will be lost. However, the memory

Fig. 1. Cross-sectional view of the human brain and hippocampus area. Anatomy of the human brain, highlighting the hippocampal and cortex regions. Picture is courtesy of "Cerepus AS®", Norway


will be transferred to the neocortex at that point for long term recall. This evidence arises from the idea that memory has stages:

• Initial encoding stage; followed by
• Period of consolidation (i.e. period of lesion)

This conclusion shows that the transfer of activity is both intra- and inter-modal, and that where there is a need for the brain to reorganise to adapt to new circumstances, this reorganisation is not necessarily confined to the understood maps of the homunculus brain [47]. The fact that this reorganisation occurs even in mature adult humans is a primary justification for neurorehabilitation following a disability [33].

2 Need for Engineering Smart Body–Machine Interfaces

The human body is capable of learning even after stroke or injury [41, 43, 44]. Presently, wheelchair users have to practice how to control their device before perfecting their technique [18, 19, 51]. In this case the responsibility of learning resides solely with the patient. This situation is difficult for the patient and contrary to the marvels of advances in robotics and neuroscience [15, 43]. The purpose of this book chapter is to modify the current situation by introducing the novel idea of smart body–machine interfaces [17, 34, 44, 50], capable of learning and understanding patients' residual degrees of freedom and control. The idea behind the research is to take machines towards the patients, rather than the patients towards the machines. In order to engineer a smart body–machine interface, we exploit two closely coupled concepts: the residual degrees of freedom of spinal cord injured patients to control the assistive device, and the ability of the brain to reorganise movements after disability.

2.1 3D Immersive Virtual Reality System

3D Virtual Reality System

VR is inherently multidimensional [28]. As well as freedom of translation and rotation, in VR we can travel in scale and time [30]. Thus, the mental model of the environment we perceive changes as we travel in VR. We have developed a robotic VR training system, RIMS (Robotics Interactive Multisensory Simulation), for training stroke and spinal cord injured patients, using an immersive semi-cylindrical projection system (VISOR: Virtual and Interactive Simulation of Reality) in our Virtual Reality Systems (VRS) Laboratory. The system consists of three projectors which display the virtual world onto a 6-m wide semi-cylindrical screen canvas. The user is positioned slightly off centre towards the canvas to allow a 160° field of view (FOV).


Precision Position Tracker (PPT)

The WorldViz™ PPT system has been used as an alternative to the sensor shirt, although both sensing systems are connected to the VR system. The PPT consists of four CCD cameras (Fig. 2). Two cameras were mounted on the projectors as shown in Fig. 2 and two were mounted on the top of the projection canvas screen. These CCD cameras are capable of tracking infrared emitting diodes (IREDs), as shown in Fig. 3. The displacement of the IREDs in Euclidean space was mapped to the two control signals of the robotic wheelchair. Consequently, the IREDs were attached to each shoulder of the participant and, by displacing the shoulders, the participant was able to control the 3D robotic wheelchair in VR.

(a) View of virtual reality lab in operation
(b) The 160° spanned 3D fully immersive virtual reality projection system
(c) Precision position tracking system CCD cameras mounted over the projectors

Fig. 2. The 160° spanned virtual reality projection laboratory


(a) IRED battery operated sensors (b) A graphical user interface to track the IRED sensors

Fig. 3. A precision position tracking system for virtual reality environments

Fig. 4. Virtual robotics wheelchair

3D Virtual Robotics Wheelchair

For interoperability, extendability, maintenance and reusability purposes a modular design approach was taken, where each component had separate roles and responsibilities and well-defined interfaces to allow other components to access their functionality. The modular design approach provides a sustainable design where we could (re)use existing third party components and swap components as required. A robotic wheelchair model, shown in Fig. 4, was created using 3D Studio Max and integrated with a hospital-type environment built with coin3D libraries to generate corridors and pathways. The whole system was projected on the screen using WorldViz™ Vizard interactive software.

3D Virtual Robotic Wheelchair’s Kinematics

A non-holonomic mobile robot kinematics study [59] was used to simulate the robotic wheelchair's motion model. The wheelchair is considered as a unicycle type robot [12] having an egocentric axis of rotation, i.e. its centre of


Fig. 5. (a) Virtual wheelchair kinematics model based upon a unicycle robot. (b) Virtual wheelchair's 3D model created using coin3D [4] libraries

Fig. 6. Virtual wheelchair’s position update

gravity is calculated upon the point on the rear differential wheels, as shown in Figs. 5 and 6. The motion kinematic model is as follows. The wheelchair is modelled as a simple two-wheel robot [12, 59], as shown in Fig. 6. The kinematic equations of the wheelchair are

ẋ(t) = v(t) cos(θ(t)),
ẏ(t) = v(t) sin(θ(t)), (2)
θ̇(t) = ω(t).

The kinematic model of the wheelchair has two control inputs: the forward speed v and the angular velocity ω. Therefore, in discrete time, the laws of motion of the wheelchair are

xk+1 = xk + vk cos(θk)∆t,
yk+1 = yk + vk sin(θk)∆t, (3)
θk+1 = θk + ωk∆t.

The two control inputs are generated by processing algorithms applied to the shirt signals. The rotational and translational components of the speed are obtained by scaling two values, Vr and Vf, derived from the subject's


body motions. Accordingly, the virtual wheelchair moves from the actual point (xk, yk) to (xk+1, yk+1) as represented in Fig. 5, in which

∆S = vk∆t = k1 Vf,k ∆t, (4)
∆θ = ωk∆t = k2 Vr,k ∆t, (5)

where k1, k2 are proportionality constants and ∆t is the time interval between two consecutive frames of the virtual reality.
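
A minimal sketch of the discrete-time update in Eqs. (3)–(5) is given below; the values of k1, k2 and ∆t (one VR frame at an assumed 60 Hz) are hypothetical placeholders, not the parameters of the actual system.

    import math

    def wheelchair_step(x, y, theta, v_f, v_r, k1=1.0, k2=1.0, dt=1.0 / 60.0):
        """One discrete-time pose update of the virtual wheelchair (Eqs. 3-5)."""
        v = k1 * v_f                     # forward speed from the body signal, Eq. (4)
        omega = k2 * v_r                 # angular velocity from the body signal, Eq. (5)
        x += v * math.cos(theta) * dt    # Eq. (3), x-update
        y += v * math.sin(theta) * dt    # Eq. (3), y-update
        theta += omega * dt              # Eq. (3), heading update
        return x, y, theta

    # Hypothetical usage: a constant body signal drives a gentle left arc.
    pose = (0.0, 0.0, 0.0)
    for _ in range(120):                 # two seconds at 60 frames per second
        pose = wheelchair_step(*pose, v_f=0.5, v_r=0.2)
    print(pose)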

2.2 Next Generation Sensor Shirt: Smart Garment

To capture the residual mobility of the disabled patient, we have used a next generation smart garment, the "Sensor Shirt". The sensor shirt (Fig. 7) has been realized by directly printing a conductive elastomer (CE) material (a commercial

(a) Front view of the 52-sensor laden wearable shirt
(b) Back view of the sensor laden shirt

Fig. 7. Sensor shirt


product provided by Wacker LTD [5]) on a lycra/cotton fabric previously covered by an adhesive mask. The mask adopted to realise the sensor shirt is shown in Fig. 8 and is designed according to the desired sensor and connection topology. CE composites show piezoresistive properties [60, 64] when a deformation is applied [49]. CE materials can be applied to the fabric or to other flexible substrates. They can be employed as strain sensors [35, 36] and they represent an excellent trade-off between transduction properties and the possibility of integration in textiles. Quasi-static and dynamic sensor characterisation has been done in [35]. Dynamic CE sensors present peculiar characteristics, such as non-linearity in resistance-to-length transduction and large relaxation times [48, 64], which should be taken into account in the control formulation.

Sensor Shirt Layout

The sensor shirt is divided into six sections, as shown in Fig. 7 and Table 1. In each shirt section, sensors are connected in series and are represented by the wider lines of Fig. 8. Connections between the sensors and the electronic acquisition unit are represented by the thinner lines of Fig. 8. Since connections are realised with the same material adopted for the sensors, they exhibit an unknown and unpredictable change in electrical resistance when the user moves. For this reason the acquisition unit front-end has been designed to compensate for the connection resistance variations. The sensor series is supplied with a constant current I and the voltage drops across consecutive connections are acquired using high input impedance amplifiers (instrumentation amplifiers), following the methodology of [62]. Let us consider the example of sensor Sll 3 (the prototype electrical model and the acquisition strategy are shown in Fig. 8). This sensor is placed in the left wrist region of the shirt and is represented by the light-blue line in Fig. 8. The connections to this sensor are represented in Fig. 8 by the two green lines. If the amplifier is connected between Cll 3 and Cll 4, only a small amount of current flows through the interconnections compared to the current that flows through Sll 3. In this way, if the current I is well dimensioned, the voltage read by the amplifier is almost

Table 1. Sensor shirt layout

Body part             Left side                   Right side

Front shoulder (fs)   6 sensors: Slfs 1–Slfs 6    6 sensors: Srfs 1–Srfs 6
Back shoulder (bs)    8 sensors: Slbs 1–Slbs 8    8 sensors: Srbs 1–Srbs 8
Limb (l)              12 sensors: Sll 1–Sll 12    12 sensors: Srl 1–Srl 12

Total sensors         26                          26



Fig. 8. (a) The mask used for the sensor shirt realization. The sensor Sll 3 placed on the left wrist (light blue line) and its connections (green lines) are pointed out. (b) Prototype (limb) electric model and acquisition strategy

equal to the voltage drop on the sensor, which is proportional to the sample resistance. In conclusion, a generic sensor consists of a segment of the bold track between two consecutive thin track intersections.
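
As a toy illustration of this read-out scheme: with a well-dimensioned constant current I through the series, each amplifier voltage is approximately I times the resistance of one sensor segment, so the per-sensor resistance follows from Ohm's law. The current and voltage values below are hypothetical.

    # Toy illustration of the constant-current series read-out (values assumed).
    I = 1e-3                                  # supply current in amperes
    amplifier_voltages = [0.52, 0.48, 0.61]   # volts across consecutive taps

    resistances = [v / I for v in amplifier_voltages]   # Ohm's law per sensor
    print(resistances)                        # -> [520.0, 480.0, 610.0] ohms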

Signal Acquisition

Two customised electronic acquisition units (one for the left side and the other for the right side) were designed to acquire signals from the sensor shirt. Each unit consists of three signal generators (needed to supply the sensor series with the constant current), 32 instrumentation amplifiers (needed to read the voltages across the sensors) and a final stage for signal low pass filtering. The analog signals acquired from the two units are digitised using a general purpose 64 channel acquisition card and processed in real-time using a personal computer. Real-time signal processing has been performed by using the xPC-Target® toolbox of Matlab®. The outputs of the signal processing stage, i.e. the wheelchair controls, are sent to the virtual wheelchair described in the section below using a user datagram protocol (UDP) connection.
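
The chapter only states that the two wheelchair controls travel over a UDP connection; the short sketch below shows what such a sender could look like in Python. The host, port and packet layout are assumptions, not the actual protocol of the system.

    import socket
    import struct

    VR_HOST, VR_PORT = "127.0.0.1", 5005      # assumed address of the VR application

    def send_controls(sock, v, omega):
        """Pack the two wheelchair controls as two little-endian floats and send them."""
        sock.sendto(struct.pack("<2f", v, omega), (VR_HOST, VR_PORT))

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_controls(sock, v=0.5, omega=0.2)     # hypothetical control values
    sock.close()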

3 A Virtual Reality Rehabilitation Testbed for Spinal Cord Injured Patients

A testbed for rehabilitation purposes, especially for spinal cord injured (SCI) patients, has been designed with the help of a VR interactive system. The novelty of the testbed lies in the fact that we mapped [39] the redundant body signals, i.e. the left-over mobility of the SCI patients, to control the translational velocity (v) and angular velocity (ω) of the wheelchair. Likewise, it is also possible to derive the translational acceleration (v̇) and angular acceleration (ω̇)


from the residual mobility to drive the virtual wheelchair inside the virtual reality scene. Two different approaches have been used to test the efficacy of the system and in both cases encouraging results were obtained.

Virtual Navigation with Precision Position Tracking (PPT) System

For PPT system navigation, we attached the sensors to the shoulders of the patient as shown in Fig. 9. The participant was asked to move his shoulders forward and backward to calibrate the forward–backward and left–right movement of the virtual wheelchair. Once the sensors were calibrated, participants were immersed in the virtual scene consisting of pathways, corridors and rooms. A dark line was painted on the floor for the participant to follow in the virtual scene. The participant was able to navigate in the environment reasonably well using arm and shoulder movements with minimal practice.

(a) PPT IRED sensors attached to the participant's shoulder
(b) Participant controlling the robotic wheelchair through shoulder movements
(c) The control strategy of a real wheelchair is adopted to control the virtual wheelchair in 3D VR
(d) Trajectory obtained from the participant's data while training in virtual reality

Fig. 9. Navigating a virtual wheelchair simulator in 3D virtual reality with the precision position tracking system


Virtual Navigation with Novel Sensor Shirt

The sensor shirt shown in Fig. 10 was worn by the participant. The left-over body mobility signals were captured by the acquisition system. The participant was asked to make comfortable body movements using their available mobility range. These comfortable body postures were then mapped into the control signals of the wheelchair, i.e. forward–backward and left–right positions.

(a) Remapping of mobility and the control strategy: rest position (t = 0) and right elbow, left shoulder, right shoulder and left elbow movements at subsequent times
(b) Sensor shirt control in virtual reality
(c) Redundant body signals of the two limbs (right and left limb, over time in seconds)

Fig. 10. Navigation in virtual reality while wearing sensor shirt


(a) Virtual reality environment (b) Top view of the trajectory in virtual reality

(c) Trajectory obtained from the participant's data in virtual reality, in the first few trials (mean error = 4.7981)
(d) Trajectory data of the participant after the learning phase (mean error = 1.847)

Fig. 11. Navigating a virtual wheelchair simulator in 3D virtual reality with the precision position tracking system

Again, the participant was fully immersed in the 3D VR scene, similar to the one described in the first experiment. After minimal practice, the participant was able to comfortably control the wheelchair; Figs. 10 and 11 show the path made by the participant, after little practice, along the marked line of the virtual reality scene.

4 Conclusion

The amalgamation of robotics technology, intelligent interfaces and 3D immersive virtual reality may lead to the development of a whole new approach to the design of assistive devices. This approach is based on the key concept that the burden of learning to control such devices should not fall entirely on the patient. The field of multimedia and machine learning has been rapidly


developing in the recent decade and is now sufficiently mature to design interfaces that are capable of learning the user as the user is learning to operate the device. In this case, "learning the user" means learning the degrees of freedom that the patient is capable of moving most efficiently and mapping these degrees of freedom to wheelchair movements. We should stress that such a mapping cannot be static, because in some cases the patients will eventually improve with practice. In other, more unfortunate cases, a disability may progressively degenerate and the patient's mobility may deteriorate as a result. We have applied and tested the rehabilitation process in virtual reality via onboard and off-board sensing. The mapping of body movements in virtual reality in an opportunistic way has been shown in [13, 24, 25] and we intend to apply the same techniques in our future experiments. Our approach takes technology towards the patient rather than the patient towards the technology.

Acknowledgment

This research has been approved by the ethics committee of Macquarie University, Sydney, Australia, under the human research ethics act of New South Wales, Australia, in approval letter No. HE23FEB2007-D05008, titled Personal Augmented Reality and Immersive System based Body Machine Interface (PARIS based BMI).

References

1. Australian Bureau of Statistics (ABS). 1998 Disability, Ageing and Carers, Australia: Confidentialised unit record file. Technical paper. Canberra: ABS, 1999.

2. Australian Bureau of Statistics (ABS). Population projections Australia: 1999 to 2101. Canberra: ABS, 2000 (Catalogue No. 3222.0).

3. Australian Institute of Health and Welfare (AIHW). Disability and ageing: Australian population patterns and implications. Canberra: AIHW, 2000 (AIHW Catalogue No. DIS 19).

4. Coin3D graphics library. www.coin3d.org.

5. Elastosil LR3162. www.wacker.com.

6. World Health Organisation. World report on disability and rehabilitation. Concept Paper, World Health Organisation, 2006.

7. C. Baker, J.B. Tenenbaum, and R.R. Saxe. Bayesian models of human action understanding. Advances in Neural Information Processing Systems, 18, 2006.

8. G.A. Barnard and Thomas Bayes. Studies in the history of probability and statistics: IX. Thomas Bayes's essay towards solving a problem in the doctrine of chances. Biometrika, 45:293–315, 1958.

9. T.V.P. Bliss and G.L. Collingridge. A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361:31–39, 1993.

10. D.J. Brown, S.J. Kerr, and V. Bayon. The development of the virtual city: A user centered approach. In 2nd European Conference on Disability, Virtual Reality and Associated Techniques, Mount Billingen, Skövde, Sweden, September 1998.


11. A. Buckert-Donelson. Heads-up products: Virtual worlds ease dental patients. VR World, 3:9–16, 1995.

12. K. ByungMoon and T. Panagiotis. Controllers for unicycle-type wheeled robots: Some theoretical results and experimental validation. IEEE Transactions on Robotics and Automation, 18(3):294–307, 2002.

13. S. Challa, T. Gulrez, Z. Chazcko, and T. Paranesha. Opportunistic information fusion: A new paradigm for next generation networked sensing systems. In 8th IEEE International Conference on Information Fusion, Philadelphia, USA, 2005.

14. C. Christiansen, B. Abreu, K. Ottenbacher, K. Huffman, B. Masel, and R. Culpepper. Task performance in virtual environments used for cognitive rehabilitation after traumatic brain injury. Archives of Physical Medicine and Rehabilitation, 79:888–892, 1998.

15. M.E. Clynes and N.S. Kline. Cyborgs and space. Astronautics, American Rocket Society, 14:26–27, 1960.

16. M.A. Conditt and F.A. Mussa-Ivaldi. Central representation of time during motor learning. Proceedings of the National Academy of Sciences USA, 96:11625–11630, 1999.

17. J.P. Donoghue. Connecting cortex to machines: Recent advances in brain interfaces. Nature Neuroscience Reviews, 5:1085–1088, 2002.

18. L. Fehr, W.E. Langbein, and S.B. Skaar. Adequacy of power wheelchair control interfaces for persons with severe disabilities: A clinical survey. Journal of Rehabilitation Research and Development, 37:353–360, 2000.

19. C.C. Flynn and C.M. Clark. Rehabilitation technology: Assessment practices in vocational agencies. Assistive Technology, 7:111–118, 1995.

20. T.F. Freund and G. Buzsaki. Interneurons of the hippocampus. Hippocampus, 6:347–470, 1996.

21. F. Gandolfo, F.A. Mussa-Ivaldi, and E. Bizzi. Motor learning by field approximation. Proceedings of the National Academy of Sciences USA, 93:3843–3846, 1996.

22. C.L. Giles, D. Cameron, and M. Crotty. Disability in older Australians: Projections for 2006–2031. Medical Journal of Australia, 179:130–133, 2003.

23. T.L. Griffiths and J.B. Tenenbaum. Statistics and the Bayesian mind. Significance, 3(3):130–133, 2006.

24. T. Gulrez and S. Challa. Sensor relevance validation for autonomous mobile robot navigation. In IEEE Conference on Robotics, Automation and Mechatronics (RAM), Bangkok, Thailand, June 7–9, 2006.

25. T. Gulrez, S. Challa, T. Yaqub, and J. Katupitiya. Relevant opportunistic information extraction scheduling in heterogeneous sensor networks. In 1st IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Mexico City, 2005.

26. H.G. Hoffman, J.N. Doctor, D.R. Patterson, G.J. Carrougher, and T.A.I. Furness. Use of virtual reality for adjunctive treatment of adolescent burn pain during wound care: A case report. Pain, 85:305–309, 2000.

27. N. Hogan, H. Krebs, J. Charnnarong, P. Srikrishna, and A. Sharon. MIT-MANUS: A workstation for manual therapy and training II. In SPIE Conference on Telemanipulator Technologies, pages 28–34, 1992.

28. A. Johnson, D. Sandin, G. Dawe, Z. Qiu, and D. Plepys. Developing the PARIS: Using the CAVE to prototype a new VR display. In Proceedings of IPT 2000, Ames, Iowa, USA, June 2000.


29. K. Kording and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.

30. M. Kavakli and M. Lloyd. SpaceEngine: A seamless simulation system for virtual presence in space. In Innovations in Intelligent Systems and Applications, IEEE Computational Intelligence Society, pages 231–233, Yildiz Technical University, Istanbul, Turkey, 2005.

31. J. O'Keefe and L. Nadel. The Hippocampus as a Cognitive Map. Oxford University Press, New York, 1978.

32. K. Kording and D. Wolpert. Bayesian decision theory in sensorimotor control. Review – Trends in Cognitive Sciences, 10:319–326, 2006.

33. J.W. Krakauer and R. Shadmehr. Consolidation of motor memory. Review – Trends in Neurosciences, 29:58–64, 2006.

34. A. Kubler. Brain computer communication: Unlocking the locked. Psychological Bulletin, 127:358–375, 2001.

35. F. Lorussi, W. Rocchia, E.P. Scilingo, A. Tognetti, and D. De Rossi. Wearable redundant fabric-based sensor arrays for reconstruction of body segment posture. IEEE Sensors Journal, 4(6):807–818, 2004.

36. F. Lorussi, E.P. Scilingo, M. Tesconi, A. Tognetti, and D. De Rossi. Strain sensing fabric for hand posture and gesture monitoring. IEEE Transactions on Information Technology in Biomedicine, 9(3):372–381, 2005.

38. Medical, Readiness, and Trainer-Team. Immersive virtual reality platform formedical training: A “killer-application”. In Medicine Meets Virtual Reality 2000,pages 207–213, Burke, Virginia, USA, 2000.

39. C. Mercier, K. Reilly, C. Vargas, A. Aballea, and A. Srigu. Mapping phan-tom movement representations in the motor cortex of amputees. Brain, 129:2202–2210, 2006.

40. F.A. Mussa-Ivaldi and E. Bizzi. Motor learning through the combinationof primitives. Philosophical Transcations of Royal Society of London, 355:1755–1769, 2000.

41. F.A. Mussa-Ivaldi, A. Fishbach, T. Gulrez, A. Tognetti, and D. De, Rossi.Remapping the residual motor space of spinal-cord injured patients for the con-trol of assistive devices. In Neuroscience 2006, Atlanta, GA, USA, October14–18, 2006.

42. F.A. Mussa-Ivaldi, N. Hogan, and E. Bizzi. Neural, mechanical, and geo-metric factors subserving arm posture in humans. Journal of Neuroscience,5:2732–2743, 1985.

43. F.A. Mussa-Ivaldi and L.E. Miller. Brain machine interfaces: Computationaldemands and clinical needs meet basic neuroscience. Review, Trends in Neuro-science, 26:329–334, 2003.

44. F.A. Mussa-Ivaldi and S. Solla. Neural primitives for motion control. IEEEJournal of Oceanic Engineering, 29:640–650, 2004.

45. M. Oshuga, F. Tatsuno, K. Shimono, K. Hirasawa, H. Oyama, and H. Okamura.Development of a bedside wellness system. Cyberpsychology and Behavior,1:105–111, 1998.


46. J. Patton and F. Mussa-Ivaldi. Robotic teaching by exploiting the nervous system's adaptive mechanisms. In 7th International Conference on Rehabilitation Robotics (ICORR), Evry, France, 2001.

47. W. Penfield and T. Rasmussen. The Cerebral Cortex of Man: A Clinical Study of Localisation of Function. Macmillan, New York, 1950.

48. Wang Peng, Xu Feng, Ding Tianhuai, and Qin Yuanzhen. Time dependence of electrical resistivity under uniaxial pressures for carbon black/polymer composites. Journal of Materials Science, 39(15), 2004.

49. Wang Peng, Ding Tianhuai, Xu Feng, and Qin Yuanzhen. Piezoresistivity of conductive composites filled by carbon black particles. Acta Materiae Compositae Sinica, 21(6):34–38, 2004.

50. B.E. Pfingst. Neural Prostheses for Restoration of Sensory and Motor Function, J.K. Chapin and K.A. Moxon (eds.). CRC, Boca Raton, 2000.

51. R.G. Platts and M.H. Fraser. Assistive technology in the rehabilitation of patients with high spinal cord injury lesions. Paraplegia, 31:280–287, 1993.

52. M.W. Post, F.W. van Asbeck, A.J. van Dijk, and A.J. Schrijvers. Spinal cord injury rehabilitation: 3 functional outcomes. Archives of Physical Medicine and Rehabilitation, 87:59–64, 1997.

53. L. Pugnetti, L. Mendozzi, E. Barbieri, F. Rose, and E. Attree. Nervous system correlates of virtual reality experience. In First European Conference on Disability, Virtual Reality and Associated Technology, pages 239–246, Maidenhead, UK: The University of Reading, July 1996.

54. G. Riva and L. Melis. Virtual reality for the treatment of body image disturbance. In Virtual Reality in Neuro-Psycho-Physiology, Amsterdam, 1997.

55. B.O. Rothbaum, L.F. Hodges, R. Alarcon, D. Ready, F. Shahar, K. Graap, J. Pair, P. Hebert, D. Gotz, B. Wills, and D. Baltzell. Virtual reality exposure therapy for PTSD Vietnam veterans: A case study. Journal of Traumatic Stress, 12:263–272, 1999.

56. B.O. Rothbaum, L.F. Hodges, and R. Kooper. Virtual reality exposure therapy. Journal of Psychotherapy Practice and Research, 6:291–296, 1997.

57. R.M. Satava. Medical virtual reality: The current status of the future. In The Medicine Meets Virtual Reality 4th Conference, pages 100–106, Berlin, Germany, September 1996.

58. R. Shadmehr and F.A. Mussa-Ivaldi. Adaptive representation of dynamics dur-ing learning of a motor task. Journal of Neuroscience, 14(5):3208–3224, 1994.

59. S. Takezawa, T. Gulrez, C.D. Herath, and W.M. Dissanayake. Environmentalrecognition for autonomous robot using slam. real time path planning withdynamical localised voronoi division. International Journal of Japan Society ofMechanical Engineering (JSME), 3:904–911, 2005.

60. M. Taya, W.J. Kim, and K. Ono. Piezoresistivity of a short fiber/elastomermatrix composite. Mechanics of Materials, 28(3):53–59, 1998.

61. J.B. Tenenbaum and T.L. Griffiths. Generalization, similarity, and bayesian in-ference. Behavioral and Brain Sciences, 24:629–641, 2001.

62. A. Tognetti, F. Lorussi, R. Bartalesi, S. Quaglini, M. Tesconi, G. Zupone, andD. De Rossi. Wearable kinesthetic system for capturing and classifying upperlimb gesture in post-stroke rehabilitation. Journal of NeuroEngineering andRehabilitation, 2(8), 2005.

Page 130: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

118 T. Gulrez et al.

63. D.M. Wolpert, Z. Ghahramani, and M.I. Jordan. An internal model for senso-rimotor integration. Science, 269:1880–1882, 1995.

64. XiangWu Zhang, Yi Pan, Qiang Zheng, and XiaoSu Yi. Time dependence ofpiezoresistance for the conductor-filled polymer composites. Journal of PolymerScience, 38(21), 2000.

Page 131: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Modelling Interactive Non-Linear Stories

Fabio Zambetta

School of CS&IT, RMIT University, GPO Box 2476V, Melbourne VIC 3001, Australia, [email protected]

Summary. A CRPG (Computer Role Playing Game) is a video game where participants assume the roles of fictional characters and collaboratively create or immerse in stories. Designers of such games usually create complex epic plots for their players to experience, but formal modelling techniques to shape and navigate interactive stories are not currently adopted. In this chapter we build the case for a story-driven approach to the design of CRPGs exploiting a mathematical model of political balance and conflict, and scripting based on fuzzy logic. Our model differs from a standard HCP (Hybrid Control Process) in its use of fuzzy logic (or fuzzy state machines) to handle events, while an ODE (Ordinary Differential Equation) is used to generate a continuous level of conflict over time. Ultimately, using this approach not only can game designers express gameplay properties formally using quasi-natural language, but they can also propose a diverse role-playing experience to their players. The interactive game stories designed with this methodology can change under the pressure of a variable political balance, and propose a different and innovative gameplay style.

1 Introduction

Computer Role Playing Games allow participants to assume the roles of fictional characters and collaboratively create or unravel complex storylines. Game developers usually face a complex task, which consists in providing players with a compelling and coherent storyline that exists in an interactive virtual environment. Unfortunately, storylines have traditionally been both linear and deterministic in media such as books, movies, etc., whereas interactive environments are intrinsically non-linear and non-deterministic. This fundamental oxymoron has fuelled considerable interest, leading researchers and practitioners in the videogames area to coin the term “interactive storytelling” [6]. Formal modelling techniques to shape and design interactive stories are not currently widespread, and that also partially contributes to the exorbitant budgets required to build a CRPG.


The objective of our research lies on one hand in exploiting dynamical models to lead to a more formal design process for (story-driven) games [13], and on the other in improving the current approaches to interactive storytelling. It is generally assumed that stories and drama are generated by conflict, as detailed by Aristotle a very long time ago [11]. Therefore, we envisage an extension to story-driven games, where not only can players influence the game story, but also the story itself can change under the pressure of political balance. Our work is rooted in Richardson’s dynamical model of arms races [17], devised to analyze the causes of international conflicts and initially applied by Richardson to the World War I scenario. Our modification of Richardson’s faction model brings numerous improvements over the standard faction models currently used in RPG games.

First and foremost, such an approach makes more options available to RPG designers, enabling the creation of different types of stories that integrate political considerations in the plot itself and extend the usual story-driven approaches. By simply varying the basic parameters of the core model, many scenarios can be created, each corresponding to a different political status quo (e.g., tense relationships, truce, initial friendly relations, etc.), which can in turn evolve over time into different types of equilibria. Secondly, players’ choices will impact the in-game political balance, but at the same time the plot will evolve under the pressure of political events, giving rise to a novel gameplay style. The scenario we are working on has been dubbed Two Families. Players take the side of one of two influential families in the fight for supremacy in a fictional city, and decide whether they want to further their faction’s political agenda or act as a maverick, thus contributing to alter the political balance.

The remainder of the chapter is organized as follows: Sect. 2 describes some related work; Sect. 3 introduces dynamical models and describes the modified Richardson’s model used to compute a political balance among factions; Sect. 4 details the most relevant scenarios of use for the model; Sect. 5 introduces our prototype and the results obtained so far, while Sect. 6 finally outlines our future work.

2 Related Work

The games industry has been quite resistant to the use of advanced intelligent techniques, relegating game AI to efficient but non-scalable computing devices such as FSMs (Finite State Machines) or RBSs (Rule Based Systems). The reasons for this are essentially twofold: On one hand, before the introduction of GPUs (Graphics Processing Units), most of the CPU time in an application was devoted to graphics and rendering. On the other hand, a common misconception in the games industry has been that learning and adaptation of characters to their worlds could lead to chaotic and unpredictable behaviours [9]. Fortunately, with the introduction of GPUs and multi-core CPUs, game AI can play a bigger role in the game development process. Moreover, a few game development and research teams have started to demonstrate that the latter argument is unfounded.

The work of the Synthetic Characters Lab at MIT [5] represents a fundamental step towards a practical and stable approach to real-time learning for virtual characters modelling animals, especially dogs and sheep. In their formulation, characters’ action selection is driven by a set of rules called ActionTuples, which are comprised of different parts and tend to generalize well-known approaches to the solution of the reinforcement learning problem such as TD(λ) or Q-learning [20]. The TriggerContext indicates external conditions that must be met in order for the ActionTuple to be activated; the Action represents what the creature should do if the ActionTuple is active; the ObjectContext describes on what objects the Action can be applied; the doUntilContext describes the conditions that cause the ActionTuple to deactivate; the Results slot contains Predictors, trying to estimate (within a confidence level) what event will occur next; the Intrinsic Value is a multi-dimensional value describing the ActionTuple’s perceived effect on the creature’s internal drives. Although the ActionTuple can create some interesting behaviour for a synthetic animal and provide a basic form of adaptiveness, it cannot learn how to satisfy high-level goals. For example, a synthetic dog cannot learn the shepherd’s intention to move the sheep south.

Lionhead Studio’s Black & White [9], a so-called “god-game”, features intelligent learning creatures as an integral part of the storyline to be unravelled in the game. The player takes the role of a divinity who can train his creature, his emissary in the world, to perform tasks on his behalf and ultimately expand his community of worshippers. The creatures use a classic BDI (Belief, Desire, Intention) architecture [4], augmented in an ingenious way: A creature can have Opinions about what objects are most suitable for satisfying different desires. These Opinions are implemented as decision trees, whereas desires are implemented via perceptrons, each with a number of different desire sources. Creatures then deliberate about the most important goal and the most important type of object to act on. The crucial innovation introduced by Black & White is the use of heterogeneous sources of learning: Not only do creatures employ reinforcement learning by means of direct feedback (the player rewards or punishes them), but they also learn by example. Creatures observe the action performed by the player first, make a guess at his goals, and then construct a belief about the object he was acting on: This way the creature can escape local minima it would probably incur otherwise (i.e., it can avoid a rigid behavioural routine that would lead to stagnation or premature standardization). There are two limitations to this creature architecture: Creatures can plan at the goal (high) level, but once a goal has been chosen along with a suitable object, the appropriate action for satisfying that goal is found in a precomputed plan library and no dynamic planning can be exploited to fulfil the goal in an alternative way. On top of that, only 40 desires are available to the agent and no mechanism is provided to construct new ones.


A consistent number of commercial video games have started to use fuzzy logic instead, most notably best-seller titles such as The Sims or the strategy game Civilization: Its main use is the real-time control of autonomous agents (The Sims) or strategic decision making (Civilization and other similar titles). The appeal of fuzzy logic mainly stems from its human-readable form, which makes it a perfect candidate for manipulation by non-technical folk such as game designers. Moreover, fuzzy logic may be combined with an FSM, whose use is ubiquitous in the games industry: FuFSMs, or Fuzzy Finite State Machines [12], have in fact been used in the already cited The Sims.

Story-driven games [7] (the focus of our contribution) have not risen to the challenge of integrating intelligent techniques, the main reason being the largely adopted gameplay style. These games tend to convey experiences to players that depend on very deep (but linear) pieces of digital narrative: In such conditions most of the interaction can be engineered via scripting functions implemented in a scripting language of choice. However, such solutions are not scalable and lead to extremely high production costs: A switch to advanced intelligent techniques would benefit the games industry as a whole. The most relevant example in this area is Facade, an experiment in interactive drama [14]. The “player” plays the character of a friend of Grace and Trip, an attractive and successful couple in their thirties. At an evening get-together at their apartment, the player will witness the high-conflict dissolution of Grace and Trip’s marriage. Players may contribute their own actions to change the course of the couple’s lives and how the whole drama unfolds. Facade shares the same motivations as our work, i.e., finding a middle ground between very structured and constrained game narrative, and sandbox (or strategic) games where agency is of paramount importance but the emerging narrative can seldom qualify as such (e.g., it is not dramatic). Their means to achieve this goal differ quite substantially from ours, as does the scenario chosen to showcase their technical infrastructure. Because their scenario is primarily based around conversational agents, they implemented ABL (A Behaviour Language) and NLU (Natural Language Understanding) components for behaviour planning and language understanding. The most interesting component in their architecture is a drama manager which rearranges elementary components of narrative named beats (dubbed scenes in our approach) to achieve a dramatic effect. Our approach relies on the use of fuzzy rule sets to manage conflict, which is intrinsically tied to drama [11]. Ultimately, both approaches aim at recombining chunks of narrative in a way that preserves the dramatic significance of the storyline, but the means to achieve this are different. However, our use of fuzzy logic may render the game designer’s job easier due to the use of quasi-natural language clauses; also, combining an HCP with a fuzzy rule base allows us to include more gameplay open-endedness. The use of an HCP can in fact be generalized and used to manage a potentially infinite number of gameplay features.


3 Improving Richardson’s Model of Arms Race

Richardson’s arms race model was developed by Lewis Fry Richardson to predict whether an arms race between two alliances was to become a prelude to a conflict. The original model consists of a system of two linear differential equations, but it can be easily generalized to a multi-dimensional case [17]. Richardson’s assumptions about the model are given below:

• Arms tend to accumulate because of mutual fear.
• A society will generally oppose a constant increase in arms expenditures.
• There are factors independent of expenditures which conduce to the proliferation of arms.

The actual equations describing this intended behaviour are given as

ẋ = ky − ax + g,
ẏ = lx − by + h. (1)

The values of x and y indicate the accumulation of arms for each nation. Clearly, we can also rewrite the equations in matrix form yielding, with proper substitutions:

ż = Az + r, (2)

where

A = | −a   k |,   z = | x |,   and  r = | g |. (3)
    |  l  −b |        | y |             | h |

The solutions of the system of linear ODEs (Ordinary Differential Equations) [2] do not depend much on the values of the constants, but rather on their relative magnitude, and on the signs of g and h, which, in Richardson’s view, represent the grievance terms. The constants k and l are named the fear constants (mutual fear), a and b are the restraint constants (internal opposition against arms expenditures), and, as already mentioned, g and h are the grievance terms (independent factors, which can be interpreted as grievance against rivals). Note that only g and h are allowed to assume negative values. When analyzing the model, one will need to take into account the optimal lines (where the first derivatives of x and y equal 0), the equilibrium point P* = (x*, y*) where the optimal lines intersect, and the dividing line L* for cases where equilibrium depends on the starting point. Trajectories heading towards positive infinity are said to be going towards an unlimited armament or a runaway arms race, whereas the ones going towards negative infinity are said to be going towards disarmament. There are two general cases that can occur in practice, under the general assumption that det A ≠ 0:

• All trajectories approach a stable point (stable equilibrium, see Fig. 1a).


(a) The system trajectories converge to an equilibrium point.

(b) The system trajectories depend on the initial point, and can lead to different outcomes. The dividing line is also depicted.

Fig. 1. Possible equilibria for the system

• Trajectories depend on the initial point: They can either drift towards positive/negative infinity or approach a stable point if they start on the dividing line (unstable equilibrium, see Fig. 1b).

If ab > kl, we will achieve a stable equilibrium: An equilibrium point is considered stable (for the sake of simplicity we will consider asymptotic stability only) if the system always returns to it after small disturbances. If ab < kl, we will achieve an unstable equilibrium: The system moves away from the equilibrium after small disturbances. We will show that a modified version of the model can produce alternating phases of stability and instability, yielding variable and quantifiable results: This can give rise to a richer simulation of faction dynamics, as alliances can be broken and hostilities ceased temporarily, or even war declared on a permanent basis.

Our investigation is aimed at refining Richardson’s model for use in a CRPG, and has involved three steps: Reinterpreting the model semantics to fit our intended game context, modifying the model to produce a satisfactory representation of interaction among factions, and finally converting the model output to the input used by a classic CRPG faction system (in our case the Neverwinter Nights 1 or 2 faction system).

3.1 Reinterpreting Richardson’s Model Semantics

Even though the model created by Richardson is a viable approach to control overall factions’ behaviour in games, the model was designed with a very coarse level of granularity in mind. Whilst Richardson was interested in the very high-level picture of the reasons behind a conflict, our goal is to give designers the freedom to change a game’s story over time. Hence, we started our analysis by naming two factions X and Y, and by reinterpreting x and y as the (greater than or equal to zero) levels of cooperation of factions X and Y, respectively. We also reinterpreted the parameters of the model as listed in Table 1. The meaning of the parameters is not very different in our version of the model, but increasing values will lead to cooperation instead of conflict.

This change aligns the system with the convention used by the NWN 2 faction system. The level of cooperation of each faction will lead either to a stable equilibrium point P* that yields a steady state of neutrality, or to an unstable equilibrium that will drive the system towards increasing levels of competition/cooperation (decreasing cooperation indicates competition). Without loss of generality, we will concentrate on a restricted context of unstable equilibrium: Richardson’s model will be modified in order to obtain a rich behaviour, and at the same time cater for the interactive scenarios found in modern videogames. Also, we will assume that g and h are negative (indicating that the two factions harbour resentment towards each other).

3.2 Modifying Richardson’s Model

Table 1. The reinterpreted parameter semantics

Parameter  Semantics
k          Faction X belligerence factor
l          Faction Y belligerence factor
a          Faction X pacifism factor
b          Faction Y pacifism factor
g          Friendliness of X towards Y
h          Friendliness of Y towards X

The standard formulation of Richardson’s model in the unstable equilibrium case implies that the final state of the system will be dictated by the initial conditions of the system. The initial condition of the system, a point P in the cooperation plane depicted in Fig. 1a,b, will be such that:

• If P lies in the half-plane above the dividing line L*, then the system will be driven towards infinite cooperation.

• If P lies in the half-plane below the dividing line L*, then the system will be driven towards infinite competition.

• If P lies on the dividing line L*, then the system will be driven towards a stable condition of neutrality.

The problem with this model is that it is uninteresting in an interactive scenario, even though it apparently contains all the main ingredients required to produce a rich behaviour: Once an application starts approximating the solution of the model from its initial condition via an ODE solver [2], the solution will be stubbornly uniform and lead to a single outcome in any given run (any of the three listed above, depending on the initial position of P). To cater for scenarios where PCs (Player Characters) and NPCs (Non-Player Characters) interact with each other in the game world, we developed a stop-and-go version of Richardson’s model: The solution of the system will initially be computed by our ODE solver until an external event is generated in-game. When that happens, the parameters of the model listed in Table 1 are conveniently recomputed, leading to a possible change in the equilibrium of the system: The way parameters are changed allows for the possibility of moving the dividing line L*, thus altering the direction of motion of the current system trajectory. Recalling (3) we have

Anew = λAold,  λ > 0. (4)

Now we want to see how scaling A will influence the equilibrium of the system. To do so, let us first compute the equation of L*, which is the locus of points where the sum of the derivatives of our system goes to zero. The equation of L* will result in

ẋ + ẏ = (ky − ax + g) + (lx − by + h)
      = (l − a)x + (k − b)y + (g + h) (5)
      = 0.

The effect of scaling on A will yield

ẋ + ẏ = λ(l − a)x + λ(k − b)y + (g + h) (6)
      = 0.

Thus, we will finally have

(l − a)x + (k − b)y + (g + h)/λ = 0.


Fig. 2. Effect of scaling A

Three distinct cases will be possible then:

• 0 < λ < 1: L* is moved in its original upper half-plane, giving rise to a possible decrease in cooperation.

• λ = 1: The scale factor does not change A (there is no practical use for this case, though).

• λ > 1: L* is moved in its original lower half-plane, giving rise to a possible increase in cooperation.

To test these claims, the reader need only take a look at Fig. 2, where the case 0 < λ < 1 is depicted. The dividing line is initially L1, and the point describing the trajectory of the system is P: The ODE solver generates increasing values of cooperation, stopping at P1 because an external event has just occurred. At this stage, A gets scaled and, as a result, the new dividing line becomes L2: The new dividing line brings P1 into the lower half-plane, leading to decreasing values of cooperation (increasing competition). Generalizing the considerations inferred from this last example, suppose that initially L1 · P > 0 (increasing cooperation) and that 0 < λ < 1. Then we will have three alternatives when an external event occurs:

• L2 · P1 > 0: The level of cooperation keeps on increasing.
• L2 · P1 < 0: The level of cooperation starts to decrease.
• L2 · P1 = 0: The level of cooperation will move towards a stable value.

Clearly, if L1 · P > 0 and λ > 1 then L2 · P1 > 0. Similar conclusions can be drawn in the case L1 · P < 0.

Hence, any application using our model will need to provide a set (or a hierarchy) of events, along with a relevance level λj, j ∈ {1 . . . M}, that could be either precomputed in a lookup table or generated at runtime (λ-values). Obviously, all the events having λj > 1 will correspond to events that facilitate cooperation, whereas events having 0 < λj < 1 will exacerbate competition.


The effect of the λ-scaling is to change the partitioning of the first quadrant, giving rise from time to time to a bigger half-plane for either cooperation or competition.

Finally, the improved Richardson’s model presented here can be characterized in terms of an HCP (Hybrid Control Problem) [3]. We will not go into much detail to avoid losing the focus of our investigation, but suffice it to say that an HCP is a system involving both continuous dynamics (usually modelled via an ODE) and controls (often modelled via a Finite State Machine). The system possesses memory affecting the vector field, which changes discontinuously in response to external control commands or to hitting specific boundaries: Therefore, it is a natural fit to treat in-game events as control commands.
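The stop-and-go mechanism can be sketched as follows (our own illustration of (4)-(6); the event names and λ-values are hypothetical, not taken from the chapter):

# Sketch (ours) of the stop-and-go lambda-scaling (4).
LAMBDA = {"truce_signed": 1.05, "skirmish": 0.9, "null_event": 1.0}  # hypothetical events

def scale_A(A, event):
    """Scale every entry of A = ((-a, k), (l, -b)) by the event's relevance (4)."""
    lam = LAMBDA[event]
    return [[lam * entry for entry in row] for row in A], lam

def cooperation_trend(x, y, k, l, a, b, g, h, lam=1.0):
    """Sign of x' + y' after scaling A by lambda (cf. (6)): positive means the
    trajectory heads towards more cooperation, negative towards competition."""
    return lam * ((l - a) * x + (k - b) * y) + (g + h)

A = [[-0.3, 0.4], [0.5, -0.2]]   # A = ((-a, k), (l, -b))
A, lam = scale_A(A, "skirmish")  # an in-game event arrives, lambda = 0.9
trend = cooperation_trend(30, 20, k=0.4, l=0.5, a=0.3, b=0.2, g=-1.0, h=-1.5, lam=lam)
print("cooperation increasing" if trend > 0 else "competition increasing")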

3.3 Converting to the Neverwinter Nights 2 Faction System

Converting to the NWN 2 faction system is straightforward once the proper values of cooperation have been computed.

A few function calls are available in NWN Script to adjust the reputation of a single NPC (e.g., AdjustReputation) or of an entire faction (e.g., ga_faction_rep). In NWN 2, faction standings assume a value in the [0, 100] range for each faction: Values in [0, 10] indicate competition (in NWN 2, hostility), whereas values in [90, 100] represent cooperation (in NWN 2, friendship).

The most straightforward conversion possible would simply use x and y as the faction standings for each faction: x would indicate the way NPCs in faction X feel about people in faction Y and vice versa, clamping the values outside the [0, 100] range. Also, a scaling factor that represents the relative importance of each NPC in a faction can be introduced: It is reasonable to expect that more hostility or friendship would be aroused by people in command positions. Hence, if we split a faction (say X, for explanatory purposes) into N different ranks, then we will have some coefficients εi, with i ∈ {1 . . . N}, such that

xNWN = x ∗ εi. (7)
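A conversion along the lines of (7) might look like the sketch below (ours; the rank names and weights are hypothetical):

# Sketch (ours): converting a model cooperation level to an NWN 2 standing.
EPSILON = {"leader": 1.0, "captain": 0.8, "grunt": 0.6}  # hypothetical rank weights

def to_nwn_standing(x, rank):
    """Clamp x to the [0, 100] range accepted by NWN 2, then weight it by rank (7)."""
    return max(0.0, min(100.0, x)) * EPSILON[rank]

print(to_nwn_standing(112.3, "captain"))  # -> 80.0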

4 Scenarios of Use

The conceptual framework our model is based on is illustrated in Fig. 3. The level of cooperation (competition) generated by our model is influenced by players’ actions in-game, but the model will alter the game world perceived by players as well, in a feedback loop. The longer-term applications of our model, and the main drivers for our efforts, have been the navigation and generation of non-linear gameplay. Besides achieving these more complex goals though, we also wish to apply our model to the generation of random encounters in a CRPG like Neverwinter Nights.


Fig. 3. Our model’s conceptual framework

Fig. 4. Representing a game’s non-linear plot

4.1 Navigating Non-Linear Game Narrative

If a game has narrative content arranged in a non-linear story or short episode, we can visualize its structure as a collection of game scenes (see Fig. 4). Each circle represents either a scene of the game where choices lead to multiple paths, or a scene which will just move the storyline along. Also, a start and an end scene will be included.

Fig. 5. Membership functions to model fuzzy cooperation predicates

We envision attaching scripting logic to each of the nodes where a choice is possible, so that alternative paths are taken based on the current level of competition. Thus, our players will be able to experience different subplots as a result of their own actions and strategies. From a pragmatic point of view, the exponential growth of non-linear structures has to be kept under control due to resource implications: A widespread game structure used to preserve non-linear design without leading to unbearable resource consumption is a convexity [10]. Each of the nodes containing scripting logic will incorporate fuzzy rules [21], describing which specific actions should be executed based on the value of fuzzy predicates. We could theoretically use classic logic to express these conditions, but fuzzy logic is very good at expressing formal properties using quasi-natural language. For instance, we might have some scripting logic like below:

IF cooperationX IS LOW THEN Action1

or:

IF cooperation IS AVERAGE THEN Action2

Clearly, opportune fuzzy membership functions are needed; their current setup is depicted in Fig. 5.

The net result will be scripting logic that game designers will be able to use and understand without too much hassle, and which will resemble natural language to some extent.
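To make this concrete, the sketch below (ours; the triangular membership functions merely stand in for the actual setup of Fig. 5) evaluates rules of the above shape:

# Sketch (ours): evaluating fuzzy cooperation predicates over [0, 100].
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

MEMBERSHIP = {  # hypothetical stand-ins for the functions of Fig. 5
    "LOW":     lambda x: tri(x, -1.0, 0.0, 50.0),
    "AVERAGE": lambda x: tri(x, 25.0, 50.0, 75.0),
    "HIGH":    lambda x: tri(x, 50.0, 100.0, 101.0),
}

rules = [("LOW", "Action1"), ("AVERAGE", "Action2")]  # IF cooperationX IS ... THEN ...
cooperationX = 32.0
degree, action = max((MEMBERSHIP[label](cooperationX), act) for label, act in rules)
print(action, "fires with degree", round(degree, 2))  # -> Action1 fires with degree 0.36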

In practice it will be very likely to have conditions that contain both fuzzy cooperation predicates and crisp conditions relating to common in-game events, such as quest completion, item retrieval, etc., in order to trigger scene transitions. Ultimately, the goal we have in mind is to render a new game genre viable, i.e., RPS (Role-Playing Strategic). The best of both worlds, Role-Playing Games and Real-Time Strategy, is pursued here as a blending of the classic story-driven approach familiar to RPG players with strategic gameplay elements.

4.2 Generating Random Encounters in Neverwinter Nights

Random encounters are commonplace in RPGs, for example to attenuate the monotony of traversing very large game areas. Their main potential flaw is that attentive players will not suspend their disbelief, because creatures could at times be spawned without any apparent rationale. Our model can generate values of cooperation/competition over time, and these can be used as cues for the application to inform the random encounter generation process.

Supposing we are in a scenario where players have joined faction X, their actions will cause specific in-game events able to influence the equilibrium of the system. Now, the higher the level of competition of X towards Y, the harder and the more frequent the encounters will be. Also, players will encounter NPCs willing to negotiate truces and/or alliances in case the level of cooperation is sufficiently high, in order to render the interaction more believable and immersive. The way this improved process for random encounter generation can be designed is by using fuzzy rules, describing which class of encounters should be spawned based on the level of cooperation.

Possible rules will resemble this form:

IF coopX IS LOW THEN ENCOUNTER IS HARD

or:

IF coopX IS VERY HIGH THEN NEGOT ENCOUNTER IS EASY

Such a mechanism could be used to deter players from using a pure hack-and-slash strategy, forcing them to solve puzzles and concentrate on the storyline narrated in-game.

It should be noted that NWN 2 already provides five classes of standard encounters (very easy, easy, normal, hard, very hard), but they all implicitly assume players can only take part in hostile encounters. Ultimately, we envision extending the existing set of encounters with five further classes of encounters tailored to negotiation. Moreover, the grain of the classes is coarse, and a proper defuzzification mechanism could use some of the parameters included in the classes (e.g., number of monsters spawned, etc.) to render it finer. As dictated by our conceptual framework, not only will players be able to influence the level of competition in-game, but they will also experience first-hand the effect of the model on the random encounters in the game world.
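A crude defuzzification of this kind might look as follows (our sketch; the class boundaries and negotiation threshold are invented for illustration):

# Sketch (ours): mapping the cooperation level onto encounter classes.
CLASSES = ["very hard", "hard", "normal", "easy", "very easy"]

def encounter_class(coop):
    """Low cooperation (i.e., competition) yields harder hostile encounters."""
    return CLASSES[min(4, int(coop // 20))]

def negotiation_encounter(coop, threshold=80.0):
    """Spawn a negotiation encounter instead when cooperation is high enough."""
    return coop >= threshold

print(encounter_class(12.0), negotiation_encounter(91.0))  # -> very hard True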

4.3 A Tool to Create Non-Linear Stories

A tool to create non-linear stories would allow game designers both to interactively script the game structure, and to make changes to the structure itself. In order to restructure the game narrative, it is foreseen that a more complex language will be needed, one that will not only be able to describe the choices occurring in the storyline, but also script more generic game events. The simplest (and probably most effective) idea we have been thinking about would see the fuzzy rule systems incorporated through an API exposed by a more generic games-friendly scripting language (e.g., Python, Lua, Javascript, etc.).

An example of a language used to script narrative content is given by ABL, a reactive-planning language used to script the beats (dramatic units) in the interactive drama Facade [14]. Even though ABL did a good job in scripting Facade’s dramatic content, it clearly falls short in terms of the complexity of the scriptable actions: All in all, Facade is a piece of interactive drama with a quite sketchy 2D interface, and not a real game (which is what we are really interested in).

Also, people at the University of Alberta proposed an approach based on software patterns to help game designers in story building [8]: ScriptEase, the tool they produced, can be used to automate to some extent the scripting of typical narrative and interaction patterns in Neverwinter Nights. The concept of a formal structure underpinning a story is not new at all, as it was first analyzed at large by Propp in relation to traditional Russian folktales [16]. Despite some criticism of Propp’s work, it is our intention to incorporate the core of its arguments to be able to recombine essential story elements in multiple ways: This could lead to the generation of new storylines, which can then be manually refined by game designers and writers with less effort. Ideal candidates for this task are evolutionary algorithms, whose power of recombination driven by an automatic or semi-automatic fitness procedure has been applied to music [15], graphics [18], and animation [19]. Of course, building a tool to forge non-linear stories is a far-reaching goal outside the scope of our current research, but an intention in our future work.

5 Experimental Results and Discussion

We have not yet built an entire scenario integrating all the features of our model; hence, we are going to present some results obtained by simulating in-game external events via random number generators. We will analyze the solutions generated by the ODE when selecting specific parameter sets. We will examine the cases listed below:

1. The strong impact on the system of Richardson’s model parameter set.
2. The marginal relevance of different starting points.
3. The role of the events’ probability distribution, and the correlation with λ-values.

Moreover, we will provide an example of the interaction between fuzzy rules and the solution computed by the system in a specific scenario: The players are approaching an NPC, and its attitude towards them depends on the current level of competition between their respective factions (and the fuzzy rules). However, before illustrating our results we will provide some necessary clarifications on the experimental data.

Firstly, the system trajectories are constrained to a subset of the first quadrant (I = [0, 100] × [0, 100]). Positive values are needed for both x and y as they represent levels of cooperation. Besides, NWN 2 accepts reputation values in the range [0, 100], with lower values indicating a tendency to conflict and antagonism. Secondly, we assumed that if the cooperation value of any faction falls outside the prescribed range it will first be clamped, and after a certain amount of time reset to random coordinates representing neutrality. This assumption makes sense as we do not want to keep the system in a deadlock for too long a time. The formulas currently used for resetting the system trajectory are

x = 50 + 25 ∗ (0.5 − r),
y = 50 + 25 ∗ (0.5 − r). (8)

Here r is a random number in the [0, 1] range. Clearly, other formulas could be used, but this method produces interesting and robust results. Our ODE solver, implemented using a Runge-Kutta order 2 (or midpoint) method, has been hooked to the OnHeartbeat event in NWN 2 (invoked every 6 s). The state of the system was sampled over 5,000 iterations, resulting in a time span of around 8.3 hours of real time.
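The clamp-then-reset behaviour around (8) is easy to express in code (our sketch; as in (8) as printed, the same draw of r is used for both coordinates):

# Sketch (ours) of the clamping and the neutrality reset (8).
import random

def clamp(v, lo=0.0, hi=100.0):
    return max(lo, min(hi, v))

def reset_to_neutrality():
    r = random.random()  # r uniform in [0, 1]
    return 50 + 25 * (0.5 - r), 50 + 25 * (0.5 - r)

print(clamp(123.4), reset_to_neutrality())  # e.g. 100.0 (47.1, 47.1)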

5.1 ODE Solutions

Changing the fundamental parameters of the model gives rise to the situation depicted in Fig. 6a–c. Increasing the magnitude of the parameters has the effect of causing the system trajectory to bounce more often off the borders and be randomly reset to a new position. In practice, the smaller the coefficients, the more deterministic the system will be. This allows game designers to fine-tune the parameter values to obtain different political scenarios in their storylines, while still being able to predict the average behaviour of the system.

The marginal role played by starting points in the long-term behaviour of the system is no surprise. Given the random nature of the system (induced by external events and the reset mechanism), the starting point becomes a small factor in the whole picture.

On the other hand, a very important role in the system behaviour is played by the events’ probability distribution. We examine a case where only three possible events are allowed: one intensifying the cooperation level, another weakening it, and a last one corresponding to a null event. The effect of this probability distribution is shown in Fig. 7. If we increase the probability of one event over the other, then we will witness the system trajectories gathering either around the origin (utmost competition) or around the opposite corner (total cooperation). This conclusion is true in a probabilistic sense only, because the system can still go through alternating phases. By adjusting the probability distribution, a game designer can adjust the likelihood of a scenario leaning towards cooperation or competition.

Finally, the values of λ for each coefficient play a role similar to that of the probability distribution (see Fig. 8). Intuitively, the probability distribution acts as a set of weights for the λ-values, even though a formal proof of this argument still needs to be provided.


(a) A very simple system trajectory.

(b) A more complex trajectory.

(c) A very complex and non-deterministic system trajectory.

Fig. 6. Increasing the magnitude of the parameters causes the system trajectory to bounce more often off the borders

5.2 The Role of Fuzzy Rules

In Sect. 4.1 we have described an approach to navigating non-linear narrative. We present here a scenario based on those ideas that can shed light on the use of fuzzy rules in our system. We will suppose our ODE is computing a solution over time using a specific parameter set determined using the guidelines given in the previous subsection. Fuzzy rules are created to provide control over the game story progression. The level of competition in the game will be influenced by the events generated by PCs and NPCs, and this in turn will cause the story to be channelled to specific branches whose logic is controlled by the rules (see Fig. 9).

Fig. 7. The effect of a probability distribution P = {0.05, 0.25, 0.7}

Fig. 8. The effect of λ = {0.025, 1.05}

Fig. 9. Different branches of the story are taken because of different levels of cooperation

Page 148: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

136 F. Zambetta

For instance, suppose a specific scene of the game revolves around the relationship between the PC and an influential NPC. This character will tend to approach hostile and violent PCs with a servile disposition, while reacting with hostility to friendly players, perceiving them as weak. Neutral players will be treated with neutral distrust. The rules used in this case are:

IF coopX IS HOSTILE THEN coopY IS FRIENDLY

IF coopX IS FRIENDLY THEN coopY IS HOSTILE

IF coopX IS NEUTRAL THEN coopY IS NEUTRAL

Clearly, coopX is a predicate describing the PC faction’s predisposition towards the NPC, and vice versa for coopY. The fuzzy membership functions used are portrayed in Fig. 5. This simple setup is sufficient to allow for distinct outputs to be generated that result in different routes for the storyline, and hedge operators were not necessary in this specific situation. Figure 10a,b shows the output surface of the fuzzy inference, and an evaluation example.

(a) The output surface.

(b) An evaluation of the fuzzy inference.

Fig. 10. Our fuzzy rules in action


5.3 Discussion

We plan to analyze the output of the ODE in more depth: More classes of events or more complex probability distributions may lead to more interesting behaviour, but possibly at the expense of too much complexity. The interaction between the ODE and the fuzzy rules presented here will be further tested and refined. Ultimately, the approach seems to offer very compelling features that may lead to its adoption in real-world projects:

1. The ODE output produces variable but stable behaviour that can be tweaked at will by game designers and programmers.

2. The fuzzy rules needed to navigate game storylines tend to be simple, and they are easily modified even by game designers because of their expressive power.

3. Fuzzy rules also allow for smooth control over the different routes available in a game story.

6 Conclusions and Future Work

We introduced our modified version of Richardson’s model which, based on a stop-and-go variant, provides game designers with a tool to introduce political scenarios in their story-driven games and game mods [1]. We have discussed the formal properties of the model (which can be more formally regarded as a Hybrid Control Problem), and analyzed some stochastic patterns that are likely to be generated by the factions’ behaviour. We also analyzed how such patterns can interact with a scripting model based on fuzzy rules.

The next step in our work will entail the production of Two Families, a Neverwinter Nights 2 module designed to showcase the properties of our model. Two Families will incorporate both random encounters and a non-linear story as described in this chapter. Clearly, the interaction between the ODE and the fuzzy rules will be further refined and improved to cater for this real-world scenario. Finally, a validation of the whole framework from a user interaction perspective will be conducted.

References

1. Definition of a game mod. http://www.answers.com/topic/mod-computer-gaming.
2. W. Boyce and R. DiPrima. Elementary Differential Equations and Boundary Value Problems. Wiley, Hoboken, 2004.
3. M.S. Branicky. General hybrid dynamical systems: Modeling, analysis, and control. In Hybrid Systems, pages 186–200, 1995.
4. M.E. Bratman. Intentions, Plans, and Practical Reason. Harvard University Press, Cambridge, MA, 1987.
5. R. Burke and B. Blumberg. Using an ethologically-inspired model to learn apparent temporal causality for planning in synthetic creatures. In First International Joint Conference on Autonomous Agents and Multiagent Systems, pages 326–333, 2002.
6. C. Crawford. Chris Crawford on Interactive Storytelling. New Riders, Berkeley, 2003.
7. C. Crawford. Chris Crawford on Game Design. New Riders, Berkeley, 2004.
8. M. Cutumisu, C. Onuczko, D. Szafron, J. Schaeffer, M. McNaughton, T. Roy, J. Siegel, and M. Carbonaro. Evaluating pattern catalogs – the computer games experience. In Proceedings of the 28th International Conference on Software Engineering (ICSE ’06), pages 132–141, 2006.
9. R. Evans. AI Game Programming Wisdom, chapter Varieties of Learning, pages 567–578. Charles River Media, Hingham, 2002.
10. N. Falstein. Introduction to Game Development, chapter Understanding Fun: The Theory of Natural Funativity, pages 71–97. Charles River Media, Hingham, 2005.
11. G. Freytag. Freytag’s Technique of the Drama. Griggs, Boston, 1995.
12. D. Fu and R. Houlette. AI Game Programming Wisdom 2, chapter The ultimate guide to FSMs in games, pages 283–302. Charles River Media, Hingham, 2004.
13. R. Hunicke, M. LeBlanc, and R. Zubek. MDA: A formal approach to game design and game research. In Proceedings of the AAAI-04 Workshop on Challenges in Game AI, pages 1–5, 2004. Available online at http://www.cs.northwestern.edu/hunicke/pubs/MDA.pdf.
14. M. Mateas and A. Stern. Structuring content in the Facade interactive drama architecture. In AIIDE, pages 93–98, 2005.
15. E.R. Miranda and A. Biles, editors. Evolutionary Computer Music. Springer, New York, 2007.
16. V. Propp. Morphology of the Folktale. University of Texas Press, Austin, 1968.
17. L. Richardson. Arms and Insecurity. Boxwood, Pittsburgh, 1960.
18. K. Sims. Artificial evolution for computer graphics. In Proceedings of the SIGGRAPH Conference, pages 319–328, 1991.
19. K. Sims. Evolving virtual creatures. In Proceedings of the SIGGRAPH Conference, pages 15–22, 1994.
20. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
21. L. Zadeh. Outline of a new approach to the analysis of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, 3:28–44, 1973.


A Time Interval String Model for Annotating and Searching Linear Continuous Media

Ken Nakayama1, Kazunori Yamaguchi2, Theodorus Eric Setiadi3, Yoshitake Kobayashi3, Mamoru Maekawa3, Yoshihisa Nitta4, and Akihiko Ohsuga3

1 Institute for Mathematics and Computer Science, Tsuda College, 2-1-1 Tsuda-cho, Kodaira-shi, Tokyo 187-8577, Japan, [email protected]

2 Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan, [email protected]

3 Graduate School of Information Systems, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, Japan, [email protected], [email protected], [email protected], [email protected]

4 Department of Computer Science, Tsuda College, 2-1-1 Tsuda-cho, Kodaira-shi, Tokyo 187-8577, Japan, [email protected]

Summary. Time flow is the distinctive structure of various kinds of data, such as multimedia movies, electrocardiograms, and stock price quotes. To make good use of these data, locating a desired instant or interval along the time axis is indispensable. In addition to domain-specific methods like automatic TV program segmentation, there should be a common means to search these data according to the changes along the time flow.

In this chapter, the I-string and I-regular expression framework is presented, together with some examples and a matching algorithm. I-string is a symbolic string-like annotation model for continuous media which has a virtual continuous branchless time flow. I-regular expression is a pattern language over I-strings, which is an extension of the conventional regular expression for text search. Although continuous media are often treated as a sequence of time-sliced data in practice, the framework adopts a continuous time flow. This abstraction allows the annotation and search query to be independent of low-level implementation details such as frame rate.

1 Introduction

When processing data with time flow, such as movies and continuously observed data from sensors, the order of what happened and the time lengths involved are the most important characteristics. A common model and tools depicting these characteristics will be a good basis for managing such data, allowing the users to concentrate on domain-specific analysis of the data.

In this chapter, the I-string and I-regular expression framework is presented; these are a symbolic string-like annotation model for continuous media and a pattern language over the annotation, respectively.

1.1 Linear Continuous Media

Continuous time flow is one of the most prominent and universal structures of our world. Linear continuous media are data which have a continuous linear (not branching) time flow as their structure, such as multimedia streams and scientific monitoring data. There are various continuous time media. Video and audio streams are used as carriers for a wide range of contents such as news, drama, and music. There also exist domain-specific continuous media such as earthquake waveforms in seismology, electrocardiograms, and financial stock quotes. State-of-the-art technology for capturing, mass storage, and broadband communication makes a considerable amount of such media available.

The real value, beyond the basic availability just as a collection of media, will be in the capability of accumulating knowledge on those media as annotation. This enables searching through the media in response to a variety of users’ requests. Now, the demand is for an effective way of searching the archives for a portion of current interest. To effectively use such bulky archives, a concise way of searching and editing continuous media is indispensable.

1.2 Motivation

The most important characteristics of continuous media are the order of what happened, and the time lengths involved. It is very common to process continuous media depending on such conditions. For example, one may characterize the scoring scene of a videotaped soccer game by “a large white rectangle (the goal gate) appearing in the scene for longer than 1.2 s, followed by slowly changing frames for longer than 2.3 s (slow replay).” One may characterize one’s own “buy signal” for a stock by “the time when the price goes down, or up less than 0.2%, for longer than five business days.”

The process can be divided into domain-specific analysis and order-and-time related analysis. In the above examples, “recognize a white rectangle,” “recognize slowly changing frames,” or “recognize a daily stock price change of less than 0.2% up” are domain specific. On the other hand, conditions like “recognize the time that is longer than. . . ” belong to order-and-time related analysis. If provided with a good framework and tools for order-and-time related processing, users can easily define, modify, and combine these conditions. This allows the user to concentrate on domain-specific analysis. This overview is shown in Fig. 1.

Fig. 1. The overview of the scheme

In computers, continuous media are usually treated as a sequence of discrete data. For example, a movie is a sequence of “frames.” Explicit discrete treatment introduces an undesirable, non-essential artificial quantization in the time flow. This prohibits the clear separation between abstract editing semantics and low-level implementation details such as frame rate. We do not impose such a restriction on the model.

Continuous media are often edited: cut at some point, some part extracted, order changed, and concatenated. Basically, annotations associated with the media should be retained throughout these operations. We would like the model to realize this naturally.

1.3 Design of the Framework

As an intuitively natural form of annotation, we have adopted a linear string that has a clear correspondence to the original continuous media. When cutting and concatenating a continuous media, annotations are retained, too, by performing the operations parallel to the original ones on the media.

To make the annotation independent of the low-level “frame rate,” the annotation string should be virtually continuous, that is, the annotation can be divided at any time position. By abstracting the low-level quantized representation of the time flow, operations defined on the virtual continuous time are applicable to media regardless of their “frame rate.”

Based on the observation that annotation for both an interval of and a specific point of continuous media is necessary, we identify two types of attribute symbols: a symbol for a finite time interval, and a symbol for an instant. For example, when a domain-specific recognizer locates a time position at which to cut the media, annotation on that instant of time is necessary. On the other hand, a recognizer may identify some interval of time in which the media satisfies some conditions, for example, the temperature being below the dew point in the record of a thermal sensor. Thus, annotation for an interval is necessary.

As a pattern language for the annotation string, the conventional regular expression is extended. The conventional regular expression [6] is a commonly used tool for searching character strings. This makes the pattern language easier to understand and learn to use.

This chapter presents a framework in which users can express their intentions easily and naturally. This work provides a complete matching algorithm by extending our previous work [4]. Emphasis is put on the logical expressiveness. We discuss neither a fancy user interface nor a physical access method in this chapter.

2 Annotation and Search Model for Continuous Media

The proposed framework for annotating/manipulating linear continuous media consists of two models: (1) the I-string annotation model, and (2) the I-regular expression search model based on the I-string annotation. The framework provides a concise way of searching and editing the media. An I-string represents a continuous media’s content, reflecting the purpose of the search done on it. I-regular expression is a pattern language for I-strings. Since the syntax and intuitive semantics of I-regular expressions are similar to those of conventional regular expressions, they should be easy to use.

Annotation should be able to refer to some temporal part of a media, since linear continuous media are characterized by their time coordinate. An I-string annotation is in the form of a string which consists of two types of annotation symbols: one for attributes at a specific moment, and the other for a time interval on a media. In the framework, a continuous media is annotated with descriptive information, or attributes, reflecting its content and purpose. These attributes may be given manually or by automatic indexing [9]. The way of expressing raw continuous media data as an I-string is beyond the scope of this work. Attributes may be extracted from the raw data either manually or automatically by image processing, or elaborated by a specialized editor as additional value. So, the assumption that we are going to work on the attributes is not unrealistic, and most existing systems rely on this assumption.


2.1 I-String: D-Symbol and I-Symbol

An I-string is a string of two types of symbols, namely I-symbols and D-symbols. An I-symbol, for example v2.6, has a positive time duration, depicting the attribute for that time interval, while a D-symbol, for example g•, represents the attribute of an instant. Without D-symbols, we would have to assign an artificially small amount of time to an instant event. So, these two types of attributes are mandatory for modeling continuous media naturally.

The suffix of an I-symbol is called the time duration, which represents the time duration of the event and should be a positive real number. For example, the time duration of v2.6 is 2.6. Within the order-and-time-related analysis, each of the symbols like v and g is treated as just a symbol, to which any meaning (attribute) can be associated by domain-specific analyzers, say "the content is drama" for a TV program, or "the position is in a specific area" for the trajectory of a moving robot. The I-string annotation model is illustrated in Fig. 2.

Now, consider video data in which a (commercial) ad. lasts 2 min, drama lasts 10 min, ad. lasts 6 min, drama lasts 12 min, ad. lasts 5 min, drama lasts 14 min, and ad. lasts 6 min, as illustrated in Fig. 3a. Let v be the symbol for the ad. attribute and w for the drama attribute. Suppose that a machine with two states v and w is recorded onto a 55-min movie; the state changes as the time advances. Using the I-symbols v+ and w+, the annotation would be the I-string v2 w10 v6 w12 v5 w14 v6.

Fig. 2. I-string annotation model: a video stream on a continuous time flow is annotated as an I-string; an I-symbol (e.g., a3.9) carries an attribute for that interval together with its time duration, while a D-symbol (e.g., n) carries an attribute for that instant

Fig. 3. Symbolic representation of a continuous media: (a) the I-string v2 w10 v6 w12 v5 w14 v6; (b) the same I-string with the climax event inserted, v2 w10 v6 w12 v5 w9 e• w5 v6


In addition to the state changes, an event occurring at a specific moment can be represented with D-symbols. For example, we may use an attribute e• for marking the climax (Fig. 3b). If the climax event e comes at 5 min from the end of the 14-min drama fragment, the I-symbol w14 is split into w9 and w5, and the D-symbol e• is placed between them, getting w9 e• w5. Notice that the time duration of w9 e• w5 remains 14. In this way, we can construct a string of attributes describing the continuous media. Here, we assume that each attribute is represented by a symbol.

2.2 I-Regular Expression

An I-regular expression is a search pattern against I-strings, which provides a simple but expressive means for continuous media searching. Once continuous media are annotated with I-strings, a variety of searches can be done using I-regular expressions based on those attributes. Suppose that the progress and treatment of a patient is described as an I-string by encoding the patient's condition with a and b, representing "normal" and "critical" respectively, and the giving of a tablet of two types of medication with m• and n•, respectively. Then, an I-regular expression query

b+ (m•m•∗ | n•n•∗) ((a+ | b+)∗)(0, 1.4]    (1)

matches a part of the record where the patient is in the critical condition, then given one or more tablets of one of the two types of medicine, but not mixed, together with a maximum of 1.4 h of the progress after the treatment. (Here, and in the following, the I-constraint interval, e.g., (0, 1.4], is written as a subscript attached to the constrained subexpression.) The search model is illustrated in Fig. 4.

An I-regular expression can specify (1) the order of occurrences of symbols, together with (2) constraints on the time duration of a specific portion. As the domain of time, we adopt the real numbers, and for an interval, we adopt an interval of real numbers. Some systems [8] limit the time domain to the integers implicitly. This may cause inconvenience when media with different frame rates are treated together.

The I-regular expression is an extension of the conventional regular expression [6]. The extensions are (1) I-symbols, which match an arbitrary I-length of the corresponding symbol, and (2) time interval constraints (I-constraints). D-symbols are equivalent to conventional symbols, or string characters. In other words, if you use only D-symbols and all constructs except I-constraints, the result is equivalent to the regular expressions with which you are familiar in text editors. Since the regular expression is commonly used for specifying patterns on character strings, the proposed search framework should be easy to understand for a wide range of users.

The pattern matcher, presented in the later sections, enumerates possible matches of the given I-regular expression and I-string, and extracts the corresponding sub-I-strings if an extraction directive is specified in the I-regular expression.


Fig. 4. Search model using I-regular expression: the pattern b+ (m•m•∗ | n•n•∗) ((a+ | b+)∗)(0, 1.4] with the long directive is matched against an I-string; b+ matches an arbitrary positive time duration (an I-symbol can be divided at any position), m• matches exactly one D-symbol, ∗ denotes repetition of zero or more times, | denotes choice, (0, 1.4] means longer than 0 and shorter than or equal to 1.4, and long requests the longest possible match

2.3 Related Work

In OVID [5], an interval is expressed as a sequence of video frames, and the operations of the interval logic [1] are used for their manipulation. This method is simple, but for a user, to say "the drama lasts 30 min" is better than to say "the drama lasts from frame#:2540 to frame#:3439." So, the frame-oriented time model is not suitable for a user's query description. In OVID, a kind of logical time can be expressed by an attribute "Year" and its value "1974"; however, the time-oriented semantics of the attribute domain is not explicitly stated. In [8], for an event calculus, logical time duration is used. The logical time duration is independent of the underlying sampling rate. This property is suitable for use in queries. Using the logical time, we encode the part in which an attribute a lasts for l min by al.

The finite automaton [6] is a well-known machine model for the regular expression. The extension of the finite automata theory is shown in the last part of Sect. 5. The automaton shown in Sect. 5 is nondeterministic, and the standard determinization procedure known as the subset construction is not applicable, because the alphabet in our model is not a finite set. We develop an effective depth-first search procedure for determining acceptance/rejection in Sect. 6. In the remainder of this chapter, we describe the framework rather formally, so that it serves as a foundation for further study.


3 I-String Annotation Model for Continuous Media

3.1 I-String

An I-string is an annotation model for continuous media. There are two types of annotation symbols: one for an instant, and the other for an interval of time. Let Σ• and Σ+ be mutually disjoint (Σ• ∩ Σ+ = ∅) finite sets of symbols. Each symbol in Σ• is called a D-symbol (discrete), which denotes an event at a specific instant. Each symbol in Σ+ is called an I-symbol (interval), which denotes a state that lasts for some time duration. For clarity, a D-symbol is marked with '•' like m•, n•, while an I-symbol is written with '+' like v+, w+.

An I-string over alphabet (Σ•, Σ+) is a sequence of symbols σ1σ2 · · · σn (σi ∈ Σ• ∪ Σ+, 1 ≤ i ≤ n) with associated time durations d1d2 · · · dn. The time duration is also called I-length. Since a symbol in Σ• denotes an instant, di = 0 if σi ∈ Σ•, while di > 0 for σi ∈ Σ+. Notice that zero and negative time durations are not allowed for symbols in Σ+. The empty string is ε.

As a simple notation for I-strings, we will omit the time duration 0 for symbols in Σ• and write di in place of '+' for symbols in Σ+. For example, when Σ• = {m•, n•} and Σ+ = {v+, w+}, "v5.5v0.2m•n•w1.3" is a shorthand for the following I-string:

i                               1    2    3   4   5
Symbol σi                       v+   v+   m•  n•  w+
I-length (time duration) di     5.5  0.2  0   0   1.3

Some examples of I-string are as follows:

ε, m•, m•m•n•, v1w2, v5.5v0.2m•n•w1.3, m•v3m•

If I-strings α1 and α2 are the same as a sequence of symbols, we denote it by α1 ≡ α2. Two kinds of lengths are defined for an I-string. For an I-string α, its symbol length refers to the number of symbols, while its I-length I(α) is the sum of the time durations. For example, when α = u5.4 u3.5 g• v1.8 m• m• v5.9, its symbol length is 7 and I(α) = 5.4 + 3.5 + 1.8 + 5.9 = 16.6. We assume that an I-string has finite symbol length and finite I-length.
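To make these definitions concrete, here is a minimal sketch (our own illustration, not code from the chapter) of an I-string represented as a list of (symbol, duration) pairs, together with the two length notions from above; the symbol names mirror the running examples.

from math import isclose

# An I-string as a list of (symbol, duration) pairs.  D-symbols (here g, m, n)
# have duration 0; I-symbols (here u, v, w) have a positive duration.
alpha = [("u", 5.4), ("u", 3.5), ("g", 0.0), ("v", 1.8),
         ("m", 0.0), ("m", 0.0), ("v", 5.9)]

def symbol_length(istring):
    """Symbol length: the number of symbols in the I-string."""
    return len(istring)

def i_length(istring):
    """I(alpha): the sum of all time durations."""
    return sum(d for _, d in istring)

assert symbol_length(alpha) == 7
assert isclose(i_length(alpha), 16.6)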

3.2 I-Normal Form of I-String

One of the basic operations necessary for editing I-strings is concatenation and substring extraction. We would like to introduce intuitive naturalness into the interpretation of I-strings. Suppose that we have a 60-min long surveillance movie taken at a traffic crossing. If there is no traffic accident during those 60 min, its annotation would be v60, denoting the "no-accident" situation with v+. If you cut the movie into two short movies of 25 min and 35 min, their annotations should be v25 and v35, respectively. If you concatenate these two short movies, you would expect to get the original 60-min long one, whose annotation should be v60. If v3.2 is followed by v3.8, we see that the attribute v lasts for 7 time units without interruption. So, we may identify v3.2v3.8 with v7.

Table 1. Equivalent I-strings with respect to =

I-string                          Symbol length
v5 m• v1                          3 (minimum; I-normal form)
v2 v3 m• v1                       4
v1 v4 m• v1                       4
v4.5 v0.1 v0.4 m• v0.5 v0.5       6

This suggests that an I-symbol, say v7, should be able to be arbitrarily divided into v3v4 or v1.8v3.1v2.1, or concatenated back into the original I-symbol, as long as its I-length remains the same. To reflect this, we introduce an equivalence relation = over I-strings. Any successive occurrences of the same I-symbol in an I-string, such as vd1vd2 · · · vdm and ve1ve2 · · · ven, are equivalent to each other with respect to = iff their sums of I-lengths are the same:

vd1vd2 · · · vdm = ve1ve2 · · · ven, where d1 + · · · + dm = e1 + · · · + en.    (2)

Among the I-strings equivalent with respect to =, there exists a unique I-string which has the minimum symbol length. We call such an I-string the I-normal form. The I-normal form is obtained by merging all identical I-symbols that appear adjacently. For example, the I-normal form of v4.5 v0.1 v0.4 m• v0.5 v0.5 would be v5 m• v1 (Table 1).

On the contrary, no such relation is defined for successive D-symbols. D-symbols are "not dividable"; that is, the number of occurrences, for example 3 for m•m•m•, is significant.
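As a small illustration of I-normalization (again our own sketch, building on the representation above): adjacent identical I-symbols are merged and their I-lengths summed, while D-symbols are left untouched so that their multiplicity is preserved.

from math import isclose

def i_normal_form(istring, d_symbols):
    """Merge adjacent identical I-symbols; never merge D-symbols."""
    normal = []
    for sym, dur in istring:
        if normal and sym == normal[-1][0] and sym not in d_symbols:
            prev_sym, prev_dur = normal.pop()
            normal.append((sym, prev_dur + dur))   # e.g. v3.2 v3.8 -> v7
        else:
            normal.append((sym, dur))
    return normal

# v4.5 v0.1 v0.4 m v0.5 v0.5  ->  v5 m v1   (cf. Table 1)
a = [("v", 4.5), ("v", 0.1), ("v", 0.4), ("m", 0.0), ("v", 0.5), ("v", 0.5)]
nf = i_normal_form(a, {"g", "m", "n"})
assert [s for s, _ in nf] == ["v", "m", "v"]
assert isclose(nf[0][1], 5.0) and isclose(nf[2][1], 1.0)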

4 I-Regular Expression

4.1 I-Regular Expression and Its Regular Language

An I-regular expression is a pattern for I-strings. An I-regular expression r represents a set of I-strings, L(r), which is called the regular language defined by r. I-regular expressions over alphabet (Σ•, Σ+) are defined recursively as shown in Table 2. The I-regular expression ε matches the empty I-string ε, a D-symbol m• as an I-regular expression matches exactly one occurrence of m• in an I-string, and an I-symbol v+ matches an arbitrary positive time duration of that symbol in an I-string. These primitive I-regular expressions can be combined by choice, concatenation, or repetition operators recursively.

I-symbols and I-constraints are the extensions; the rest is the same as the conventional regular expression. An I-symbol as an I-regular expression matches an arbitrary I-length of that I-symbol in an I-string.


Table 2. Definition of I-regular expression over alphabet (Σ•, Σ+)
(r, r1, and r2 are I-regular expressions)

I-regular expression             Regular language (set of I-strings)
Empty I-string  ε                L(ε) = { ε }
D-symbol  m• ∈ Σ•                L(m•) = { m• }
I-symbol  v+ ∈ Σ+                L(v+) = { vl | 0 < l }
Choice  (r1 | r2)                L((r1 | r2)) = L(r1) ∪ L(r2)
Concatenation  (r1r2)            L((r1r2)) = L(r1) L(r2)
Repetition  (r∗)                 L((r∗)) = L(ε) ∪ L(r) ∪ L(rr) ∪ · · ·
I-constraint  rΛ                 L(rΛ) = { α | α ∈ L(r), I(α) ∈ Λ },
                                 for a non-negative continuous interval Λ
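For readers who prefer code, Table 2 can be mirrored by a small abstract syntax tree; the following is a sketch of our own (the class names and the interval encoding with explicit open/closed flags are assumptions, not the authors' notation).

from dataclasses import dataclass

@dataclass
class Eps:          # empty I-string
    pass

@dataclass
class DSym:         # D-symbol, e.g. m
    name: str

@dataclass
class ISym:         # I-symbol, e.g. v+ (matches any positive duration)
    name: str

@dataclass
class Choice:       # (r1 | r2)
    r1: object
    r2: object

@dataclass
class Concat:       # (r1 r2)
    r1: object
    r2: object

@dataclass
class Star:         # (r*)
    r: object

@dataclass
class IConstraint:  # r restricted so that I(alpha) lies in the interval Lambda
    r: object
    lo: float
    hi: float
    lo_open: bool
    hi_open: bool

# ((v+ | m)*)[15, 15]: disregard m and require a total I-length of exactly 15
pattern = IConstraint(Star(Choice(ISym("v"), DSym("m"))), 15, 15, False, False)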

An I-constraint restricts the I-length of the specified part (sub-I-regular expression) of an I-string. Λ is a non-negative real interval such as (0, 3.1] or [22.9, ∞). Each end point at the interval boundary may be either open or closed, independently. Parentheses may be omitted unless the expression becomes ambiguous. For example, the parentheses in (rs)∗ should not be removed, since it can be confused with r(s∗). If we want to disregard m• and identify v9m•v6 with v15, we can use the following pattern:

((v+ | m•)∗)[15, 15] .    (3)

Some other examples of I-regular expressions and the corresponding regular languages are shown in Table 3. Notice that I-strings are compared based on the equality = in L(r).

4.2 Reference to Substring of Match Result

The use of I-regular expression is twofold: (1) Yes/No query, and (2) substring reference to the matching result. We say "I-string α matches the pattern r" iff α ∈ L(r); otherwise, "α does not match r." The simplest type of query is "Does I-string α match the pattern r?" The answer would be Yes or No. Suppose that there is a movie of a car race between three cars a, b, and c. If we are interested in the change of the leading car during the race, it can be annotated as an I-string over alphabet (Σ•, Σ+) = ({}, {a+, b+, c+}). For example, the annotation might be the I-string below:

c8 a12 b4.5 c1.8 b0.5 a6.3 c11 b14.8 c2 . (4)

"Does b win the game?" The leader at the end of the race is the winner. If the I-string α matches the following I-regular expression, the answer is Yes.

(a+ | b+ | c+)∗b+ . (5)


Table 3. Examples of I-regular expression and its regular language

I-regular expression r            Regular language L(r)
ε                                 {ε}
m•                                {m•}
v+                                {vd | 0 < d}
v+m•v+                            {vd1 m• vd2 | 0 < d1, 0 < d2}
v+ | m•                           {vd, m• | 0 < d}
m•∗                               {ε, m•, m•m•, m•m•m•, . . .}
(v+)[5.7, 5.7]                    {v5.7}
(v+)(0, 2.93)                     {vd | 0 < d < 2.93}
((v+)[0.77, 0.77])∗               {ε, v0.77, v0.77v0.77, v0.77v0.77v0.77, . . .}
                                  = {ε, v0.77, v1.54, v2.31, . . .}
(v+∗)[0.77, 0.77]                 {v0.77}
((v+)(7.18,∞))∗                   {ε, vd11, vd21vd22, vd31vd32vd33, . . . | 7.18 < dij}
                                  = {ε, ve | 7.18 < e}
((v+)(6, 8])∗                     {ε, vd11, vd21vd22, vd31vd32vd33, . . . | 6 < dij ≤ 8}
                                  = {ε, ve1, ve2, ve3 | 6 < e1 ≤ 8, 12 < e2 ≤ 16, 18 < e3}
((v+)[2.51, 2.51] m•)∗ n•         {n•, v2.51m•n•, v2.51m•v2.51m•n•, . . .}
((v+)[2, 2] m•m• (v+)(0, 1))∗     {ε, v2m•m•ve1, v2m•m•vd21m•m•ve2, . . .
                                   | 2 < dij < 3, 0 < ei < 1}

Show me the portion that c grabs the top from b, but b takes it back in less than 3 min.

(a+ | b+ | c+)∗ (b+ (c+)(0, 3) b+) (a+ | b+ | c+)∗ .    (6)

"Does c keep the lead for more than 10 min?"

(a+ | b+ | c+)∗ (c+)(10,∞) (a+ | b+ | c+)∗ .    (7)

For the above query, you might want to watch the scene while c is the leader. For an I-string α ∈ L(r), r can be used to designate substrings of interest for extraction from α. The substring reference and the matching directives are used for this purpose. To refer to a substring which matched a subpattern s, we use the reference

s<X> ,    (8)


where X is an arbitrary name for this reference. For instance, after matching the following I-regular expression, the matched substring can be referred to by X:

(a+ | b+ | c+)∗ ((c+)[10,∞))<X> (a+ | b+ | c+)∗ .    (9)

• Show me the heated battle of b and c.
  – Show me the portion that c grabs the top from b.

(a+ | b+ | c+)∗ (b+c+)<X> (a+ | b+ | c+)∗ .    (10)

– Show me the portion that b or c grabs the top from the other.

(a+ | b+ | c+)∗ (b+c+ | c+b+)<X> (a+ | b+ | c+)∗ .    (11)

• Show me the portion that b or c runs on top, and each keeps the top for less than 10 min.

(a+ | b+ | c+)∗ a+ ((b+ | c+)(0, 10) | U | V)<X> a+ (a+ | b+ | c+)∗ ,    (12)

where

U ≡ (b+)(0, 10) (c+)(0, 10) ((b+)(0, 10) (c+)(0, 10))∗ ((b+)(0, 10) | ε),
V ≡ (c+)(0, 10) (b+)(0, 10) ((c+)(0, 10) (b+)(0, 10))∗ ((c+)(0, 10) | ε).

Intuitively, U represents alternating sequences starting with b+:

{b+c+, b+c+b+, b+c+b+c+, b+c+b+c+b+, . . .},

and V represents similar ones starting with c+.

• Show me the winner.

(a+ | b+ | c+)∗ (a+ | b+ | c+)<X> .    (13)

4.3 Further Examples of I-Regular Expression

Soccer Game

The video of a soccer game can be encoded into an I-string. Let an I-symbol a denote that team A controls the ball, and b that team B does. If neither controls the ball, the I-symbol c is used. A D-symbol g• is used to mark a goal. We assume that the team which controls the ball just before the goal gains the point. For example, the code might be

a8 b4 g• c1 a3 b2 c1 a7 g• c1 a4 b3 a5 b5 c1 .    (14)

Now we show that various queries are expressible in I-regular expression. We use U ≡ (a+ | b+ | c+) to make the expressions easier to understand.

• Show me the first goal of the game.

U∗ (g•)<X> (g• | U)∗ .    (15)

• Show me the first goal of team A.

U∗ (a+g•)<X> (g• | U)∗ .    (16)

• Show me the second goal of the game with 15 s before the goal and 30 s after the goal. The requested range may be truncated if the goal is just after the start of the game or just before the end of the game.

U∗ g• U∗ ((U)(0, 0.25] g• (U)(0, 0.5])<X> (g• | U)∗ .    (17)

• Find the goal in the time-out extension. Equivalently, in I-regular expression we can say "find the goal after 45 min from the start of the game." This I-regular expression will match even if no such goal is present in the I-string, but then nothing will be assigned to <X>.

((U | g•)∗)[45, 45] (U | (g•)<X>)∗ .    (18)

• Find two goals in less than 10 min.

(g• | U)∗ (g•U∗g•)(0, 10) (g• | U)∗ .    (19)

The match/fail corresponds to YES/NO for this query.

Electrocardiographic Diagnosis

The record of an electrocardiogram can be encoded into an I-string by D-symbols for notable peaks and an I-symbol v+ for the time filler (Fig. 5). For example, the code might be

v200 p• v89 q• v50 r• v23 s• v270 t• v180 p• v90 q• v57 r• v19 s• v260 t• . (20)

Fig. 5. An I-string example for ECG: v200 p• v89 q• v50 r• v23 s• v270 t• v180 p• v90 q• v57 r• v19 s• v260 t•

Here we show that various conditions can be expressed by I-regular expressions. Let U• ≡ (p• | q• | r• | s• | t• | · · · ), V ≡ (U• | v+), and R• ≡ (p• | q• | s• | t• | · · · ) (all D-symbols except r•).

• Find the rapid heart beats. Equivalently, in I-regular expression, "find the portion where the time interval between successive r•s is less than 400 ms."

V∗ (r• (R• | v+)∗ r•)(0, 400] V∗ .    (21)

• Find the portion of the heart failure. Equivalently, in I-regular expression, "find the portion where three R-to-R intervals which are at least 600 ms long are followed by an R-to-R interval which is at most 400 ms long."

V∗ (r• (R• | v+)∗ r•)(600,∞) ((R• | v+)∗ r•)(600,∞) ((R• | v+)∗ r•)(600,∞) ((R• | v+)∗ r•)(0, 400] V∗ .    (22)

4.4 Matching Preference Directive

Selection from Multiple Solutions

Now let us consider the situation in which the constraints for the accepted path (see Sect. 5.2 for the definition) have multiple solutions that cannot be decided uniquely. Since an I-symbol may be arbitrarily partitioned, and vice versa, the <X> of an I-regular expression matching may contain ambiguity in I-length. One typical example is the pattern v+v+. When this pattern matches an I-string v5, the first v+ can take an arbitrary I-length between 0 and 5. Before extracting substrings, such I-lengths should be settled.

For this purpose, the optional directives "long" and "short" declare the preference in the I-length. A sub-I-regular expression with the "long" directive is assigned the longest possible substring, and similarly for "short." For instance, when an I-regular expression

(v+)[2,∞) (v+)long    (23)

matches an I-string v5, the latter part takes the priority of getting the longest possible substring, v3, leaving the shortest substring, v2, for the former part. Let the I-length of (v+)[2,∞) be x and the I-length of (v+)long be y. The search process generates the constraints x + y = 5 and x ≥ 2 at its success. As we express our preference that y should be as long as possible, we have y = 3 and x = 2 as the solution. The shortest I-length without a lower bound is an implementation-dependent small value predefined by the user, say 0.0001. When an I-regular expression

v+ (v+)long    (24)

matches an I-string v5, the substrings for the former and latter parts would be v0.0001 and v4.9999, respectively. For the input I-string wl and the I-regular expression w+∗, the sequence of transitions from the initial state to the final state may become arbitrarily long, because the time duration for each match may become arbitrarily small. In the next section, we show a more tractable depth-first search model.
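As a toy illustration of how the preference settles the I-lengths in (23) (our own sketch, not from the chapter): with the constraints x + y = 5 and x ≥ 2, preferring the longest y simply pushes x to its lower bound.

def settle_long(total, x_lower):
    """Maximize y subject to x + y == total and x >= x_lower."""
    x = x_lower              # the non-preferred part takes its minimum
    y = total - x            # the part marked "long" takes everything left
    return x, y

assert settle_long(5.0, 2.0) == (2.0, 3.0)    # x = 2, y = 3, as in the text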

5 I-String Recognition by I-Automaton

In this section, a declarative definition of string acceptance/rejection by an NFA, in terms of a path, is given. The language defined by a conventional regular expression can be recognized by a nondeterministic finite automaton (NFA). We adopt the same scheme for I-regular expressions. An I-regular expression is translated into an equivalent nondeterministic finite I-automaton (I-NFA), which is an extension of the conventional NFA. I-string recognition is done using this I-NFA. We first review the conventional regular expression and NFA, and then extend them to the I-NFA.

5.1 Conventional Nondeterministic Finite Automaton (NFA)

The conventional nondeterministic finite automaton [6] is defined as

(Q,Σ, δ, q0, F ), (25)

where Q is a set of states, Σ is the alphabet, δ : Q × Σ → 2^Q is a state transition function, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states. An automaton can change its state from q to q′ if q′ ∈ δ(q, s), by reading a symbol s or without reading any symbol (a transition by ε). This is a state transition, and the symbol s is its transition symbol.

Fig. 6. Translation rules from a regular expression to an NFA: ε and each symbol m• ∈ Σ• become a single transition from i to f; for (r1 | r2), the two sub-automata share the state i of r1 and r2 and the state f of r1 and r2; for (r1r2), the state f of r1 becomes i of r2; for (r∗), new states i and f are created and ε transitions are added

For a given regular expression r, an NFA that recognizes the language L(r) can be obtained by recursively applying the rules shown in Fig. 6. In a diagram of an NFA, each state is drawn as a small circle. We may draw its identifier, for example q3, within the circle when necessary. The initial state and the final states are drawn as double circles, labeled with "i" and "f," respectively. An arrow between two states represents a possible state transition defined by δ. Its transition symbol is labeled along each arrow. An enclosed region labeled r, r1, or r2 represents a sub-NFA.

The NFA produced by the translation from a conventional regular expression r recognizes L(r). This can be proved by induction on the construction of the regular expression.

5.2 Conventional String Recognition by NFA

For the conventional finite automaton, the acceptance/rejection of an input string is defined using a path, which is a sequence of states and transitions. A path is a track of state transitions on an NFA. A sequence is a path on an NFA iff each state qi+1 in the sequence, except for the first one, is the result of the transition function δ applied to the previous state qi with a transition symbol ti+1 (qi+1 ∈ δ(qi, ti+1)). For example, the following is a path from state q0 to q5:


i                     0    1    2    3    4    5
State                 q0 → q2 → q1 → q4 → q2 → q5
Transition symbol          s1   s2   ε    s3   s4

where each transition is defined like q2 ∈ δ(q0, s1), q1 ∈ δ(q2, s2), and so on. An input string s1s2 · · · sm (si ∈ Σ) is accepted if there is a path that satisfies all the following conditions:

1. The first state q0 of the path is the initial state of the NFA.
2. The last state of the path is in the set of final states F (qn ∈ F).
3. The sequence of transition symbols is equivalent to the input string. In the above example, the input string s1s2s3s4 is equivalent to the sequence of transition symbols s1s2εs3s4, since ε means the empty string.
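These three conditions can be checked by a straightforward search; the following is a compact sketch of such a check (our own illustration; the dict-based encoding of δ, with None standing for ε, is an assumption).

def nfa_accepts(delta, q0, finals, symbols):
    """delta: dict mapping (state, symbol-or-None) -> set of next states."""
    stack, seen = [(q0, 0)], set()
    while stack:
        q, i = stack.pop()
        if (q, i) in seen:
            continue
        seen.add((q, i))
        if i == len(symbols) and q in finals:
            return True                      # conditions 1-3 all hold
        for nq in delta.get((q, None), ()):  # epsilon transition
            stack.append((nq, i))
        if i < len(symbols):
            for nq in delta.get((q, symbols[i]), ()):
                stack.append((nq, i + 1))
    return False

# toy NFA for the regular expression (ab)* over {a, b}
delta = {("i", "a"): {"m"}, ("m", "b"): {"i"}}
assert nfa_accepts(delta, "i", {"i"}, list("abab"))
assert not nfa_accepts(delta, "i", {"i"}, list("aba"))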

5.3 Nondeterministic Finite I-Automaton (I-NFA)

Now we are ready to introduce the I-automaton. The I-automaton has two additional constructs compared with the conventional finite automaton [2]. An I-automaton is defined as

(Q, (Σ•, Σ+), δ, q0, F, Γ ), (26)

where we use symbols in Σ+ as transition symbols in addition to Σ• ∪ {ε}, and Γ = {(qi, qf, Qi, Λ), . . .} is a set of I-constraints.

To graphically represent an I-constraint, we draw a dotted box around Qi, and place qi and qf on the border. By definition, if the I-automaton is not trivial, qi has a transition from the outside of the dotted box, and qf has a transition to the outside of the dotted box. So, we can distinguish qi and qf on the diagram. Λ is placed just below the right bottom corner of the dotted box. For example, in Fig. 8, the I-length from state q2 to q3 should be greater than 0 and less than or equal to 30.

For a given I-regular expression r, an I-automaton which recognizes the language L(r) can be obtained by recursively applying the rules shown in Fig. 7. A translation example is shown in Fig. 8. The I-automaton produced by the translation from an I-regular expression r recognizes L(r). This can be proved by induction on the construction of the I-regular expression.

An I-constraint (qi, qf, Qi, Λ) should satisfy the following conditions:

Fig. 7. Extended translation rules from I-regular expression to I-automaton: an I-symbol v+ ∈ Σ+ becomes a single transition from i to f labeled v+; an I-constraint rΛ becomes the sub-automaton for r enclosed in a dotted box annotated with Λ


Fig. 8. The I-regular expression v+ ((v+)(0, 30] g• (v+)(0, 10])[15,∞) (v+ | g•)∗ and the translated equivalent I-automaton with states q0, . . . , q9; the dotted box for the I-constraint (0, 30] spans q2 to q3, the box for (0, 10] encloses the corresponding v+ transition, and the outer box is annotated with [15,∞)

• qi, qf ∈ Qi, Qi ⊆ Q and qi ≠ qf. qi and qf are the entrance and exit states, respectively.
• Λ is a non-negative real interval, and each end of the interval may be open or closed, independently.
• All transitions from the outside of Qi (that is, Q − Qi) into Qi should be to the entrance state qi.
• All transitions from Qi to the outside (that is, Q − Qi) should be from the exit state qf.

For any (qi, qf, Qi, Λ) ∈ Γ and (q′i, q′f, Q′i, Λ′) ∈ Γ, if Qi ∩ Q′i ≠ ∅ then either Qi ⊆ Q′i or Q′i ⊆ Qi holds.

Accepting/Rejecting an I-String

An I-path is a path with an I-length for each transition. For example, the following is an I-path:

i                      0    1    2    3    4    5
State                  q0 → q2 → q1 → q4 → q2 → q5
Transition symbol ti        v+   v+   ε    m•   w+
I-length xi                 5.5  0.2  0    0    1.3

For an I-automaton (Q, Σ, δ, q0, F, Γ) where Γ = {(qi, qf, Qi, Λ), . . .}, an input I-string is accepted by the I-automaton iff there exists an I-path p that satisfies the following conditions; otherwise, it is rejected.

• The first state of p is the initial state q0.
• qi+1 ∈ δ(qi, ti+1).
• The input I-string is symbol-equivalent to the sequence of transition symbols t1t2 · · · tn. An I-string and a sequence of transition symbols are symbol-equivalent iff they become the same sequence by the following normalization:
  – Replace all I-lengths with + in the I-string, resulting in a sequence of symbols of Σ• ∪ Σ+. For example, the I-string v3.8v0.4εm•w2.1 becomes the sequence v+v+εm•w+.


  – Replace successive occurrences of identical I-symbols, for instance v+v+, with one symbol v+ (and remove ε) from both sequences. This is similar to the I-normal form for an I-string (see Sect. 3.2).

For example, the I-string v3.8v0.4εm•w2.1 is symbol-equivalent to the sequence of transition symbols v+εm•w+w+, since both of them are normalized to v+m•w+. As another example, a7b3b5m•a2 is symbol-equivalent to a+b+m•a+. (A small code sketch of this normalization is given after the list below.)

• The last state of p is one of the final states in F.
• For any subsequence qj tj+1 qj+1 · · · tk qk with qj−1 ∉ Qi and qk+1 ∉ Qi, and any I-constraint (qj, qk, Qi, Λ), I(tj+1 · · · tk) ∈ Λ must hold.
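The normalization used in the symbol-equivalence condition is mechanical; here is a small sketch of it (our own illustration), using None for ε:

def normalize(seq, d_symbols):
    """seq: symbols such as "v", "m", or None for epsilon."""
    out = []
    for sym in seq:
        if sym is None:                      # remove epsilon
            continue
        if out and out[-1] == sym and sym not in d_symbols:
            continue                         # collapse v+ v+ -> v+
        out.append(sym)
    return out

d = {"m"}
# I-string v3.8 v0.4 eps m w2.1 vs. transition symbols v+ eps m w+ w+
istring_syms = ["v", "v", None, "m", "w"]
trans_syms   = ["v", None, "m", "w", "w"]
assert normalize(istring_syms, d) == normalize(trans_syms, d) == ["v", "m", "w"]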

6 I-String Recognition Algorithm

6.1 Recognition by Depth-First Path Enumeration

The state transitions of a conventional NFA are "discrete." On the other hand, on an I-NFA, in addition to the discrete ones, the I-length for each transition should be taken into consideration. This means that the number of possible I-length paths might be infinite. In order to make the path search tractable, we introduce an algorithm in which each input symbol is handled by state transitions, and the I-lengths are handled by linear inequality constraints. Starting from the initial state q0 of the NFA, the recognition algorithm repeats state transitions in a depth-first manner to enumerate symbol-equivalent paths, reading each input symbol s1, s2, . . . , sm one by one. Each time the algorithm makes a transition, the satisfiability of the constraints on the I-lengths is checked. If they are not satisfiable, the branch of symbol-equivalent path enumeration fails and the algorithm backtracks. When all the symbols have been read, if the state is one of the final states in F, then the input I-string is accepted. If the input I-string is not accepted under any nondeterministic choice of transitions, the input I-string is rejected.

In the algorithm, we use an extended I-path: some I-lengths can be left as variables, and constraints on those variables may be recorded. We call this an "I-path with constraints." In the following, we assume that the input I-string is in normal form. Thus, there is no possibility that adjacent input symbols are identical.

I-String Recognition Algorithm

We assume the input I-string is s1s2 · · · sm (si ∈ Σ) in the following.

1. Initialization
   (a) Candidate path set: let the candidate path set be the ε-closure from the initial state q0, then choose one of its members as the current path p.
   (b) The symbol in focus: let the symbol in focus be si (i = 1).


   (c) Constrained variables: let the current set of constraints be empty. Then, for each I-symbol si, prepare a variable pi representing the length of that symbol, and add a constraint pi ≤ I(si). During the course of execution, the algorithm adds another constraint pi = I(si) after completing the state transitions for si, to make pi exactly equal to the I-length I(si) of the symbol.

2. Symbol-equivalent path enumeration: for the symbol si, pick one possible transition from the end of the current p which has not yet been tried.
   (a) If no such transition remains, then backtrack.
   (b) If the new state has already been visited and the solution of the current constraints is subsumed by the previous constraints, then backtrack (see Sect. 6.2).

3. I-length assignment for each transition: if the transition symbol of the transition is an I-symbol, create a variable xj associated with the transition, representing the possible I-length. Add a constraint that the variable is positive and is equal to or less than the I-length of the input symbol: 0 < xj ≤ I(si). Also, update the constraint for pi to accumulate the I-length for symbol si, so that pi = (former pi) + xj.

4. I-constraint handling:
   (a) For each open I-constraint γk, the I-length expression associated with the I-constraint is increased by the variable xj newly introduced for the transition. If there is no solution to the constraints, then backtrack.
   (b) If the new state goes into an I-constraint γk, then a variable zk(1) for the I-length inside γk is generated. If the same I-constraint is visited again, a new variable, say zk(2), is generated. Then, a constraint zk(1) < upperBound(γk) or zk(1) ≤ upperBound(γk) is added, depending on whether the upper bound of γk is open or closed. Check whether the whole constraint system can have a solution. If there is no solution to the constraints, then backtrack.
   (c) If the new state goes out of an I-constraint γk, then the constraint lowerBound(γk) < zk(1) or lowerBound(γk) ≤ zk(1) is added, depending on whether the lower bound of γk is open or closed. Check whether the whole constraint system can have a solution. If there is no solution to the constraints, then backtrack.

5. Advance the focus to the next input symbol si+1:
   (a) Acceptance/rejection check: if the new state is a final state, all input symbols have been read, and the constraints have a solution, then terminate the execution and accept the input I-string.
   (b) If the previous symbol was an I-symbol, add the constraint that the variable pi associated with the previous I-symbol equals the I-length of the symbol: pi = I(si). If there is no solution to the constraints, then backtrack.
6. Repeat the algorithm.


If backtracking from the initial state occurs, there is no more choice, so the input I-string is rejected.
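The full algorithm maintains a general linear constraint system; the following is a deliberately simplified sketch of the core idea for a single input I-symbol (our own illustration, not the authors' implementation). Following the simplified convention used later for Fig. 11, an I-constraint over a single v+ transition is modeled as a per-transition cap on the consumed I-length; D-symbols, lower bounds, nested I-constraint boxes, and the full constraint bookkeeping are all omitted.

def end_states_for_symbol(delta, start, total):
    """States where consuming exactly `total` time units may stop.
    delta: dict state -> list of (next_state, cap).  Each transition consumes
    an unknown x with 0 < x <= cap, so a path with caps c1..ck can realize
    any total in (0, c1 + ... + ck]."""
    best = {}                  # state -> largest cap-sum seen (cut-off, Sect. 6.2)
    stack = [(start, 0.0)]
    ends = set()
    while stack:
        q, reach = stack.pop()
        for nq, cap in delta.get(q, ()):
            r = reach + cap
            if r >= total:
                ends.add(nq)   # feasible: the x's can be chosen to sum to total
            if best.get(nq, -1.0) < min(r, total):
                best[nq] = min(r, total)     # cap growth so loops terminate
                stack.append((nq, min(r, total)))
    return ends

INF = float("inf")
# fragment in the spirit of Fig. 11: i --v+--> 1 --v+(0,30]--> 2, reading v50
delta = {"i": [("1", INF)], "1": [("2", 30.0)]}
assert end_states_for_symbol(delta, "i", 50.0) == {"1", "2"}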

6.2 Redundant Path Enumeration Cut-Off

The above algorithm enumerates paths and checks whether each of them satisfies the constraints. At any point of the path enumeration, the remaining behavior from that point is determined by the state tuple of the algorithm:

(si, qj , C(pi, zh1 , zh2 , . . .)), (27)

where si is the symbol in focus, qj is the current state of the I-NFA, and C(pi, zh1, zh2, . . .) is a set of constraints on pi (the I-length assigned so far for si) and on zh1, zh2, . . . (the I-lengths assigned so far for the currently open I-constraints). Since these I-constraints are open, qj should be inside them. The search in the algorithm will be redundant if the same or subsumed state tuples (27) appear more than once.

The simplest example is Fig. 9a. Suppose that the algorithm is processing the second symbol s2 = v7.2 of the I-normal form of an I-string w4v7.2, and that the transition for s2 started from state q1. When the algorithm comes to q1 → q2 → q3, the state tuple of the algorithm will be:

(s2, q3, {“0 < p2 ≤ I(v7.2) = 7.2”}) (28)

The constraint 0 < p2 is implied by the definition of the I-length of an I-symbol, while p2 ≤ I(v7.2) = 7.2 comes from the I-length of s2. No zhk appears in the constraints because no I-constraint appears in this I-NFA. When the algorithm advances to the point q1 → q2 → q3 → q2 → q3, the state tuple of the algorithm would be the same as (28). So, the rest of the search from the latter path will be cut off, since it is redundant.

For the I-NFA shown in Fig. 9b, the constraint on p2 seems to change forever, like:

“1 ≤ p2 ≤ 4”, “2 ≤ p2 ≤ 8”, “3 ≤ p2 ≤ 12”, . . . . (29)

But, by taking the upper bound I(v7.2) into consideration, the algorithm would proceed as follows:

Fig. 9. Examples of redundant path enumeration cut-off (1): (a) an I-NFA q1, . . . , q4 with a v+ cycle between q2 and q3; (b) the same automaton with the I-constraint [1, 4] on the cycle


Fig. 10. Example of redundant path enumeration cut-off (2): an upper v+ path q1 → q2 → q3 → q6 → q7 → q8 → q11 with I-constraints [1, 4] (on q2 → q3) and (3, 4] (on q7 → q8), and a lower v+ path q1 → q4 → q5 → q6 → q9 → q10 → q11 with I-constraints [2, 3] (on q4 → q5) and (2, 5] (on q9 → q10); an outer I-constraint [5, 17) encloses the latter part of the automaton

q1 → q2 → q3 :                     (s2, q3, {"1 ≤ p2 ≤ 4"}),
q1 → q2 → q3 → q2 → q3 :           (s2, q3, {"2 ≤ p2 ≤ I(v7.2) = 7.2"}),
q1 → q2 → q3 → q2 → q3 → q2 → q3 : (s2, q3, {"3 ≤ p2 ≤ I(v7.2) = 7.2"}).    (30)

Since the range "3 ≤ p2 ≤ I(v7.2) = 7.2" is subsumed by the previous one "2 ≤ p2 ≤ I(v7.2) = 7.2", the rest of the search will be cut off.
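In the simple case where the constraints on p2 reduce to a single interval, the cut-off test is just interval containment; here is a sketch of our own (the joint comparison in (p2, z) needed when I-constraints are open, discussed next, is ignored here):

def subsumed(new, old):
    """new, old: (lo, hi) solution intervals for p_i."""
    return old[0] <= new[0] and new[1] <= old[1]

visited = {}   # (symbol index, NFA state) -> list of (lo, hi)

def should_cut_off(key, interval):
    for seen in visited.get(key, []):
        if subsumed(interval, seen):
            return True                 # redundant: prune this branch
    visited.setdefault(key, []).append(interval)
    return False

# the example above: "2 <= p2 <= 7.2" recorded, then "3 <= p2 <= 7.2" arrives
assert not should_cut_off((2, "q3"), (2.0, 7.2))
assert should_cut_off((2, "q3"), (3.0, 7.2))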

If the state qj is within a still-open I-constraint, the constraints on the I-length for each such I-constraint should be taken into account, in addition to the constraints on pi. This situation is illustrated by comparing two paths of the I-NFA in Fig. 10. Both the upper path

q1 → q2 → q3 → q6 → q7 → q8 → q11 (31)

and the lower path

q1 → q4 → q5 → q6 → q9 → q10 → q11 (32)

have the same end state q11, which is in the still-open outer I-constraint. Let x2, x4, x7, x9 be the I-lengths assigned for q2 → q3, q4 → q5, q7 → q8, q9 → q10, respectively, and z be the I-length assigned so far for the outer I-constraint. The set of constraints for the upper path would be

1 ≤ x2 ≤ 4, 3 < x7 ≤ 4, 0 < p2 ≤ I(s2), p2 = x2 + x7, z < 17, z = x7 (33)

while that for the lower path would be

2 ≤ x4 ≤ 3, 2 < x9 ≤ 5, 0 < p2 ≤ I(s2), p2 = x4 + x9, z < 17, z = x9 .    (34)

These two paths are not regarded as redundant, since the solutions of (33) and (34) with respect to (p2, z) are not the same. Notice that if these constraints are solved just with respect to p2, both solutions will be 4 < p2 ≤ I(s2) = 7.2.

The lower bound of the outer I-constraint, 5 ≤ z, is not included because the I-constraint is still open. This constraint will be added when the path goes out from it. Similarly, the constraint p2 = I(s2) will be added when the algorithm tries to proceed to the next symbol s3, that is, when terminating the process for s2 at state q11.


Fig. 11. I-automaton for I-regular expression r

6.3 I-String Recognition Example

Now let us see the matching process for an I-regular expression against an I-string, as shown below. Let the I-regular expression be

v+ ((v+)(0, 30] g• (v+)(0, 10]) (v+ | g•)∗ ,

and the I-string be v50 g• v100 g• v200. First, we translate the I-regular expression into the I-automaton shown in Fig. 11 (illustrated using a simplified diagram). In this I-automaton, some ε transitions are omitted for simplicity, and the transition by the symbol v+ with interval constraint (0, 30] is shown as if it were a transition by the symbol v(0,30].

The I-string recognition algorithm works on the input I-string as shown in Table 4. In the table, the execution advances from top to bottom. The symbol in focus is shown in the leftmost column. The chosen transition is shown in the center column; some transitions are combined into a single row in order to save space. The variable over a transition arrow → represents the I-length associated with the transition. "fail" shows that there is no solution for the constraints or no candidate for a transition. Constraints are shown in the right-hand-side column: a new constraint is shown in the center of the column, the number on the left of the column is used for referring to the constraint, and all the constraints effective at that point of the execution are shown by numbers on the right of the column.

7 Conclusion

In this chapter, we modeled continuous media as I-strings. As a pattern specification language on I-strings we introduced the I-regular expression, and as the machine which recognizes the language, we introduced the I-automaton.

The I-regular expression provides a relatively simple but expressive means to specify patterns on a mixture of continuous/discrete notions. The continuous notion (time duration) is handled by the constraint system, and the discrete notion (D-symbols and symbol characters) is handled by the state machine. The preference on I-length enables the user to control the matching preference.

A limitation of the model is that, for each positive interval, multiple attributes cannot be associated with the media explicitly.


Table 4. The search process for an I-automaton
(input I-string v50 g• v100 g• v200 = s1s2s3s4s5; i is the initial state; pj accumulates the I-length for sj; xk > 0, k = 1, 2, . . .)

Symbol  Path                                Constraints (new; effective)
v50     p1 = i –x1→ 1                       (1) p1 = x1 = 50; {1}
g•      p2 · · · no candidate : fail        (2) φ
v50     p1 = i –x2→ 1 –x3→ 2                (3) p1 = x2 + x3 = 50, 0 < x3 ≤ 30; {3}
g•      p2 = 2 → 3                          (4) ; {3}
v100    p3 = 3 –x4→ 4                       (5) p3 = x4 = 100, 0 < x4 ≤ 10 · · · no solution : fail
v100    p3 = 3 –x5→ 4 → f                   (6) p3 = x5 = 100, 0 < x5 ≤ 10 · · · no solution : fail
v100    p3 = 3 –x6→ 4 → 5                   (7) p3 = x6 = 100, 0 < x6 ≤ 10 · · · no solution : fail
v100    p3 = 3 –x7→ 4 → 5 –x8→ 6            (8) p3 = x7 + x8 = 100, 0 < x7 ≤ 10; {3, 8}
g•      p4 · · · no candidate : fail        (9) ; {3}
v100    p3 = 3 –x9→ 4 → 5 –x10→ 6 → f       (10) p3 = x9 + x10 = 100, 0 < x9 ≤ 10; {3, 10}
g•      p4 · · · no candidate : fail        (11) ; {3}
v100    p3 = 3 –x11→ 4 → 5 –x12→ 6 → 5      (12) p3 = x11 + x12 = 100, 0 < x11 ≤ 10; {3, 12}
g•      p4 = 5 → 6                          (13) ; {3, 12}
v200    p5 · · · no candidate : fail        (14) ; {3, 12}
g•      p4 = 5 → 6 → f                      (15) ; {3, 12}
v200    p5 · · · no candidate : fail        (16) ; {3, 12}
g•      p4 = 5 → 6 → 5                      (17) ; {3, 12}
v200    p5 = 5 –x13→ 6                      (18) p5 = x13 = 200; {3, 12, 18}
(no more symbols)  no final state : fail    (19) ; {3, 12}
v200    p5 = 5 –x14→ 6 → f                  (20) p5 = x14 = 200; {3, 12, 20}
(no more symbols)  final state : success    (21) ; {3, 12, 20}

Resulting path at success:
v50     p1 = i –x2→ 1 –x3→ 2                (3) p1 = x2 + x3 = 50, 0 < x3 ≤ 30
g•      p2 = 2 → 3                          (4)
v100    p3 = 3 –x11→ 4 → 5 –x12→ 6 → 5      (12) p3 = x11 + x12 = 100, 0 < x11 ≤ 10
g•      p4 = 5 → 6 → 5                      (17)
v200    p5 = 5 –x14→ 6 → f                  (20) p5 = x14 = 200


For example, if the first 15 min is "drama" and "classic," then we associate it with an attribute "classic drama." The "classic drama" consists of "classic" and "drama"; such an algebraic structure of attributes is discussed in [3, 5, 7] already, and is omitted from our chapter.

References

1. James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.

2. John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.

3. H. Khalfallah and A. Karmouch. An architecture and a data model for integrated multimedia documents and presentational approach. ACM Multimedia Systems, 3(5–6):238–250, 1995.

4. Ken Nakayama, Kazunori Yamaguchi, and Satoru Kawai. I-regular expression: regular expression with continuous interval constraints. In CIKM '97: Proceedings of the Sixth International Conference on Information and Knowledge Management, pages 40–50, New York, NY, USA, 1997. ACM.

5. E. Oomoto and K. Tanaka. OVID: Design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering, 5(4):629–643, 1993.

6. D. Perrin. Finite automata. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 1–57. The MIT Press/Elsevier, Cambridge, MA/Amsterdam, 1990.

7. R. Weiss, A. Duda, and D. K. Gifford. Composition and search with a video algebra. IEEE Multimedia, 2(1):12–25, 1995.

8. Gerhard A. Schloss and Michael J. Wynblatt. Providing definition and temporal structure for multimedia data. ACM Multimedia Systems, 3(5–6):264–277, 1995.

9. Setrag Khoshafian and A. Brad Baker. MultiMedia and Imaging Databases, chapter 7.2, pages 333–338. Morgan Kaufmann, San Francisco, 1996.


Part III

Computational Intelligence in Image/Audio Processing


Noise Filtering of New Motion Capture Markers Using Modified K-Means

J.C. Barca, G. Rumantir, and R. Li

Department of Information Technology, Monash University, Melbourne, Australia
[email protected], [email protected], [email protected]

Summary. In this report, a detailed description of a new set of multicolor Illuminated Contour-Based Markers to be used for optical motion capture, and a modified K-means algorithm that can be used for filtering out noise in motion capture data, are presented. The new markers provide solutions to central problems with current standard spherical flashing-LED-based markers. The modified K-means algorithm, which can be used for removing noise in optical motion capture data, is guided by constraints on the compactness and number of data points per cluster. Experiments on the presented algorithm and findings in the literature indicate that this noise-removing algorithm outperforms standard filtering algorithms such as Mean and Median, because it is capable of completely removing noise with both Spike and Gaussian characteristics. The cleaned motion data can be used for accurate reconstruction of captured movements, which in turn can be compared to ideal models so that ways of improving physical performance can be identified.

1 Introduction

This report is part of a body of research that aims to develop an automated intelligent personal assistant, which can facilitate classification of complex movements and assist in goal-related movement enhancement endeavors. The overall research is divided into two major phases. The first of these two phases aims to develop a personal assistant that will support athletes in improving their physical performance. To construct this personal assistant, a new cost-effective motion capture system, which overcomes the limitations of existing systems and techniques that support intelligent motion capture recognition, must be developed. Phase two of the overall research focuses on developing a physical prototype of the Multipresence system suggested by the author in [1, 2]. This Multipresence system will be constructed in a way that allows the personal assistant to control it using intelligent motion capture recognition techniques.


The report will focus on the first phase of the overall research, and to complete this phase a number of areas must be investigated. These areas include:

1. Camera and volume calibration
2. Construction of a new marker system which does not have the limitations associated with Classical spherical flashing-LED-based markers
3. Motion data capturing and pre-processing
4. Noise filtering
5. Marker centre point estimation
6. 2D to 3D conversion of marker coordinates
7. Construction, fitting and temporal updating of skeleton
8. Development of an intelligent motion recognition system

A brief overview of general motion capture techniques is provided first, with the focus being on marker-based optical motion capture. Proposed solutions to points two, three and four from the above list will then be explained in detail. As a response to point two, a new set of multicolor Illuminated Contour-Based Markers [3] is presented. A dimensionality reduction procedure, which simplifies the captured motion data such that further processing becomes less complex, is then proposed as a solution to point three. Finally, a modified K-means algorithm [4], which can be used for inter-frame noise reduction in images with optical motion capture data, is presented as a solution to point four.

1.1 Motion Capture

Motion capture systems are tools for accurately capturing complex real-world movements. Typically these captured movements are used in the movie, animation and games industries, where high-quality representations of movements are required in order to support suspension of disbelief. More recently, motion capture has also been used as a tool to aid in human motion analysis. Results from this kind of analysis can be used to identify ambiguities in the physical performance of athletes, or to assist in diagnosing people with illnesses that affect their movement [5]. Some research also indicates that motion capture can be used for controlling humanoid robots [6].

There is a range of different motion capture technologies available. These technologies span from optical, magnetic, mechanical, structured light, radio frequency and acoustic systems to wearable resistive strips and inertial sensing systems, or combinations of the above [7]. All these technologies have varying degrees of drawbacks. Optical, acoustic and structured light systems suffer from occlusion problems, magnetic and radio frequency trackers suffer from noise and echo problems, mechanical systems have a non-user-friendly interface that undermines immersion, inertial sensors suffer from bias and drift errors, while resistive strips must be built into a body suit, which makes them difficult to calibrate for different users [7–10]. Another drawback with many of the abovementioned systems is that they are high-end and therefore quite expensive, which makes it hard for many individuals and small companies to acquire the necessary technology [5, 11].

The optical approach to motion capture has been selected for this research. The reason for this is that the intention is to capture movements in controlled environments, so the occlusion problems usually associated with the optical approach will be limited. Other reasons for choosing this approach are that this class of systems has proved to support accurate capturing, has only limited noise problems, does not suffer from echo problems, and there are cost-effective ways to construct these systems. Optical systems can also easily be designed in a way that does not limit the user's freedom of movement. Another important factor for selecting the optical approach is that capturing can be performed in real time.

1.2 Optical Motion Capture

What systems that use the optical approach to motion capture have in common is that they use cameras as sensors. In general, this class of systems can be divided into two subcategories, referred to as marker-less and marker-based approaches to optical motion capture. This research will focus on marker-based approaches, because currently only these can track complex and detailed motions effectively enough to support real-time processing [11].

1.3 Marker-Based Tracking

In early motion capture systems, most contour points of the tracked subject were suppressed in order to achieve real-time processing. The points that were not suppressed were referred to as markers [12]. Today, in order to qualify as a marker, an object must contain two pieces of information: what the object is in relation to the current process, and where this object is located [13]. Currently there are two main types of markers: Passive and Active. Both these marker types will be described briefly below.

Passive Markers

The characteristic of passive marker systems is that the markers must be manually identified. A Classical passive system is constructed of spheres that are 2.5 cm in diameter and are covered with a highly reflective material that often is over two thousand times brighter than a normal white surface [14]. The material covering the marker reflects light (in many cases infrared) projected from light sources positioned around the lens of each camera. These reflections give the markers a distinctive color compared to the rest of the image and therefore support marker extraction. A Classical passive marker is shown in Fig. 1.

give the markers a distinctive color compared to the rest of the image andtherefore support marker extraction. A Classical passive marker is shown inFig. 1.

The main drawback with passive systems is that either a trained human operator or a specific start-up pose of the performer is required for identifying the markers. A second drawback is that even if all markers have been correctly identified initially, their ID will be lost after an occlusion. As a result of this, it seems like a new unknown marker emerges when an occluded marker reappears [16]. These occlusions can, in addition to contributing to the generation of false markers, create holes in incoming data streams [17–20].

Active Markers

What active marker systems have in common is that they express sufficient information to support automatic marker identification. There are several variations of active marker systems, such as the square markers presented by [21] and the retro-reflective mesh-based markers presented by [6], but the most commonly used active marker is constructed of sets of spherical flashing light emitting diodes (LEDs). Each of the LEDs in these commonly used markers is wired to an external computer, which provides them with distinctive flash sequences that allow each marker to communicate its ID automatically. The computer also ensures that the markers "flash" in synchronization with the digital shutters of the capturing cameras [16, 22, 23].

A drawback with Classical spherical LED-based active markers is that more than one image must be analyzed in order to identify each marker, which makes the processing time longer than if methods that support more direct identification were used. One such direct method is to use static colors to express IDs rather than flash sequences. The problem here is that colors tend to change when they are exposed to different lighting [24]. Knowledge about the motion of tracked markers has therefore been used to support the color cue, but there are difficulties associated with this approach as well. This is because of severe discontinuities in human motion and delay in frame processing [25, 26].


A second problem with the flashing-LED-type active markers is that the wires that run from the markers to the computer restrict the user's freedom of movement [22, 23]. The result is that captured movement in some situations can appear unnatural, and that the tracking process may be too cumbersome for use in some applications, especially in medical applications where users may have some kind of movement disability. Both the former and the latter are highly undesirable. The former because a tracking system that in any way makes the movement appear unnatural undermines one of the central aims of motion capture, which is to capture realistic movement (this is also the drawback with the constraints posed on the users of the markers presented by [24]). The latter is undesirable because a system design that makes the tracking process cumbersome prevents a range of people from benefiting from the technology. A third drawback with using flashing spherical LED-type markers is, as with spherical passive markers, that they easily create holes in incoming data streams as a result of occlusions.

1.4 Proposed Solution to Drawbacks with Classical Markers

To solve and/or reduce the abovementioned drawbacks with current marker systems, the researcher proposes the set of active multicolor Illuminated Segment-Based Markers described by the author in [3]. These markers express their identity using different pairs of long intersecting line segments with internally produced static colors. These colors are illuminated into the environment and are therefore more robust towards changes in external lighting than colors produced by reflected light. This way of solving the identification problem gives the markers an edge over Classical spherical LED-based active markers, because static color cues allow markers to be identified within one single image, rather than through a sequence of images, and therefore allow for a reduction of processing time. The use of static colors also eliminates the need for wiring markers to a complex external computer, removing the restrictions usually posed on user movement by Classical flashing-LED-based marker systems. Another central strength of the Illuminated Segment-Based Markers is that they support more robust estimation of missing data than traditional markers, because the proposed markers allow for both intra-frame interpolation of missing data and inter-frame estimation of occluded intermediate sections of line segments. This strength is highlighted by the fact that the Illuminated Segment-Based Markers are designed to be larger than traditional markers and therefore have a greater chance of retaining enough data to estimate missing marker positions inter-frame than Classical markers. This in turn results in a reduced chance of having to assume intra-frame linearity in the case of occlusions.

Design specifics and results from experiments on the Illuminated Segment-Based Markers are described at greater length in Sects. 2 and 3.



1.5 Characteristics of Optical Motion Capture Data

High dimensionality and noise are naturally embedded in time series data and make it a challenging task to process sequences of motion data [27]. To solve this problem in an effective way, initial processing should involve a dimensionality reduction procedure, which simplifies the data. Such reductions are typically performed by flattening regions where data varies only gradually, or not at all [28–30]. Noise can in general be defined as any entity that is uninteresting for achieving the main goal of the computation [31]. These uninteresting entities can be introduced into optical motion data by the constant fluctuation of light, interference from background objects, external artifacts that corrupt the analogue-to-digital conversion process, accuracy limitations of sensors, or transmission errors [7, 8, 31]. It is important to note that some types of noise may be invisible initially, but can accumulate over time, resulting in increased data complexity and/or data being incorrectly classified [7, 28, 32, 33]. To avoid this, one should aim to exclude as much noise from the data as possible before main processing is initiated. To remove noise most effectively, one should investigate where it originates and analyze its characteristics, so that the knowledge obtained from this process can be used to design a suitable filtering algorithm for the noise at hand.

2 Experimental Design

In this section, we describe the strengths of the Illuminated Contour-Based Marker System and explain how the markers are assembled. Then a description of the nature of the captured data and an outline of how data is captured and pre-processed is provided. At the end of the section, we present a detailed design overview of the proposed Modified K-means algorithm, which is used for removing inter-frame noise in optical motion capture data.

2.1 The Illuminated Contour-Based Marker System

The Illuminated Contour-Based Markers are constructed from intersecting pairs of 3 mm thick, battery-powered, flexible glow wires of different colors. These glow wires are made of copper wires with phosphorus powder coatings and are protected by insulating plastic in different colors. The wires operate on alternating current supplied by a small battery-driven inverter. When a current is transmitted through a wire, the phosphorus produces an illuminating electroluminescent glow [34]. The appearance of this glow depends on the color of the insulating plastic layer covering the wire. Ten different types of glow wire are available on the market today. A glow wire can be observed in Fig. 2. The glow wires are cut into appropriate lengths, and pairs of wires with different colors are assembled into markers in such a way that the two wires intersect



Fig. 2. Glow wire

Fig. 3. The Illuminated Contour-Based Markers. Each pair of line segments illuminates a set of distinctive colours

and each marker is identifiable by its distinctive color combination. The intersection between the wires is regarded as the marker midpoint. Sets of Illuminated Contour-Based Markers are shown in Fig. 3.

2.2 The Body Suit

The assembled markers are attached to a body suit to be worn by the subject during the motion capture procedure. In order for this body suit to support realistic and accurate tracking, it requires some essential characteristics. First, it must not restrict the user's freedom of movement. Secondly, it is important that the material the bodysuit is constructed of is able to closely follow the movement of the tracked body and stay in place as the skin



Fig. 4. A prototype of the bodysuit with Illuminated Contour-Based Markers attached

moves underneath [35]. After experimenting with different types of materials and suit designs, the researcher found that tight-fitting, lightweight thermal underwear and socks have the abovementioned qualities.

As the body suit needs to be washed after use, the markers are designed to be temporarily attached to the suit using Velcro instead of being permanently attached. As such, strips of Velcro patches were glued to the suit at key positions so that the markers can be attached to them (how these key positions are selected is described below).

In order to allow adjustments of the suit to accommodate small variations in body size and shape, these patches of Velcro were made long enough to allow fine tuning of marker positions. The complete bodysuit can be observed in Fig. 4.

A small battery-driven inverter that supplies the markers with electricity is placed on the lower back region of the body suit. This location has been selected because it causes minimal interference with the user's body movement.

2.3 Marker Placement

To support a motion capturing process with minimal interference from noise, it is important to identify positions on the tracked body that are suitable for marker attachment. These key positions should allow the markers to remain in stable relationships with the underlying skeleton as the body moves. One thing that can affect this relationship is secondary motion in soft body tissue [14, 36]. In order to avoid capturing these secondary movements, the researcher



Fig. 5. Virtual skeleton rigged with Illuminated Contour-Based Markers

has chosen to place the markers on areas of the body where the skin is close to the bone (e.g. elbows, knees and wrists). Figure 5 shows a virtual skeleton rigged with a set of Illuminated Contour-Based Markers.

2.4 The Motion Capture Images

A series of images has been captured of an articulated human body rigged with the Illuminated Contour-Based Markers. These images have an identical size of 720×576 pixels, and the color space used is RGB. All images were captured using four different calibrated cameras, placed in a circle around the capturing volume. More colors appear in the captured images than those used in the original Illuminated Contour-Based Marker System, as a result of small differences in sensing devices within and across these cameras. This results in excessive image complexity, which contributes to increased processing time. To solve this problem, each image is pre-processed (as explained in Sect. 2.5).

As image features change over time and across capturing devices, and to ensure that the proposed system is able to process all features correctly, the images used in the experiments have been selected randomly across both cameras and time steps.

2.5 Data Pre-Processing

To reduce the complexity of captured images, uninteresting image components are filtered out as background in pre-processing using a thresholding technique. Data that is valid for the main processing is compressed into a number of flat colour regions, corresponding to the number of colours used in the marker system. Tolerance values for each of these regions have been determined through multiple trial-and-error experiments.
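As an illustration of this pre-processing step, the following Python sketch flattens an RGB image into labelled flat colour regions by per-channel thresholding. The reference colours and tolerance bands are hypothetical placeholders, not the values used in the experiments, which were found by trial and error as described above.

    import numpy as np

    # Illustrative reference colours (RGB) and tolerances for the marker
    # wires; the actual values were tuned by trial and error.
    MARKER_COLOURS = {
        "red":    (np.array([200,  30,  30]), 60),
        "orange": (np.array([230, 140,  30]), 60),
        "green":  (np.array([ 30, 200,  60]), 60),
        "purple": (np.array([150,  40, 200]), 60),
        "blue":   (np.array([ 30,  60, 220]), 60),
    }

    def flatten_image(image):
        """Map each pixel of an RGB image (H x W x 3) to one of the flat
        colour regions, or to background (label 0) if no region matches."""
        h, w, _ = image.shape
        labels = np.zeros((h, w), dtype=np.uint8)
        for idx, (ref, tol) in enumerate(MARKER_COLOURS.values(), start=1):
            # A pixel belongs to a region when every channel lies within
            # the tolerance band around the reference colour.
            mask = np.all(np.abs(image.astype(int) - ref) <= tol, axis=-1)
            labels[mask] = idx
        return labels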



2.6 Modified K-Means Algorithm for Noise Filtering

When the data has been pre-processed, the Modified K-means algorithm is used to clean up the noise embedded in each image by creating clusters of pixels based on their relative spatial positions in the image. Following the classical K-means algorithm [27, 28, 37–43], the Euclidean distance measure shown in (1) is used to determine which cluster a pixel belongs to. Each pixel is put into the cluster that yields the minimum Euclidean distance between the pixel and the respective centroid. The centroid of each cluster is changed iteratively, by calculating its new coordinate as the average of the coordinates of the pixels in the cluster, until it converges to a stable coordinate with a stable set of member pixels. In each iteration, the membership of each cluster keeps changing, depending on the result of the Euclidean distance calculation of each pixel against the new centroid coordinates:

$d_{ic} = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}$,    (1)

where:

$d_{ic}$: the Euclidean distance between pixel i and centroid c
$x_i, y_i$: the 2D coordinates of pixel i
$x_c, y_c$: the 2D coordinates of centroid c
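For concreteness, one iteration of this basic K-means step might be sketched as follows in Python. The function name and array layout are our own choices; this is a sketch, not the implementation used in the chapter.

    import numpy as np

    def kmeans_iteration(points, centroids):
        """One basic K-means iteration: assign each pixel to its nearest
        centroid under the Euclidean distance of (1), then recompute each
        centroid as the mean of its member pixels.  `points` is an (N, 2)
        array of pixel (x, y) coordinates; `centroids` is a (K, 2) array."""
        # d_ic = sqrt((x_i - x_c)^2 + (y_i - y_c)^2) for every pixel/centroid pair
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        membership = dists.argmin(axis=1)
        new_centroids = np.array([
            points[membership == c].mean(axis=0) if np.any(membership == c)
            else centroids[c]              # keep an empty cluster's centroid
            for c in range(len(centroids))
        ])
        return membership, new_centroids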

The modifications to the classical K-means algorithm lie in the definition of a data vis-a-vis noise cluster and in automating the determination of the optimum number of clusters an image should have. A cluster is considered noise if it contains only a few pixels. The minimum number of pixels in a cluster, or the cluster size, should be set such that it minimizes the degree of false positives (i.e. data clusters incorrectly classified as noise) and false negatives (i.e. noise clusters incorrectly classified as data). The minimum cluster size is domain specific and is determined by observing the number of data points usually found in a noise cluster for the type of data at hand. In this experiment, the minimum number of pixels in a cluster was set to 4 after a few trial-and-error runs.

The compactness of a cluster is used to determine the optimum number of clusters for a given image. In this paper, the degree of compactness of a cluster is defined as the number of pixels occupying the region of the rectangle formed by the pixels located at the outermost positions of the cluster (i.e. the pixels that have the maximum and minimum X and Y coordinates, respectively). A cluster that has a lower degree of compactness than the specified value will be split further. In this experiment, the degree of compactness used is 20%, which is a value just below the minimum compactness of valid data clusters for the observed domain.
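A minimal sketch of this compactness measure, under the bounding-rectangle definition given above (the function name is ours):

    import numpy as np

    def cluster_compactness(points):
        """Degree of compactness: the fraction of the cluster's bounding
        rectangle that is occupied by its pixels.  `points` is an (N, 2)
        array of integer pixel coordinates."""
        (x_min, y_min), (x_max, y_max) = points.min(axis=0), points.max(axis=0)
        window_size = (x_max - x_min + 1) * (y_max - y_min + 1)
        return len(points) / window_size

    # A cluster whose compactness falls below the domain-specific threshold
    # (20% in these experiments) is considered sparse and is split further.
    MIN_COMPACTNESS = 0.20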

The Modified K-means algorithm performs a local search using randomly generated initial centroid positions. It is a known problem that the determination of the initial centroid positions plays a big part in the resulting clusters



and their compositions [29, 38, 44–47]. In order to reduce this problem and to make the search mechanism somewhat more exhaustive, ten clustering exercises using ten different initial centroid positions are performed for each image. The result of the exercise that produces clusters with the maximum total degree of compactness is selected. If a set of data cannot be separated linearly, we discard the run and initiate the algorithm again with different initial cluster positions. The processed data is finally plotted, in order to allow easy inspection of the results.

A detailed overview of the Modified K-means algorithm is presented in Table 1.

Table 1. Modified K-means algorithm for noise reduction in optical motion capture data

Procedure: Modified K-means algorithm for noise reduction in optical motion capture data

    Set minimum number of data points per cluster   // cluster size constraint
    Set minimum cluster compactness                 // cluster compactness constraint

    For a set number of experiments do
        Set initial cluster centroids
        Set iterationFlag to yes
        While iterationFlag = yes do
            Set iterationFlag to no

            // Basic K-means
            Repeat
                Calculate the distance between data points and each cluster centroid
                Assign each data point to a cluster
                Calculate the new cluster centroids
            Until all clusters have converged

            // Filter clusters based on minimum cluster size constraint
            For each cluster
                If cluster has too few data points then
                    Delete cluster
                End if
            End for

            // Filter clusters based on cluster compactness constraint
            For each cluster
                // Find corners of compactness window
                Find data points with minimum and maximum X values
                Find data points with minimum and maximum Y values
                Define cluster compactness window size
                Calculate the number of data points in cluster
                Calculate cluster compactness = number of data points / compactness window size
                If cluster compactness < minimum compactness then
                    Split cluster into two
                    Set iterationFlag to yes
                Else
                    Record cluster compactness
                    Remove cluster and content from analysis
                End if
            End for

            If iterationFlag = no then
                Calculate the average compactness of all clusters in the experiment
            End if
        End while
    End for

    Select set of clusters from experiment with the highest average compactness
End Procedure
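Read together with Table 1, the whole procedure might be rendered as the following Python sketch. It reuses the kmeans_iteration and cluster_compactness helpers sketched earlier. The strategy for splitting a sparse cluster (seeding two perturbed centroids at its mean) and the way the remaining points are re-clustered are our assumptions, since the chapter does not spell them out.

    import numpy as np

    MIN_CLUSTER_SIZE = 4     # minimum data points per cluster (Sect. 2.6)
    MIN_COMPACTNESS = 0.20   # minimum cluster compactness (Sect. 2.6)

    def modified_kmeans(points, k=10, n_experiments=10, rng=None):
        """Modified K-means noise filter, following Table 1.  Runs the
        clustering from `n_experiments` random initialisations and keeps
        the run whose accepted clusters have the highest average
        compactness.  Returns a list of clusters (arrays of pixel coords)."""
        rng = rng if rng is not None else np.random.default_rng()
        best_clusters, best_score = [], -1.0
        for _ in range(n_experiments):
            pts = points
            centroids = pts[rng.choice(len(pts), size=min(k, len(pts)), replace=False)]
            accepted, iterate = [], True
            while iterate:
                iterate = False
                while True:                      # basic K-means until convergence
                    membership, new_centroids = kmeans_iteration(pts, centroids)
                    if np.allclose(new_centroids, centroids):
                        break
                    centroids = new_centroids
                clusters = [pts[membership == c] for c in range(len(centroids))]
                # Size constraint: clusters with too few points are noise
                clusters = [c for c in clusters if len(c) >= MIN_CLUSTER_SIZE]
                sparse, next_centroids = [], []
                for cluster in clusters:
                    if cluster_compactness(cluster) < MIN_COMPACTNESS:
                        # Split a sparse cluster by seeding two perturbed
                        # centroids (our choice; not specified in the chapter)
                        mid = cluster.mean(axis=0)
                        next_centroids += [mid - 1.0, mid + 1.0]
                        sparse.append(cluster)
                        iterate = True
                    else:
                        accepted.append(cluster)  # record and remove from analysis
                if iterate:
                    pts = np.vstack(sparse)
                    centroids = np.array(next_centroids)
            score = (np.mean([cluster_compactness(c) for c in accepted])
                     if accepted else -1.0)
            if score > best_score:               # keep the most compact run
                best_score, best_clusters = score, accepted
        return best_clusters

The final selection step mirrors the bottom of Table 1: only the restart whose surviving clusters are, on average, the most compact is kept.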

3 Experimental Results

In this section we present the results of experiments on pre-processing and intra-frame noise filtering in images captured from an articulated human body rigged with sets of Illuminated Contour-Based Markers.

3.1 Recognizing Coloured Line Segments

At present, we have separated five of the ten different types of glow wire available on the market into distinct flat color regions in pre-processing, allowing ten different markers to be constructed. These recognized wires are classified as Red, Orange, Green, Purple and Blue. Each of the remaining five wires appears to have color attributes so similar to those of a number of the other nine that they are hard to separate. The separation problem is a result of the sensing devices across cameras being slightly different, which makes it necessary to employ an unnaturally wide color



threshold for each color in order to support successful classification across cameras. This in turn makes the color space prematurely crowded, leaving no room for the remaining five unclassified line segments.

3.2 Noise Filtering

Five types of experiments have been performed on the Modified K-means algorithm. The first experiment tests the algorithm's ability to remove synthetic spike noise from raw motion capture images. The second aims to find the algorithm's tolerated spike noise level; this is done by presenting the algorithm with images containing different levels of real spike noise and analyzing the output. The third tests how well the algorithm deals with real noise with different Gaussian blur radii. This experiment is conducted in order to estimate the algorithm's ability to remove noise with different Gaussian characteristics. The fourth type of experiment is a set of comparisons between a commercially available Median filter [48], which is used for reducing noise in images, and the proposed Modified K-means algorithm [4]. Finally, it is shown that the proposed Modified K-means algorithm can also be used to remove noise in images with Classical spherical markers.

Removing Synthetic and Real Spike Noise

In the first experiment, an image with spurious artificial spike noise was cleaned. The result of this experiment can be observed in Fig. 6, where the noisy image is shown at the top (noisy pixels are encircled) and the cleaned version at the bottom. Here the white pixels represent the background, while the black pixels represent the components of the Illuminated Contour-Based Marker System and noise.

Three of the images used in the second experiment, which involves finding the Modified K-means algorithm's spike noise level tolerance, are shown in Fig. 7. Here the leftmost image has 0%, the middle 8% and the rightmost 16% real spike noise (image contrast is increased in order to allow for easy inspection).

Figure 8 shows the results of the experiment on real spike noise. The number of cleaned data points is displayed vertically, while the noise level is displayed horizontally as a percentage. One can observe that more than fifty percent of the original data points are still classified correctly at a noise level of 8%, while the algorithm proved to remove noise effectively in images with noise levels up to 12%.

Removing Gaussian Noise

In this experiment, Gaussian blur with varying radii is introduced to several copies of the noisy image at the top of Fig. 6, before the Modified K-means algorithm is used to clean the images. In Fig. 9, three of the processed images are presented (the leftmost image has a Gaussian blur pixel radius of 0, the middle a radius of 2, and the rightmost 4).



Fig. 6. Top: A pre-processed motion capture image in which noise in the form of irregular lighting can be observed. Bottom: The resulting cleaned image with noise removed

Fig. 7. Images with Illuminated Contour-Based Markers and spike noise of 0%, 8% and 16%

Figure 10 shows how much data can be recaptured after noise with Gaussian characteristics has been removed. One can observe that the number of data points recaptured naturally decreases as the radius of the Gaussian blur increases. However, it is also shown that the degradation of performance occurs gradually, as opposed to abruptly, when the radius is increased up to 2.5 pixels. For this reason, it can be concluded that the Modified



Fig. 8. Results from the experiment on images with Illuminated Contour-Based Markers and spike noise of 0, 4, 8 and 12%

Fig. 9. Flattened images with Gaussian blur of 0, 2 and 4 pixels in radius, before noise is removed

Fig. 10. Cleaned data points recaptured after the removal of Gaussian blur noise with varying radii using the Modified K-means

K-means is capable of removing noise with Gaussian characteristics while keeping false positives to a minimum. This result is better than the performance of the Mean and Median filters, which are well known to only suppress (i.e. reduce) Gaussian noise rather than remove it [31].



3.3 Comparisons: Modified K-Means vs. Median Filter

Two types of comparisons have been conducted, both between a commercially available Median filter [48] that is used for reducing noise in images and the proposed Modified K-means algorithm [4].

Spike Noise Removal Comparisons

In Fig. 11, one can observe the results of an experiment in which the two algorithms' ability to remove spike noise is analyzed. The level of spike noise is incrementally increased by 4% across four runs, starting at 0. The ideal number of data points after noise filtering is 747. All data is initially pre-processed. One can observe that the number of recaptured data points is lower for the Median filter in all test runs. This indicates that the Modified K-means algorithm removes spike noise with a lower number of false positives than the Median filter. This indication is verified in Fig. 12, where the number of false positives

Fig. 11. Recaptured data after Spike noise filtering

Fig. 12. Number of false positives in Spike noise experiments



across the same four runs is presented. One can observe a strong correlation between the increasing number of false positives and the level of spike noise. The number of false negatives was zero across all runs.

Gaussian Noise Removal Comparisons

In Fig. 13, results are presented from an experiment on a series of motion capture images with noise and increasing levels of Gaussian blur. The Gaussian blur pixel radius is increased incrementally by 0.5 pixels across 8 runs, starting at a radius of 0 pixels. One can observe a close correlation between the performance of the Modified K-means algorithm and the Median filter as the blur level increases. One can also observe that the number of correctly recaptured clean data points decreases gradually as the Gaussian blur radius increases.

Figure 14 shows how the number of false positives increases as the Gaussian blur pixel radius becomes greater. One can observe a strong correlation between the results of the Modified K-means algorithm and the Median filter here as well. The number of false positives is still below fifty percent of the total number of data points when the Gaussian blur radius is at 2 pixels.

In Fig. 15, one can observe the number of false negatives in the same experiments. The number of false negatives peaks at a Gaussian blur radius of 0.5 pixels for both the Median filter and the Modified K-means algorithm. This peak is at the same point where the number of false positives is at its lowest.

Fig. 13. Number of recaptured data points after images with noise and varying levels of Gaussian blur have been cleaned



Fig. 14. Number of false positives as the Gaussian blur radius increases

Fig. 15. Number of false negatives as the level of Gaussian blur increases

3.4 Removing Noise in Images with Spherical Markers

The Modified K-means algorithm has also been tested on images with synthetic Classical ball-style markers; these experiments show that the proposed algorithm is also capable of cleaning this type of data. An illustration of one of the results is given in Fig. 16, where the original image is presented on the left and the processed image on the right.

3.5 Processing Time

It is important to note that processing time increases with each additional cluster centroid needed to analyze a dataset. Experiments show that if the level of noise is at 16% or above (this number depends on the color composition of the noise at hand and the threshold values set for each marker component in pre-segmentation), the calculation time becomes so great (when using one Pentium 4 processor) that the noise cleaning becomes impractical.



Fig. 16. Left: A raw image generated from a synthetic ball marker. Right: The image with noise removed

This problem can be dealt with in three ways. The first is to ensure that the capturing sensors and the tools used for data transfer introduce the lowest possible level of noise. The second method, which only partially solves the problem, is to increase the value of the minimum-number-of-data-points-per-cluster constraint, such that more noisy data points can be removed from the dataset using a smaller number of cluster centroids. Here, it is important to note that when the constraint value becomes greater than the number of data points usually clustered together in valid data, the number of false positives will increase. The third method is to increase processing power.

4 Conclusion

A set of Illuminated Contour-Based Markers for optical motion capture has been presented, along with a Modified K-means algorithm that can be used for removing inter-frame noise. The new markers appear to have features that solve and/or reduce several of the drawbacks associated with other marker systems currently available for optical motion capture. Some of these features are:

• Missing data can be estimated both inter-frame and intra-frame, which reduces the chance of complete marker occlusions without increasing the number of cameras used.

• The system is robust to changes in external lighting compared to markers that do not produce their own internal light.

• Markers can be automatically identified in one single image.
• The system eliminates the need to synchronize camera shutters with marker flashes and therefore allows for tracking without wiring the markers to a complex computer.

• The system has the potential to generate more markers than systems that use only a single color for marker identification.

In the Modified K-means algorithm, the modifications to the Classical K-means algorithm take the form of constraints on the compactness and



the number of data points per cluster. Here, clusters with a small number of data points are regarded as noise, while sparse clusters are split further. The value of the minimum-number-of-data-points-per-cluster constraint is domain specific and is determined by observing the number of data points usually found in a noise cluster for the type of data at hand. The value of the minimum compactness constraint should be set just below the minimum compactness of valid data clusters for the domain. Several experiments have been conducted on the noise filtering algorithm, and these show that flattening the images into six color regions in the data pre-processing stage assists further processing by reducing the number of dimensions the algorithm must cluster. The experiments also indicate that the Modified K-means algorithm:

• Manages to clean artificial and real spike noise in motion capture images with Illuminated Contour-Based Markers or Classical spherical markers when the noise level is up to 12%.

• Is capable of completely removing Gaussian noise, with a gradual increase in false positives as the radius increases. This is a better result than that produced by traditional Median and Mean filters.

• Reduces spike noise in images with Illuminated Contour-Based Markers with fewer false positives than the Median filter is capable of.

• Reduces Gaussian blur in images with Illuminated Contour-Based Markers with a similar number of false positives to the Median filter.

5 Future Work

A suitable algorithm for automatic marker midpoint estimation is currently being constructed. When a complete set of experiments has been conducted, future research will involve investigating a color calibration method that aims to synchronize the input from the capturing cameras, in order to allow more markers with distinctive color combinations to be generated. This calibration procedure will involve comparing the color values registered for the same object across cameras. Through the use of knowledge obtained through these comparisons, a correction matrix can be generated to guide the synchronization of input from different cameras. This synchronization process may in turn allow smaller regions of the color space to be assigned to the classification of each marker component, resulting in a less crowded color space. This optimized use of the color space may make room for new distinctive regions within it, which can be used to classify more of the ten glow wires currently available on the market. It may also prove fruitful to research the use of a color space that has a separate channel for luminosity (such as Absolute RGB or HSV), so that luminosity information can be removed from further analysis. The benefit would be that the color values registered for each glow wire would be more stable as the distance between wires and cameras changes. This may in turn allow smaller



regions of the color space to be associated with each wire, allowing further optimization of the color space separation.

When the above is completed, the research focus will be on investigating methods that allow for automatic 2D-to-3D conversion of marker coordinates, before focus is shifted to researching and implementing techniques that allow a virtual skeleton to be fitted to incoming motion data and tracked over time. Finally, ideal motion models will be captured and the intelligent motion recognition system designed, before the second major research phase (which involves constructing the Multipresence system described by the author in [1, 2]) is initiated.

References

1. Barca J C, Li R (2006) Augmenting the Human Entity through Man/Machine Collaboration. In: IEEE International Conference on Computational Cybernetics. Tallinn, pp 69–74
2. Barca J C, Rumantir G, Li R (2008) A Concept for Optimizing Behavioural Effectiveness & Efficiency. In: Machado T, Patkai B, Rudas J (eds) Intelligent Engineering Systems and Computational Cybernetics. Springer, Berlin Heidelberg New York, pp 477–486
3. Barca J C, Rumantir G, Li R (2006) A New Illuminated Contour-Based Marker System for Optical Motion Capture. In: IEEE Innovations in Information Technology. Dubai, pp 1–5
4. Barca J C, Rumantir G (2007) A Modified K-means Algorithm for Noise Reduction in Optical Motion Capture Data. In: 6th IEEE International Conference on Computer and Information Science. Melbourne, pp 118–122
5. Jobbagy A, Komjathi L, Furnee E, Harcos P (2000) Movement Analysis of Parkinsonians. In: 22nd Annual EMBS International Conference. Chicago, pp 821–824
6. Tanie H, Yamane K, Nakamura Y (2005) High Marker Density Motion Capture by Retroreflective Mesh Suit. In: IEEE International Conference on Robotics and Automation. Barcelona, pp 2884–2889
7. Bachmann E (2000) Inertial and Magnetic Tracking of Limb Segment Orientation for Inserting Humans into Synthetic Environments. PhD thesis, Naval Postgraduate School
8. Clarke A, Wang X (1998) Extracting High Precision Information from CCD Images. In: Optical Methods and Data Processing for Heat and Fluid Flow. City University, pp 1–11
9. Owen S (1999) A Practical Approach to Motion Capture: Acclaim's Optical Motion Capture System. Retrieved Oct 2, 2005. Available at www.siggraph.org/education/materials/HyperGraph/animation/character_animation/motion_capture/motion_optical
10. Sabel J (1996) Optical 3D Motion Measurement. In: IEEE Instrumentation and Measurement Technology. Brussels, pp 367–370
11. Oshita M (2006) Motion-Capture-Based Avatar Control Framework in Third-Person View Virtual Environments. In: ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (ACE '06). New York



12. Furnee E (1988) Motion Analysis by TV-Based Coordinate Computing in Real Time. In: IEEE Engineering in Medicine and Biology Society's 10th Annual International Conference. p 656
13. Bogart J (2000) Motion Analysis Technologies. In: Pediatric Gait: A New Millennium in Clinical Care and Motion Analysis Technology. pp 166–172
14. Shaid S, Tumer T, Guler C (2001) Marker Detection and Trajectory Generation Algorithms for a Multicamera Based Gait Analysis System. In: Mechatronics 11: 409–437
15. LeTournau University (2005) LeTournau University. Retrieved Nov 30, 2005. Available at www.letu.edu
16. Kirk G, O'Brien F, Forsyth A (2005) Skeletal Parameter Estimation from Optical Motion Capture Data. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). pp 782–788
17. Brill F, Worthy M, Olson T (1995) Markers Elucidated and Applied in Local 3-Space. In: International Symposium on Computer Vision. p 49
18. Wren C, Azarbayejani A, Darrel T, Pentland A (1997) Pfinder: Real-Time Tracking of the Human Body. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 19: 780–785
19. Arizona State University (2006) Arizona State University. Retrieved Apr 27, 2006. Available at www.asu.edu
20. Ringer M, Durmond T, Lasenby J (2001) Using Occlusions to Aid Position Estimation for Visual Motion Capture. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'01). pp 464–469
21. Kawano T, Ban Y, Uehara K (2003) A Coded Visual Marker for Video Tracking System Based on Structured Image Analysis. In: 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality. Washington, p 262
22. Fioretti S, Leo T, Pisani E, Corradini L (1990) A Computer Aided Movement Analysis System. In: IEEE Transactions on Biomedical Engineering 37: 812–891
23. Tekla P (1990) Biomechanically Engineered Athletes. In: IEEE Spectrum 27: 43–44
24. Zhuang Y, Zhu Q, Pan Y (2000) Hierarchical Model Based Human Motion Tracking. In: International Conference on Image Processing. Vancouver, pp 86–89
25. Kang J, Cohen I, Medoni G (2003) Continuous Tracking Within and Across Camera Streams. In: IEEE Conference on Computer Vision and Pattern Recognition. Wisconsin, pp 267–272
26. Sherrah J, Gong S (2000) Tracking Body Parts Using Probabilistic Reasoning. In: 6th European Conference on Computer Vision. Dublin
27. Tatsunokuchi, Ishikawa, Minyang, Sichuan (2004) An Evolutionary K-means Algorithm for Clustering Time Series Data. In: 3rd International Conference on Machine Learning and Cybernetics. Shanghai, pp 1282–1287
28. Chen H, Kasilingam D (1999) K-Means Classification Filter for Speckle Removal in Radar Images. In: Geoscience and Remote Sensing Symposium. Hamburg, pp 1244–1246
29. Lee H, Younan H (2000) An Investigation into Unsupervised Clustering Techniques. In: IEEE SoutheastCon. Nashville, pp 124–130
30. Pham L (2002) Edge-Adaptive Clustering for Unsupervised Image Segmentation. In: International Conference on Image Processing. Vancouver, pp 816–819
31. Trucco E, Verri A (1998) Introductory Techniques for Computer Vision. Prentice Hall, New Jersey



32. Zheng K, Zhu Q, Zhuang Y, Pan Y (2001) Motion Processing in Tight-Clothing Based Motion Capture. In: Robot Vision. Auckland, pp 1–5
33. ZuWhan K (2001) Multi-View 3-D Object Description with Uncertain Reasoning and Machine Learning. PhD thesis, Faculty of the Graduate School
34. Elec2go (2006) Elec2go. Retrieved July 30, 2006. Available at www.elec2go.com.au/index.htm
35. Vanier L, Kaczmarski H, Chong L, Blackburn B, Williams M, Velder A (2003) Connecting the Dots: The Dissection of a Live Optical Motion Capture Animation Dance Performance. Available at www.isl.uiuc.edu/Publications/final20dance1.pdf
36. Furnee E (1988) Speed, Precision and Resolution of a TV-Based Motion Analysis Computer. In: 10th IEEE Engineering in Medicine and Biology Society. p 656
37. Kanungo T, Netanyahu N, Wu A (2002) An Efficient K-Means Clustering Algorithm: Analysis and Implementation. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. pp 881–892
38. Witten I, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Publishers, San Francisco
39. Jain K, Dubes E (1988) Algorithms for Clustering Data. Prentice Hall, New Jersey
40. Jain A, Murty M, Flynn P J (1999) Data Clustering: A Review. In: ACM Computing Surveys 31: 264–323
41. Kaufman L, Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York
42. Hasegawa S, Imai H, Inaba M, Katoh N, Nakano J (1993) Efficient Algorithms for Variance Based Clustering. In: 1st Pacific Conference on Computer Graphics and Applications. Seoul, pp 75–89
43. Abche A, Tzanakos G, Tzanakou E (1992) A New Method for Multimodal 3-D Image Registration with External Markers. In: Medicine and Biology Society 14: 1881–1882
44. Bacao F, Lobo V, Painho M (2005) Self-Organizing Maps as Substitutes for K-Means Clustering. Springer, Berlin Heidelberg New York, pp 476–483
45. Chimphlee W, Abdullah A, Sap M, Chimphlee S, Srinoy S (2005) Unsupervised Clustering Methods for Identifying Rare Events in Anomaly Detection. In: Transactions on Engineering, Computing and Technology 8: 253–258
46. Milligan G W (1980) An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms. In: Psychometrika 45: 325–342
47. Su T, Dy J (2004) A Deterministic Method for Initializing K-means Clustering. In: 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004). pp 784–786
48. DirectXtras.Inc (2003) DirectExtras. Retrieved Apr 27, 2006. Available at www.asu.edu


Toward Effective Processing of Information Graphics in Multimodal Documents: A Bayesian Network Approach

Sandra Carberry1 and Stephanie Elzer2

1 Department of Computer Science, University of Delaware, Newark, [email protected]

2 Department of Computer Science, Millersville University, Pennsylvania, [email protected]

Summary. Information graphics (non-pictorial graphics such as bar charts and line graphs) are an important component of multimodal documents. When information graphics appear in popular media, such as newspapers and magazines, they generally have a message that they are intended to convey. This chapter addresses the problem of understanding such information graphics. The chapter presents a corpus study that shows the importance of taking information graphics into account when processing a multimodal document. It then presents a Bayesian network approach to identifying the message conveyed by one kind of information graphic, simple bar charts, along with an evaluation of the graph understanding system. This work is the first (1) to demonstrate the necessity of understanding information graphics and taking their communicative goal into account when processing a multimodal document and (2) to develop a computational strategy for recognizing the communicative goal or intended message of an information graphic.

1 Introduction

Most documents are multimodal – that is, they consist of both text and graphics. However, document processing research, including work on the summarization, storage, and retrieval of documents, as well as automated question-answering, has focused almost entirely on an article's text; information graphics, such as bar charts and line graphs, have been ignored. We contend that information graphics play an important communicative role in multimodal documents, and that they must be taken into account in summarizing and indexing the document, in answering questions from stored documents, and in providing alternative access to multimodal documents for individuals with sight impairments.

This chapter has two objectives: (1) to demonstrate, via corpus studies, the necessity of understanding information graphics and taking their communicative goal into account when processing a multimodal document, and (2) to






present a computational strategy for recognizing the communicative goal or intended message of one class of information graphics: simple bar charts. Our work is the first to produce a system for understanding information graphics that have an intended message (as opposed to graphics that are only intended to present data). Since the message identified by our system can serve as a brief summary of an information graphic, our research provides the basis for taking information graphics into account when processing multimodal documents.

The chapter is organized as follows. Section 2 relates our work to other research efforts. Section 3 discusses the importance of information graphics and presents the aforementioned corpus studies, along with two important applications that require analyzing and understanding information graphics. Section 4 presents an overview of our Bayesian network approach for recognizing the message conveyed by a simple bar chart, along with evaluation experiments that demonstrate the system's effectiveness, and Sect. 5 discusses problems that must be addressed to handle the full range of multimodal documents. Although our computational strategy for recognizing the intended message of an information graphic is currently limited to simple bar charts, we believe that the general approach is extendible to other kinds of information graphics.

2 Related Work

Researchers have investigated the generation of information graphics and their captions in multimodal documents [10, 19, 22]. In graphics generation, the system is given a communicative goal and must construct a graphic that achieves that goal. For example, the AutoBrief system [19] identifies the perceptual and cognitive tasks that a graphic must support and uses a constraint satisfaction algorithm to design a graphic that facilitates these tasks as much as possible, subject to the constraints of competing tasks. In this context, perceptual tasks are ones that can be accomplished by viewing a graphic, such as comparing two bars in a bar chart to determine which is taller; cognitive tasks are ones that require a mental computation, such as interpolating between the values assigned to two tick marks on the dependent axis in order to compute the exact value for the top of a bar in a bar chart. Our problem is the reverse of graphics generation – we are given a graphic and must extract the communicative signals present in the graphic and use them to reason backwards about the graphic's intended message.

Yu, Hunter, Reiter, and Sripada [40, 41] used pattern recognition techniques to summarize interesting features of time series data from a gas turbine engine. However, the graphs were automatically generated displays of the data points and did not have any intended message. Futrelle and Nikolakis [17] developed a constraint grammar for parsing vector-based visual displays, and Futrelle is extending this work to construct a graphic that is a simpler form



of one or more graphics in a document [16]. However, the end result is itself a graphic, not a representation of the graphic's intended message. Our work is the first to address the understanding of an information graphic, with the goal of processing multimodal documents.

Much effort has been devoted to the processing of images. Bradshaw [3] notes that work on image retrieval has progressed from systems that retrieved images based on low-level features such as color, texture, and shape to systems which attempt to classify and reason about the semantics of the images being processed. This includes systems that attempt to classify images according to attributes such as indoor/outdoor, city/landscape, and man-made/artificial. Srihari, Zhang, and Rao [35] examined text-based indexing techniques for the caption and any collateral (accompanying) text, combined with image-based techniques. Their work demonstrated the ineffectiveness of text-based methods alone; they provide the example of a search for pictures of Clinton and Gore, which produced a final set of 547 images, yet manual inspection showed that only 76 of these images actually contained pictures of Clinton or Gore! Their work demonstrates, however, that when combined with image-based retrieval techniques, the collateral text can provide a rich source of evidence for improving the information retrieval process. Image retrieval work is nonetheless much different from our research, in that image retrieval is concerned with the semantics of images, such as "President Bush at the White House" or "an elephant on the plains of Africa", whereas we are concerned with recognizing the communicative goal or intended message of an information graphic.

3 The Importance of Understanding Information Graphics

Information graphics are non-pictorial graphics, such as bar charts and line graphs, that display attributes of entities and relations among entities. Although some information graphics are only intended to display data [40, 41], the overwhelming majority of information graphics that appear in popular media, such as newspapers, magazines, and reports, have a message that they are intended to convey. For example, the information graphic in Fig. 1 ostensibly is intended to convey the changing trend in optimism by small businesses. Clark [8] has argued that language consists of any deliberate signal that is intended to convey a message. Under this definition, language includes not only text and utterances, but also hand signals, facial expressions, and even information graphics. Thus, we view information graphics as a form of language with a communicative goal.



Fig. 1. A simple bar chart from Business Week, titled "Small Businesses: Still Upbeat", showing the percent of companies optimistic about the U.S. economy for '02 through '06

3.1 Can Information Graphics be Ignored?

The question arises as to whether information graphics repeat portions of the textual content of a multimodal document and thus can be ignored. Consider the information graphic in Fig. 1. It appeared in a short (1/2 page) Business Week article entitled Upstarts Plan to Keep On Spending. Although the graphic's message is that there is a changing trend (from falling to rising) in the number of small business companies optimistic about the US economy, this message is not captured by the article's text. The only part of the accompanying article that comes close to the graphic's message is the following paragraph:

"A PricewaterhouseCoopers first-quarter survey, which ran from late February to May, showed 76% of the fast-growing small businesses – averaging an annual growth rate of about 25% – said they were optimistic about the US economy for the coming year."

But nowhere in the article is the current optimism contrasted with the situation a few years earlier. Moreover, the article contrasts the PricewaterhouseCoopers survey with a survey by the National Federation of Independent Business (NFIB); the changing trend in the graphic, although not mentioned in the article's text, is relevant to reconciling the differences in the two surveys. We observed that the same phenomenon occurred even with more complex graphics and longer articles. For example, consider the two graphics in Fig. 2¹ that appeared in a 1999 Business Week article that was six pages in length, of which approximately four pages were text; the article was entitled "A Small Town Reveals America's Digital Divide". Both graphics are grouped bar charts. The message of the leftmost graphic is twofold: at all income levels, rural areas lag behind urban areas in terms of US households with Internet access, and the percent of US households with Internet access increases with

1 The composite graphic contained three grouped bar charts. For reasons of space, only two are displayed here. The omitted graphic was a simple (not grouped) bar chart addressing the relationship between race and Internet access.



Fig. 2. Two grouped bar charts from Business Week, under the overall title "Wired America: White, Urban, and College-Educated", showing the percent of U.S. households with Internet access by income bracket ($10,000–14,999 up to $75,000 plus) and by education level (elementary up to B.A. or more), with separate bars for rural and urban households

income level. The message of the rightmost graphic is similar: at all education levels, rural areas lag behind urban areas in terms of US households with Internet access, and the percent of US households with Internet access increases with education level. Although this article explicitly refers to these graphics with the reference "(chart, page 191)", the text still fails to capture the graphics' messages. The segments of the accompanying article that come closest to the graphics' messages are the following:

"Blacksburg² reinforces fears that society faces a digital divide of enormous breadth (chart, page 191). Blacksburg is the most wired town in the nation. Over the span of only five years, more than 85% of its 86,000 residents, including 24,000 students at Virginia Tech, have gone online – far above the 32.7% national average. By contrast, in the region surrounding Blacksburg, only some 14% are connected to the Net."

"In Christiansburg³, nearly one-third of adults have no high school diploma and only 17% have college degrees – vs. 61% in Blacksburg."
"Price⁴ frequently gets frustrated at the second-class connectivity he has as a result of where he lives, the family's income, and his lack of computer skills."

But none of these text segments convey the relationship between income/education and Internet access, or that Internet access in rural areas lags behind that in urban areas even for households with the same income and education level, both of which are captured by the graphics. Furthermore, the reader is expected to connect the much lower income and education levels in rural

2 Blacksburg, Virginia is an urban area in Virginia and is the location of Virginia Tech, a large university.

3 Christiansburg is a rural town in Virginia.
4 Price is the last name of a rural Virginia resident interviewed for the article.



Fig. 3. A line graph from USA Today, titled "Worth a million", showing the number of resale condos sold for more than $1 million in Florida, by quarter, from 2003 through '05 (with labelled values of 49 and 247)

areas (conveyed by the text) with the correlations between income/education and Internet access (conveyed by the information graphics), and make the inference that a much lower percentage of rural residents have Internet access than urban residents. This conclusion is central to the article's overall purpose of conveying the digital divide between rural and urban America.

Newspapers, as well as magazines, often rely on the reader to integrate the messages of an article's information graphics into what is conveyed by the text. For example, Fig. 3 displays an information graphic taken from a 2005 USA Today article entitled "Miami condo market sizzling". The graphic's message is ostensibly that the rising trend in the number of Florida resale condos sold for more than a million dollars has risen even more sharply between 2004 and 2005. But nowhere does the article talk about the price of resale condos. The text segment closest to the graphic's message only addresses the price of new condos:

"In Miami Beach and other communities, one-bedroom units in new oceanfront projects start at close to $500,000 and run into the millions."

Yet once again, the reader must recognize the message of the information graphic and integrate it with the communicative goals of the article's text in order to fully understand the article.

These observations lead to the hypothesis that information graphics cannot be ignored in processing multimodal documents. We conducted a corpus study to determine the extent to which the intended message of an information graphic in popular media is repeated in the article's text. We examined 100 randomly selected graphics from a variety of magazines, such as Newsweek, Time, Fortune, and Business Week, and from both local and national newspapers; the corpus of graphics included simple bar charts, grouped bar charts, line graphs, multiple line graphs, and a few pie charts, and the accompanying articles ranged from very short (less than half a page) to long (more than 2



Fig. 4. How often is a graphic's message repeated in the accompanying article? (A pie chart over four categories: Category A, the article's text fully conveys the graphic's message; Category B, the text mostly conveys it; Category C, the text conveys a little of it; Category D, the text conveys none of it. Segment sizes are 17%, 35%, 22%, and 26%)

magazine-length pages). We identified the text segments most closely related to each graphic's message and placed each graphic in one of four categories, depending on the extent to which the graphic's message was captured by the article's text, as shown in Fig. 4. In 39% of the instances in our corpus (Categories A and B), the text was judged to fully or mostly convey the message of the information graphic. For the remaining 61% of the graphics (Categories C and D), the text was judged to convey little or none of the graphic's message. Thus, since information graphics in popular media do not just repeat portions of an article's text, they cannot be ignored in processing multimodal documents.

It is interesting to contrast the use of information graphics in popular media with their use in scientific articles. The text of a scientific article generally explicitly refers to each information graphic and summarizes its message. For example, the above paragraph explicitly referred to Fig. 4 and summarized its contribution, namely that the message of an information graphic appearing in popular media is often not repeated in the article's text. However, in popular media, explicit references to information graphics are not the norm; neither of the graphics in Figs. 1 and 3 was explicitly referenced in its accompanying article. And as illustrated by the graphics in Fig. 2, even when the article refers to the graphic, it might not summarize the graphic's message.

3.2 How Useful are Naturally Occurring Captions?

Given that information graphics in a multimodal document cannot be ignored, perhaps the graphic's caption can be relied on to capture the graphic's intended message. Unfortunately, captions are of limited utility in automating the understanding of information graphics. In conjunction with their work on generating information graphics, Corio and Lapalme [10] analyzed the captions of information graphics in order to devise rules for generating them.



Fig. 5. Does a graphic's caption capture its intended message? (A pie chart over four categories: Category A, the caption captures the intention (mostly); Category B, the caption captures the intention (somewhat); Category C, the caption hints at the intention; Category D, the caption makes no contribution to the intention. Category C accounts for 7% and Category D for 44%; the remaining segments are 34% and 15%)

However, they found that captions are often very general. We conducted our own corpus study with two objectives:

1. To identify the extent to which a graphic's caption captures the graphic's intended message

2. To determine whether a general-purpose natural language system would encounter any problems in parsing and understanding captions

We compared the intended message⁵ of 100 bar charts with the graphic's caption. Each graphic was placed into one of four categories, as shown in Fig. 5. In slightly more than half the instances (Categories C and D), the graphic's caption either made no contribution to understanding the graphic's message or only hinted at it. For example, a caption might be very general and uninformative about a graphic's message, such as the caption "Delaware bankruptcies" that appeared on an information graphic in a local Delaware newspaper conveying that there was a sharp rise in Delaware bankruptcies in 2001, in contrast with the decreasing trend from 1998 to 2000; or a caption might only hint at a graphic's message, as is the case for the caption on the graphic in Fig. 1.

Next we examined the 56 captions in Categories A, B, and C (those that at least made some contribution to understanding the graphic's message) to identify how easily they could be parsed and understood by a general-purpose natural language system. Unfortunately, we found that captions are often sentence fragments or contain some other kind of ill-formedness. For example, the caption "Small Businesses: Still Upbeat" on the graphic in Fig. 1 is a sentence fragment, as is the overall caption "Wired America: White, Urban, and College-Educated" on the graphic in Fig. 2. Furthermore, many captions were designed to be cute or humorous, such as the Category-C caption "Bad Moon Rising" on a graphic that conveyed an increasing trend in delinquent debts.

5 The intended message had previously been annotated by two coders.


Interpretation of such captions would require extensive analogical reasoning that is beyond the capability of current natural language systems.

3.3 Applications of Graphic Understanding

Although many research efforts have investigated the summarization of textual documents ([20, 23, 24, 27–29, 34, 38, 39] are a few examples), little attention has been given to graphics in multimodal documents. Yet with the advent of digital libraries, the need for intelligent summarization, indexing, and retrieval of multimodal documents has become apparent [25, 35].

To our knowledge, our work is the only research effort that has begun to address the issue of taking the messages conveyed by information graphics into account when processing and summarizing a multimodal document. Yet as our corpus analysis has shown, information graphics cannot be ignored. We contend that the core message of an information graphic can serve as a basis for incorporating the graphic into an overall summary of a multimodal document, thereby producing a richer summary that captures more of the document's content.

Individuals who are blind face great difficulty when presented with multimodal documents. Although screen readers such as JAWS can read the text to the user via speech, graphics pose serious problems. W3C accessibility guidelines recommend that web designers provide textual equivalents for all graphics and images [36]; however, such alt text is generally omitted or poorly constructed. The WebInSight project [2] seeks to address this issue for the broad class of images on the web by utilizing a combination of optical character recognition, web context labeling, and human labeling to produce alt text. However, given that a large proportion of information graphics lack helpful captions, this approach will not suffice. Researchers have devised systems that convey information graphics in alternative media such as sound, tactile, or haptic representations [1, 9, 26, 32, 33, 37]. However, these approaches have significant limitations, such as requiring expensive equipment or requiring that the user construct a mental map of the graphic, something that Kennel observed is very difficult for users who are congenitally blind [21].

We are taking a very different approach. Instead of attempting to translate the graphic into another modality, we hypothesize that the user should be provided with the knowledge that would be gleaned from viewing the graphic. Thus we have designed a natural language system [15] that provides access to multimodal documents by (1) identifying the message conveyed by its information graphics (currently limited to simple bar charts), and (2) using a screen reader to read the text to the user and to convey the messages of the document's information graphics. This system will eventually include an interactive dialogue capability in which the system responds to followup questions from users requesting further detail about the graphic. Our approach has a number of advantages, including not requiring expensive equipment and placing relatively little cognitive load on the user.


4 A Graph Understanding System

As a first step toward processing multimodal documents, we have developed a Bayesian system for identifying the intended message of a simple (not grouped or stacked) bar chart, such as the graphic in Fig. 1. Simple bar charts provide a rich domain for graph understanding. They can convey a variety of different kinds of messages, such as trends, a contrast between a point in the graphic and a trend, a comparison between entities in a graphic, and the rank of an entity in the graphic. In addition, graphic designers use a variety of mechanisms to aid the viewer in recognizing the intended message of a bar chart; such mechanisms include coloring a bar differently from the other bars in the graphic, mentioning a bar's label in the caption, and making design choices that render some perceptual tasks easier than others. Figure 6 gives an overview of our algorithm for processing a simple bar chart, and the steps of the algorithm are described in more detail in the following sections. Although our work has thus far been limited to simple bar charts, we believe that our methodology is extendible to other kinds of information graphics.

4.1 Input to the Graph Understanding System

Input to the graph understanding system is an XML representation of the graphic that is produced by a Visual Extraction Module (VEM) [7]. It specifies the graphic's components, including its axes, the location and heights of bars, the bar labels, their colors, the caption, etc. Although the VEM must process a raw image, the task is much more constrained, and thus much easier, than most image recognition problems. Currently, the VEM can handle electronic images of simple bar charts that are clearly drawn in a fixed set of fonts and with standard placement of labels and captions. For example, the VEM could not produce XML for the graphic in Fig. 1, since the text "Companies optimistic about US Economy" appears within the bars rather than above the graphic or on the dependent axis. Current work is removing these limitations. If the independent axis of the bar chart represents an ordinal attribute such as years or ages, a preprocessing phase uses a set of heuristics to divide the bars into consecutive segments that might represent possible trends, and adds the best division, along with any salient divisions, to the XML representation of the graphic. Further detail on this preprocessing can be found in [12].
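To make the flavor of this input concrete, the following sketch parses a small bar chart encoded in XML; the element and attribute names here are our own invention for illustration, not the actual schema produced by the VEM of [7].

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of a simple bar chart; the real VEM schema
# in [7] may differ in element and attribute names.
SAMPLE = """
<barchart caption="Delaware bankruptcies">
  <bar label="1998" height="92" color="gray" annotation=""/>
  <bar label="1999" height="80" color="gray" annotation=""/>
  <bar label="2000" height="74" color="gray" annotation=""/>
  <bar label="2001" height="156" color="black" annotation="156"/>
</barchart>
"""

root = ET.fromstring(SAMPLE)
bars = [
    {
        "label": bar.get("label"),
        "height": float(bar.get("height")),
        "highlighted": bar.get("color") != "gray",  # coloured unlike the rest
        "annotated": bar.get("annotation") != "",
    }
    for bar in root.iter("bar")
]
print(root.get("caption"))
for bar in bars:
    print(bar)
```

Once the chart is in this form, the communicative signals discussed in Sect. 4.3 (highlighting, annotation, caption mentions) can be read off directly rather than recovered from pixels.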

4.2 A Bayesian Network for Graph Understanding

To generate multimodal documents, the AutoBrief project [22] first identified which communicative goals would be achieved via text and which via graphics. During the first phase of graphics generation, media-independent communicative goals were mapped to perceptual and cognitive tasks that the graphics should support. For example, if the goal is for the viewer to believe that Company A had the highest profits of a set of companies, then it would


Input: electronic image of simple bar chart
Output: logical representation of the bar chart's message

1. Construct XML representation of the bar chart's components (done by Visual Extraction Module: Sect. 4.1)

2. If the independent axis represents an ordinal attribute, augment the XML representation with a division of the bars into sequential subsegments representing possible trends (Sect. 4.1)

3. Augment the XML representation to indicate the presence of a verb in one of the identified verb classes (done by Caption Processing Module: Sect. 4.3)

4. Augment the XML representation to indicate the presence of a noun in the caption that matches a bar label (done by Caption Processing Module: Sect. 4.3)

5. Construct the non-leaf nodes of the Bayesian network by chaining between goals and their constituent subgoals (Sect. 4.2)

6. Add conditional probability tables for each child node in the Bayesian network, as pre-computed from a corpus of bar charts (Sect. 4.4)

7. Add evidence nodes to each perceptual task node in the Bayesian network, reflecting evidence about whether that perceptual task is part of the plan that the viewer is intended to pursue in identifying the graphic's message (Sect. 4.3)
   A. Add evidence capturing highlighting of the bars that are parameters of the perceptual task
   B. Add evidence capturing annotation of the bars that are parameters of the perceptual task
   C. Add evidence capturing the presence in the caption of nouns matching the labels of bars that are parameters of the perceptual task
   D. Add evidence capturing whether a bar that is a parameter of the perceptual task stands out by being unusually tall with respect to the other bars in the bar chart
   E. Add evidence capturing whether a bar that is a parameter of the perceptual task is associated with the most recent date on a time line
   F. Add evidence about the relative effort required for the perceptual task

8. Add evidence nodes to the top-level node in the Bayesian network capturing whether one of the identified verb or adjective classes is present in the caption (Sect. 4.3)

9. Add conditional probability tables for each evidence node, as pre-computed from a corpus of bar charts (Sect. 4.4)

10. Propagate the evidence through the Bayesian network
11. Select the message hypothesis with the highest associated probability

Fig. 6. Graph understanding algorithm

be desirable to design a graphic that facilitates the tasks of comparing the profits of all the companies, locating the maximum profit, and identifying the company associated with the maximum. In the second phase of graphics generation, a constraint satisfaction algorithm was used to design a graphic that facilitated these tasks to the best extent possible.


We view information graphics as a form of language, and we take a plan recognition approach to recognizing the intended message of an information graphic. Plan recognition has been used extensively in understanding utterances and recognizing their intended meaning [4, 5, 31]. To understand information graphics, we reason in the opposite direction from AutoBrief: given an information graphic, we extract the communicative signals present in the graphic as a result of choices made by the graphic designer, and we use these to recognize the plan that the graphic designer intends for the viewer to perform in deciphering the graphic's intended message. The top level goal of this plan captures the graphic designer's primary communicative goal, namely the message that the graphic is intended to convey.

Following the approach introduced by Charniak and Goldman [6] for language understanding, we capture plan recognition in a probabilistic framework. The top level of our Bayesian network represents the twelve categories of messages that we have observed for simple bar charts, such as conveying a trend (rising, falling, or stable), contrasting a point with a trend, conveying the rank of an entity, comparing two entities, etc. The next level of the Bayesian network captures the possible instantiations of each of these message categories for the graphic being analyzed. For example, if a bar chart has six bars, the parameter of the Get-Rank message category could be instantiated with the labels of any of the six bars. Lower levels in the Bayesian network represent decompositions of the communicative goal represented by the parent node into more specific subgoals and eventually into primitive perceptual and cognitive tasks that the viewer would be expected to perform. For example, getting the rank of a bar can be accomplished either by getting a bar's rank given its label (perhaps the bar's label was mentioned in the caption, thereby making it salient to the viewer) or by getting a bar's rank starting with the bar itself (perhaps the bar has been highlighted to draw attention to it in the graphic). Getting a bar's rank given its label lx can be further decomposed into three perceptual tasks:

1. Perceive-bar: perceive the bar bx whose label is lx
2. Perceive-If-Sorted: perceive whether the bars appear in sorted order in the bar chart
3. Perceive-Rank: perceive the rank of bar bx in the bar chart (this task is much easier if the bars are in sorted order, as will be discussed in Sect. 4.3)

Given an information graphic, our system constructs the Bayesian network for it using the Netica [30] software for building and reasoning with Bayesian networks.
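The following minimal sketch illustrates the kind of probabilistic reasoning involved, using a drastically flattened network in which evidence nodes hang directly off competing message hypotheses; the actual system uses a multi-level Netica network, and all numbers below are invented for illustration rather than corpus-derived.

```python
# Flattened, illustrative Bayesian update: each candidate message has a
# prior, and each observed communicative signal has a likelihood under
# each hypothesis. All numbers are invented; the real probabilities are
# pre-computed from a corpus of bar charts (Sect. 4.4).
priors = {"Get-Rank(bar3)": 0.2, "Rising-Trend": 0.5, "Compare(bar1,bar3)": 0.3}

likelihoods = {  # P(signal | hypothesis)
    "bar3 highlighted":    {"Get-Rank(bar3)": 0.60, "Rising-Trend": 0.05,
                            "Compare(bar1,bar3)": 0.30},
    "Perceive-Rank cheap": {"Get-Rank(bar3)": 0.70, "Rising-Trend": 0.20,
                            "Compare(bar1,bar3)": 0.40},
}
observed = ["bar3 highlighted", "Perceive-Rank cheap"]

# Posterior is proportional to prior times the product of likelihoods
# (signals treated as conditionally independent given the hypothesis).
posterior = {}
for hypothesis, p in priors.items():
    for signal in observed:
        p *= likelihoods[signal][hypothesis]
    posterior[hypothesis] = p
total = sum(posterior.values())
posterior = {h: p / total for h, p in posterior.items()}

# Step 11 of Fig. 6: select the hypothesis with the highest probability.
print(max(posterior, key=posterior.get), posterior)
```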

4.3 Entering Evidence into the Bayesian Network

In order to reason about the graphic's most likely high-level communicative goal, and thereby recognize the graphic's intended message, evidence from the graphic must be entered into the Bayesian network. The evidence takes


the form of communicative signals present in the graphic, both as a result of design choices made by the graphic designer and as a result of mutual beliefs of the designer and viewer about what the viewer will be interested in. These communicative signals are multimodal in the sense that some are visual signals in the graphic itself and some take the form of words in the caption assigned to the graphic.

Our first set of communicative signals results from explicit actions on the part of the graphic designer that draw attention to an entity in the graphic. These include highlighting a bar by coloring it differently from the other bars in the bar chart, annotating a bar with its value or a special symbol, and mentioning the bar's label in the caption. The XML representation of the graphic contains each bar's color and any annotations, so identifying bars that are salient due to highlighting or annotation is easy. Our Caption Processing Module [13] uses a part-of-speech tagger to extract nouns from the caption and match them against the bar labels, thereby identifying any bars that are salient by virtue of being mentioned in the caption.
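A minimal sketch of this matching step is given below; it uses NLTK's off-the-shelf tokenizer and part-of-speech tagger as stand-ins, since the tagger used by the actual Caption Processing Module [13] is not specified here.

```python
import nltk  # requires the 'punkt' tokenizer and POS tagger data

def salient_bars_from_caption(caption, bar_labels):
    """Return the bar labels mentioned (as nouns) in the caption."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    nouns = {word.lower() for word, tag in tagged if tag.startswith("NN")}
    return [label for label in bar_labels
            if any(noun in label.lower() for noun in nouns)]

labels = ["Visa", "MasterCard", "American Express", "Discover"]
print(salient_bars_from_caption("American Express Dominates", labels))
# -> ['American Express']
```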

Our second set of communicative signals takes into account presumed mutual beliefs of the graphic designer and the viewer about entities that will draw the viewer's attention. Thus any bars that are much taller than the other bars in the bar chart, or a bar associated with the most recent date on a timeline, are noted as salient entities, since viewers will presumably notice a bar that differs significantly in height from the other bars and will be most interested in recent events.

Our third set of communicative signals captures the relative difficulty of different perceptual tasks in the graphic. The design of a graphic can make some perceptual tasks easier than others. For example, it is much easier to identify the taller of two bars in a bar chart if the two bars are located adjacent to one another and are significantly different in height than if they are interspersed with other bars and their heights are similar. We have adopted the AutoBrief hypothesis [22] that graphic designers construct a graphic that facilitates as much as possible the most important perceptual tasks for achieving the graphic's communicative goal. Thus the relative difficulty of different perceptual tasks serves as a communicative signal about which tasks the viewer was intended to perform in deciphering the graphic's message.

To extract this communicative signal from a bar chart, we constructed a set of effort estimation rules that compute the effort required for a variety of perceptual tasks that might be performed on a given graphic. Each rule represents a perceptual task and consists of a set of condition-computation pairs. The condition part of each pair captures characteristics of the graphic that must hold in order for its associated computation to be applicable. For example, consider the bar chart displayed in Fig. 7. It illustrates three conditions that might hold in a bar chart: (1) a bar might be explicitly annotated with its value, as is the case for the bar labelled Norway; (2) a bar might not be annotated with its value, but the top of the bar might be aligned with a labelled tick mark on the dependent axis, as is the case for the bar labelled Denmark; or (3) determining the bar's value might require interpolation between the


Fig. 7. A bar chart illustrating different amounts of perceptual effort

values of two labelled tick marks on the dependent axis, as is the case for the bar labelled Britain. Our rule for estimating the effort required to determine the value associated with the top of a bar captures each of these different conditions, listed in order of increasing effort required to perform the task, and specifies the computation to apply when the condition is satisfied; the associated effort computations are based on research by cognitive psychologists. Our effort estimation rules were validated by eyetracking experiments and are presented in [14].
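The structure of such a rule can be sketched as an ordered list of condition-computation pairs that are tried in turn; the three conditions below mirror the cases illustrated in Fig. 7, while the numeric effort values are placeholders rather than the validated estimates of [14].

```python
# An effort estimation rule as ordered condition-computation pairs for
# the task "determine the value at the top of a bar". The conditions
# mirror the three cases in Fig. 7; the effort numbers are placeholders,
# not the validated estimates of [14].
def is_annotated(bar, chart):
    return bar.get("annotation") is not None

def aligns_with_tick(bar, chart):
    return bar["height"] in chart["tick_values"]

PERCEIVE_VALUE_RULE = [
    (is_annotated,            lambda bar, chart: 100),  # read the annotation
    (aligns_with_tick,        lambda bar, chart: 300),  # read the tick label
    (lambda bar, chart: True, lambda bar, chart: 800),  # interpolate
]

def estimate_effort(rule, bar, chart):
    # Conditions are listed in order of increasing effort; the first one
    # that holds determines the computation to apply.
    for condition, computation in rule:
        if condition(bar, chart):
            return computation(bar, chart)

chart = {"tick_values": {0, 10, 20, 30, 40}}
print(estimate_effort(PERCEIVE_VALUE_RULE,
                      {"height": 25, "annotation": None}, chart))
# -> 800 (interpolation, the costliest of the three conditions)
```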

The above communicative signals provide evidence regarding perceptual tasks that the viewer might be intended to perform. Each instance of a perceptual task has instantiated parameters; for example, the perceptual task

Perceive-Rank(viewer, bar, rank)

has bar as one of its parameters. If the particular bar instantiating the bar parameter is salient by virtue of a communicative signal in the graphic, then that serves as evidence that the viewer might be intended to perform this particular perceptual task. Similarly, the amount of effort required to perform the Perceive-Rank task also serves as evidence about whether the viewer was really intended to perform the task.6 Thus evidence nodes capturing these communicative signals are attached to each primitive perceptual task node in the Bayesian network.

Our last set of communicative signals is the presence of a verb or adjective in the caption that suggests a particular category of message. For example, although it would be very difficult to extract the graphic's message from the humorous caption Bad Moon Rising7, the presence of the verb rising

6 Note that if the bars appear in order of height in the bar chart, then the effort required for the Perceive-Rank task will be much lower than if they are ordered differently, such as in alphabetical order of their labels.

7 This caption appeared on a graphic conveying an increasing trend in delinquent debts.


Table 1. A sample conditional probability table

Perceive-Rank(viewer, bar, rank)             InPlan    NotInPlan
Only bar is annotated                         24.99       2.3
bar and others are annotated                   0.01       0.9
Only bars other than bar are annotated         0.01      19.5
No bars are annotated                         74.99      77.3

suggests the increasing-trend category of message. We identified a set of verbs that might suggest one of our 12 categories of messages and organized them into classes containing similar verbs. For example, one verb class contains verbs such as rise, increase, grow, improve, surge, etc. Our Caption Processing Module identifies the presence in the caption of a verb from one of our verb classes, or of an adjective derived from such a verb (such as growing in the caption "A Growing Biotech Market"). Since this kind of communicative signal suggests a particular category of high-level message, verb and adjective evidence nodes are attached to the top-level node in the Bayesian network.

4.4 Computing the Probability Tables

Associated with each child node in a Bayesian network is a conditional probability table that specifies the probability of the child node given the value of a parent node. For our application, the value of the parent node is either that it is, or is not, part of the plan that the viewer is intended to pursue in recognizing the graphic's message. Table 1 displays the conditional probability table for the annotation evidence node attached to the Perceive-Rank node in the Bayesian network. It indicates that if the viewer is intended to perceive the rank of the particular bar that instantiates bar, then the probability is 24.99% that this particular bar is the only bar annotated, and the probability is 74.99% that no bars are annotated in the graphic. Negligible non-zero probabilities are assigned to situations in which this bar and others are annotated, or in which only other bars are annotated. Similarly, the table captures the probability of the bar being annotated given that Perceive-Rank is not part of the plan that the viewer is intended to pursue.

4.5 Examples of Message Recognition

Consider the graphic displayed in Fig. 8. The graphic's caption is uninformative about the graphic's intended message; it could be attached to a graphic conveying a variety of messages, including the relative rank of different record companies in terms of album sales or a comparison of the sales of two particular record companies. However, our system hypothesizes that the graphic is conveying a changing trend in record album sales, with sales rising from 1998 to 2000 and then falling from 2000 to 2002.


[Bar chart titled "The sound of sales: Total albums sold in first quarter, in millions", with bars for 1998 (73), 1999 (78), 2000 (92), 2001 (76), and 2002 (59).]

Fig. 8. A slight variation of a graphic from USA Today

[Bar chart of GDP per capita, 2001 (in thousands), for the U.S., Britain, Luxembourg, Norway, Switzerland, Japan, and Denmark, with the bar for the U.S. highlighted.]

Fig. 9. A variation of a graphic from US News and World Report (in the original graphic, the bar for the United States was annotated; here we have highlighted it. We have also placed the dependent axis label alongside the dependent axis, instead of at the top of the graph)

Now consider the graphic in Fig. 9. If the bar for the United States is not highlighted, then our system hypothesizes that the graphic is conveying the relative rank of the different countries in terms of GDP per capita. However, when the bar for the United States is colored differently from the other bars, as in Fig. 9, it becomes salient. In this case, our system hypothesizes a different message – namely, that the graphic is intended to convey that the United States ranks third in GDP per capita among the countries listed. Similar results would be obtained if the bar for the United States were not highlighted, but a caption such as "United States Productivity" were attached to the graphic, thereby again making the bar for the United States salient by mentioning its label in the caption.


4.6 Evaluation

We evaluated the effectiveness of our graph understanding system on a corpus of 110 bar charts whose intended message had been previously annotated by two coders. Since the corpus is small, we used leave-one-out cross validation, in which each bar chart was used once as the test graphic and the other 109 bar charts were used to compute the probabilities for the nodes in the Bayesian network. The system was credited with success if its top-rated hypothesis matched the message assigned to the bar chart by the human coders and the probability that the system assigned to its hypothesis exceeded 50%. Overall success was computed as the average over all 110 experiments. Our system's success rate was 79.1%, which far exceeds baselines such as the frequency of the most prevalent type of message (rising trend, at 23.6%). But it should be noted that the system must identify both the category and the parameters of the message. For example, the system must not only recognize when a bar chart is conveying the rank of an entity in the graphic but must also identify the specific entity in question.
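The evaluation protocol can be sketched as follows, with train_network and top_hypothesis standing in for the corpus-based probability estimation and the Bayesian inference described in the preceding sections.

```python
# Sketch of the leave-one-out protocol; train_network and top_hypothesis
# are stand-ins for the corpus-based probability estimation and the
# Bayesian inference of Sects. 4.2-4.4.
def leave_one_out_success(corpus, train_network, top_hypothesis):
    successes = 0
    for i, chart in enumerate(corpus):
        training = corpus[:i] + corpus[i + 1:]  # the other 109 bar charts
        network = train_network(training)
        message, probability = top_hypothesis(network, chart)
        # Success requires matching the coders' annotation with a
        # probability above 50%.
        if message == chart["annotated_message"] and probability > 0.5:
            successes += 1
    return successes / len(corpus)
```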

Since we are interested in the impact of different communicative signals and their particular modality, we undertook an additional experiment in which we evaluated how each kind of evidence affects our system's ability to recognize the graphic's message. As a baseline, we used the system's success rate when all evidence nodes are included in the network, which is 79.1%. For each type of evidence, we then computed the system's success rate when that evidence node was disabled in the Bayesian network, and we analyzed the resulting degradation in performance (if any) from the baseline. It should be noted that disabling an evidence source means that we remove the ability of that kind of evidence to contribute to the probabilities in the Bayesian network. This differs from merely failing to record the presence of that kind of evidence in the graphic, since both the presence and the absence of a particular communicative signal constitute evidence.

We used a one-tailed McNemar test [11, 18] for the significance of change in related samples. Our samples are related since we are comparing performance by a baseline system with performance by a system that has been perturbed by omitting an evidence source. Table 2 displays the results for the evidence sources where the performance degradation is significant at the .05 level or better.
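For readers unfamiliar with the test, the sketch below computes the McNemar statistic with the usual continuity correction from the two discordant counts described in the footnote to Table 2; the choice of the continuity-corrected, one-tailed variant is our assumption about the calculator of [18]. With ten charts lost and none gained, it reproduces the 8.100 reported for the Noun-matching-bar-label row.

```python
import math

def mcnemar_one_tailed(b, c):
    """b: correct by System-1 only; c: correct by System-2 only."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected
    # Tail probability of a chi-square variate with 1 degree of freedom,
    # halved for the one-tailed test.
    return statistic, math.erfc(math.sqrt(statistic / 2)) / 2

# Ten charts lost and none gained matches the statistic of the
# Noun-matching-bar-label row in Table 2.
print(mcnemar_one_tailed(10, 0))  # -> (8.1, ~0.002)
```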

It is interesting to note that the evidence sources that affect performance include signals from both the visual modality (such as highlighting in the graphic and the relative effort of different perceptual tasks) and the textual modality (such as a noun in the caption matching a bar label in the graphic).

Disabling evidence regarding the mention of a bar label in the caption (referred to as Noun-matching-bar-label in Table 2) caused the greatest degradation in performance. We examined those bar charts where a bar label was referenced in the caption and the intended message was correctly identified by the baseline system with all evidence sources enabled. We found that in ten


Table 2. Degradation in performance with omission of evidence source

Baseline: system with all evidence — 79% success rate

Type of evidence omitted              Success rate (%)   McNemar statistic   p value
Noun-matching-bar-label evidence             70               8.100            .005
Effort evidence                              71               5.818            .01
Current-date evidence                        72               6.125a           .01
Highlighting evidence                        74               3.125            .05
Salient-height evidence                      74               3.125            .05

a The McNemar test is based on (1) the number correct by System-1 and wrong by System-2, and (2) the number wrong by System-1 and correct by System-2. Thus although a greater difference in success rates usually correlates with greater statistical significance, this is not always the case.

instances where other evidence made the referenced bar salient (such as highlighting the bar or the bar being significantly taller than the other bars in the bar chart), the system with Noun-matching-bar-label evidence disabled was still able to recognize the graphic's intended message. Thus we see that although the absence of one evidence source may cause performance to degrade, this degradation can be mitigated by other compensating evidence sources.

5 Conclusion and Discussion

In this chapter, we have demonstrated the importance of information graphics in a multimodal document. We have also shown that a graphic's caption is often very general and uninformative, and therefore cannot be used as a substitute for the graphic. Thus it is essential that information graphics be understood and their intended messages taken into account when processing multimodal documents. Our graph understanding system is a first step toward this goal. It extracts communicative signals from an information graphic and enters them into a Bayesian network that can hypothesize the message conveyed by the graphic. To our knowledge, no other research effort has addressed the problem of inferring the intended message of an information graphic.

Our implemented system is limited to simple bar charts; we are currently extending our methodology to other kinds of information graphics, such as line graphs and grouped bar charts. The latter are particularly interesting since they often convey two messages, as was seen for the graphics in Fig. 2. We are also investigating the synergy between recognizing a graphic's message and identifying the topic of an article. Our graph understanding system exploits communicative signals in the graphic and its caption. However, if an entity in the graphic is mentioned in the article's text, it becomes salient in the graphic. On the other hand, the graphic can suggest the focus or topic of


the article. For example, one graphic in our corpus highlights the bar for American Express, and the intended message hypothesized by our system is that the graphic conveys the rank of American Express among the credit card companies listed. Although the article mentions a number of different credit card companies, the focus of the graphic is on American Express, and this suggests that the article is about American Express.

Our system for providing blind individuals with effective access to multimodal documents is being field-tested, and the initial reaction from users is very positive. Currently, only the graphic's intended message is included in the initial summary of the graphic that is presented to the user. Our next step is to identify what additional information (if any) should be included, along with the intended message, in the initial summary. For example, if a bar chart conveys an overall rising trend but one bar deviates from this trend, should this exceptional bar be mentioned in the initial summary of the graphic? Furthermore, should the graphic's initial summary repeat information in the article's text? For example, if it is deemed important to mention the values at the end points of the trend, should this information be repeated in the graphic's initial summary if it is already part of the article's text that is being read to the user?8 And finally, we must develop the interactive natural language dialogue capability that will enable the user to ask followup questions regarding the graphic.

8 This question was raised by Seniz Demir and Kathy McCoy, colleagues on the project.

The next step in our digital libraries project is to develop a summarization strategy that takes into account both a document's text and the messages conveyed by its information graphics. This will entail determining when the graphic's message is redundant because it has already been captured by the text. We must also develop a method for coherently integrating the graphic's message with a summary of the article's text. Given the importance of information graphics in a multimodal document, we believe that our approach will result in a richer and more complete summary, which can then be used to more effectively index and retrieve documents in a digital library.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534948.

References

1. James Alty and Dimitrios Rigas. Exploring the use of structured music stimuli to communicate simple diagrams: The role of context. International Journal of Human-Computer Studies, 62(1):21–40, 2005.


2. Jeffrey Bigham, Ryan Kaminsky, Richard Ladner, Oscar Danielsson, and Gordon Hempton. WebInSight: Making web images accessible. In Proceedings of the Eighth International ACM SIGACCESS Conference on Computers and Accessibility, pages 181–188, 2006.

3. Ben Bradshaw. Semantic based image retrieval: A probabilistic approach. In Proceedings of the 8th ACM International Conference on Multimedia, pages 167–176, 2000.

4. Sandra Carberry. Plan Recognition in Natural Language Dialogue. ACL-MIT Press Series on Natural Language Processing. MIT, Cambridge, Massachusetts, 1990.

5. Sandra Carberry. Techniques for plan recognition. User Modeling and User-Adapted Interaction, 11(1–2):31–48, 2001.

6. Eugene Charniak and Robert Goldman. A Bayesian model of plan recognition. Artificial Intelligence Journal, 64:53–79, 1993.

7. Daniel Chester and Stephanie Elzer. Getting computers to see information graphics so users do not have to. In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems, pages 660–668, 2005.

8. Herbert Clark. Using Language. Cambridge University Press, Cambridge, 1996.

9. Robert F. Cohen, Arthur Meacham, and Joelle Skaff. Teaching graphs to visually impaired students using an active auditory interface. In SIGCSE '06: Proceedings of the 37th SIGCSE Technical Symposium on Computer Science Education, pages 279–282, 2006.

10. Marc Corio and Guy Lapalme. Generation of texts for information graphics. In Proceedings of the 7th European Workshop on Natural Language Generation EWNLG'99, pages 49–58, 1999.

11. Wayne Daniel. Applied Nonparametric Statistics. Houghton Mifflin, Boston, 1978.

12. Stephanie Elzer. A Probabilistic Framework for the Recognition of Intention in Information Graphics. PhD thesis, University of Delaware, Newark, DE 19716, 2006.

13. Stephanie Elzer, Sandra Carberry, Daniel Chester, Seniz Demir, Nancy Green, Ingrid Zukerman, and Keith Trnka. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 223–230, 2005.

14. Stephanie Elzer, Nancy Green, Sandra Carberry, and James Hoffman. A model of perceptual task effort for bar charts and its role in recognizing intention. User Modeling and User-Adapted Interaction, 16(1):1–30, 2006.

15. Stephanie Elzer, Edward Schwartz, Sandra Carberry, Daniel Chester, Seniz Demir, and Peng Wu. A browser extension for providing visually impaired users access to the content of bar charts on the web. In International Conference on Web Information Systems, pages 59–66, 2007.

16. Robert Futrelle. Summarization of diagrams in documents. In I. Mani and M. Maybury, editors, Advances in Automated Text Summarization, pages 403–421. MIT, Cambridge, 1999.

17. Robert Futrelle and Nikos Nikolakis. Efficient analysis of complex diagrams using constraint-based parsing. In Proceedings of the Third International Conference on Document Analysis and Recognition, pages 782–790, 1995.

18. GraphPad Software. QuickCalcs: Online calculators for scientists, 2002. http://www.graphpad.com/quickcalcs/McNemarEx.cfm.


19. Nancy Green, Giuseppe Carenini, Stephan Kerpedjiev, Joe Mattis, Johanna Moore, and Steven Roth. AutoBrief: An experimental system for the automatic generation of briefings in integrated text and graphics. International Journal of Human-Computer Studies, 61(1):32–70, 2004.

20. E. Hovy and C.-Y. Lin. Automated text summarization in SUMMARIST. In I. Mani and M. Maybury, editors, Advances in Automatic Text Summarization, pages 81–94. MIT, Cambridge, 1999.

21. A. Kennel. Audiograf: A diagram-reader for the blind. In Second Annual ACM Conference on Assistive Technologies, pages 51–56, 1996.

22. Stephan Kerpedjiev and Steven Roth. Mapping communicative goals into conceptual tasks to generate graphics in discourse. In Proceedings of the International Conference on Intelligent User Interfaces, pages 60–67, 2000.

23. Inderjeet Mani and Mark Maybury, editors. Advances in Automatic Text Summarization. MIT, Cambridge, 1999.

24. Daniel Marcu. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational Linguistics, 26(3):395–448, 2000.

25. Mark Maybury, editor. Intelligent Multimedia Information Retrieval. MIT, Cambridge, 1997.

26. David K. McGookin and Stephen A. Brewster. Soundbar: Exploiting multiple views in multimodal graph browsing. In NordiCHI '06: Proceedings of the 4th Nordic Conference on Human-Computer Interaction, pages 145–154, 2006.

27. Marie-Francine Moens, Roxana Angheluta, and Jos Dumortier. Generic technologies for single and multi-document summarization. Information Processing and Management, 41(3):569–586, 2005.

28. Jane Morris and Graeme Hirst. Non-classical lexical semantic relations. In Proceedings of the HLT Workshop on Computational Lexical Semantics, pages 46–51, 2004.

29. Ani Nenkova. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 1436–1441, 2005.

30. Norsys Software Corp. Netica, 2005.

31. Raymond Perrault and James Allen. A plan-based analysis of indirect speech acts. American Journal of Computational Linguistics, 6(3–4):167–182, 1980.

32. Rameshsharma Ramloll, Wai Yu, Stephen Brewster, Beate Riedel, Mike Burton, and Gisela Dimigen. Constructing sonified haptic line graphs for the blind student: First steps. In Proceedings of the 4th ACM Conference on Assistive Technologies, pages 17–25, 2000.

33. Martin Rotard, Sven Knodler, and Thomas Ertl. A tactile web browser for the visually disabled. In HYPERTEXT '05: Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, pages 15–22, 2005.

34. Barry Schiffman, Inderjeet Mani, and Kristian Concepcion. Producing biographical summaries: Combining linguistic knowledge with corpus statistics. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 450–457, 2001.

35. Rohini K. Srihari, Zhongfei Zhang, and Aibing Rao. Intelligent indexing and semantic retrieval of multimodal documents. Information Retrieval, 2(2):1–37, 2000.

36. W3C: Web Accessibility Initiative. http://www.w3c.org/wai/.


37. Steven Wall and Stephen Brewster. Tac-tiles: Multimodal pie charts for visually impaired users. In Proceedings of the 4th Nordic Conference on Human-Computer Interaction, pages 9–18, 2006.

38. Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 552–559, 2007.

39. Jen-Yuan Yeh, Hao-Ren Ke, Wei-Pang Yang, and I-Heng Meng. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41(1):75–95, 2005.

40. Jin Yu, Ehud Reiter, Jim Hunter, and Chris Mellish. Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering, 13:25–49, 2007.

41. Jin Yu, Jim Hunter, Ehud Reiter, and Somayajulu Sripada. Recognising visual patterns to communicate gas turbine time-series data. In Proceedings of the 22nd SCAI International Conference on Knowledge-Based Systems and Applied Artificial Intelligence (ES2002), pages 105–118, 2002.


Fuzzy Audio Similarity Measures Based on Spectrum Histograms and Fluctuation Patterns

Klaas Bosteels and Etienne E. Kerre

Fuzziness and Uncertainty Modelling Research Group
Department of Applied Mathematics and Computer Science
Ghent University, Krijgslaan 281 (S9), B-9000 Gent, Belgium
[email protected], [email protected]

Summary. Spectrum histograms and fluctuation patterns are representations of audio fragments. By comparing these representations, we can determine the similarity between the corresponding fragments. Traditionally, this is done using the Euclidean distance. In this chapter, however, we study an alternative approach, namely, comparing the representations by means of fuzzy similarity measures. Once the preliminary notions have been addressed, we present a recently introduced triparametric family of fuzzy similarity measures, together with several constraints on its parameters that warrant certain potentially desirable or useful properties. In particular, we present constraints for several forms of restrictability, which allow the computation time to be reduced in practical applications. Next, we use some members of this family to construct various audio similarity measures based on spectrum histograms and fluctuation patterns. To conclude, we analyse the performance of the constructed audio similarity measures experimentally.

1 Introduction

Portable audio players can store several thousands of songs these days, and online music stores currently offer millions of tracks. This abundance of music drastically increases the need for applications that automatically analyse, retrieve or organize audio files. Measures that are able to express the similarity between two given audio fragments are a fundamental component in many of these applications (e.g. [1–6]). In particular, many computational intelligence methods for organizing and exploring music collections rely on such an audio similarity measure. The SOM-enhanced JukeBox presented in [6], which uses unsupervised neural networks to build "geographical" maps of music archives, is a noteworthy example.

Usually, audio similarity measures are constructed using a feature-based approach. The audio fragments are represented by real-valued feature vectors,



and the similarity is calculated by comparing these vectors. We consider two types of feature vectors in this chapter: spectrum histograms and fluctuation patterns. So far, the Euclidean distance has always been used for comparing feature vectors of these types. By identifying the feature vectors with fuzzy sets, however, the possibility arises to use fuzzy similarity measures for this task. In this chapter, we investigate this alternative approach.

2 Related Work and Motivation

The audio similarity measure introduced by Aucouturier and Pachet in [1], which can be regarded as an improvement of a technique by Logan and Salomon [7], is well known in its field. This measure calculates the similarity between two given audio fragments by comparing mixtures of Gaussian distributions that model the spectral information in the fragments. Mandel and Ellis proposed a simplified version of this approach [8]. They use a single Gaussian to model the spectral information, and compute the distance between two of these Gaussians by means of the symmetric Kullback–Leibler divergence. Calculating the Euclidean distance between the spectrum histograms [4] derived from the audio fragments is an alternative spectral approach that is even easier to implement and compute. Nevertheless, the experimental evaluation in [9] indicates that this approach based on spectrum histograms can outperform the above-mentioned more complex techniques in some cases.

Fluctuation patterns, which were originally called rhythm patterns [5], contain information that is complementary to spectral characteristics. Therefore, Pampalk combined a spectral audio similarity measure with the Euclidean distance between fluctuation patterns, and further optimized this combination by taking into account some additional information derived from the fluctuation patterns [3]. This led to the audio similarity measure that won the MIREX'06 (Music Information Retrieval Evaluation eXchange 2006) audio-based music similarity and retrieval task.1

Hence, both spectrum histograms and fluctuation patterns can be considered to be audio representations that play an important role in the current state of the art. Since the Euclidean distance has always been used to compare these representations so far, employing other approaches for the comparison is an interesting research direction that still needs to be explored. As mentioned in the introduction, we propose fuzzy similarity measures as alternatives for the Euclidean distance in this chapter. This does not add any unwanted complexity, because many fuzzy similarity measures are very easy to implement and compute, and fuzzy similarity measures offer the additional advantage of having been studied extensively and having very solid theoretical foundations. The main goal of this chapter is to demonstrate that by using

1 http://www.music-ir.org/mirex2006.


fuzzy similarity measures instead of the Euclidean distance for comparing the spectrum histograms or fluctuation patterns, we can obtain a framework for generating theoretically well-founded audio similarity measures that satisfy the specific properties required for a particular application, and that perform at least as well as the corresponding audio similarity measures based on the Euclidean distance.

3 Preliminaries

3.1 The Considered Representations of Audio Fragments

Audio fragments contain a lot of information. Therefore, they are typically reduced to relatively compact real-valued vectors before they are compared. Such a vector is usually called a feature vector, and its individual components are called features. Feature extraction is the process of converting a given audio fragment to the corresponding feature vector. Many types of feature vectors have been suggested in the literature. In this chapter, we restrict ourselves to spectrum histograms and fluctuation patterns. Both are derived from a spectrogram.

Spectrograms

For a given audio segment, the Fourier transform can be used to calculate the amplitude that corresponds with each frequency. By dividing an audio fragment into short subsequent segments and applying the Fourier transform to each of these segments, we get the amplitude for each time-frequency pair. Such a representation of an audio fragment is called a spectrogram. The individual frequencies of a spectrogram are usually consolidated into frequency bands to reduce the computation time. Furthermore, the amplitudes are normally converted to loudness values, i.e., values that are proportional to the perceived intensity of the frequency in question.
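As a rough illustration, the following NumPy sketch computes a bare-bones amplitude spectrogram; real feature extractors additionally window the segments, group frequencies into bands, and convert the amplitudes to loudness values.

```python
import numpy as np

def spectrogram(signal, segment_length=512):
    # Slice the signal into consecutive segments and apply the Fourier
    # transform to each; the result is an amplitude per (time, frequency).
    n_segments = len(signal) // segment_length
    segments = signal[:n_segments * segment_length].reshape(n_segments, -1)
    return np.abs(np.fft.rfft(segments, axis=1))

audio = np.random.randn(44100)      # one second of noise as a stand-in
print(spectrogram(audio).shape)     # (86, 257)
```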

We consider two types of spectrograms in this chapter. For the first type, we use the Mel scale for the frequencies, and the decibel scale for the loudness values. We call spectrograms of this type Mel spectrograms. The scales used for the second type of spectrograms are bark and sone, instead of Mel and decibel, respectively. We use the term sonogram for a spectrogram of this type. In theory, the sonogram should perform best because its scales correspond better with human perception. An incontestable disadvantage of the sonogram, however, is the fact that it requires significantly more computation time.

Spectrum Histograms

Starting from a spectrogram, we can calculate a spectrum histogram (SH) [3, 4] by counting how many times certain loudness levels are reached or exceeded


[Two grayscale panels: (a) frequency band versus loudness level; (b) frequency band versus modulation frequency.]

Fig. 1. The sonogram-based SH (a) and FP (b) for a fragment of "Temps pour nous" by Axelle Red. White depicts zero, and black represents the maximum value

in each frequency band. In this way, we get a simple summarization of the spectral shape of the audio fragment. This summarization is, to some extent, related to the perceived timbre of the audio fragment.

We used two implementations of SHs for this chapter. The first implementation is based on the Mel spectrogram, and the second one on the sonogram. Both implementations are written in Matlab using the MA toolbox [10], and in both cases a SH is a matrix with 30 rows (frequency bands) and 60 columns (loudness levels). Figure 1a shows an example of a SH.
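A bare-bones version of the SH computation might look as follows; it assumes the spectrogram is already given as a (frames × bands) matrix of loudness values, and it chooses the loudness levels by linearly spanning the observed range, which is a simplification of the MA toolbox implementation.

```python
import numpy as np

def spectrum_histogram(spec, n_levels=60):
    # spec: (frames, bands) loudness matrix.
    # sh[band, level] counts the frames in which the loudness of `band`
    # reaches or exceeds the given level.
    levels = np.linspace(spec.min(), spec.max(), n_levels)
    return np.array([[np.sum(spec[:, band] >= level) for level in levels]
                     for band in range(spec.shape[1])])

spec = np.abs(np.random.randn(200, 30))   # stand-in spectrogram, 30 bands
print(spectrum_histogram(spec).shape)     # (30, 60), as in the MA toolbox
```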

Fluctuation Patterns

By applying the Fourier transform to the subsequent loudness values in each frequency band of a segment of a spectrogram, we obtain the amplitudes that correspond with the loudness modulation frequencies for each frequency band. We get the fluctuation pattern (FP) [3, 5] for an audio fragment by calculating weighted versions of these coefficients for subsequent segments of the spectrogram, and then taking the mean of the values obtained for each segment. Since FPs describe the loudness fluctuations for each frequency band, they are, to some extent, related to the perceived rhythm.

For implementing FPs, we again used the MA toolbox. Our first implementation derives FPs from the Mel spectrogram, while the second one uses the sonogram. Both implementations generate FPs that are, like the SHs, 30 by 60 matrices in which the rows correspond with frequency bands. In this case, however, the columns represent modulation frequencies (ranging from 0 to 10 Hz). Figure 1b shows an example.
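The following simplified sketch conveys the idea: take the Fourier transform of each band's loudness curve within every segment, then average over segments. The perceptual weighting and the resampling to 60 modulation frequencies performed by the MA toolbox are omitted here.

```python
import numpy as np

def fluctuation_pattern(spec, segment_frames=128):
    # FFT along the time axis of each frequency band, per segment, then
    # the mean over segments; rows are modulation frequencies here.
    n_segments = spec.shape[0] // segment_frames
    segments = [spec[s * segment_frames:(s + 1) * segment_frames]
                for s in range(n_segments)]
    return np.mean([np.abs(np.fft.rfft(seg, axis=0)) for seg in segments],
                   axis=0)

spec = np.abs(np.random.randn(512, 30))   # stand-in spectrogram
print(fluctuation_pattern(spec).shape)    # (65, 30)
```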

3.2 Mathematical Foundations

In this subsection, we introduce some basic notions from fuzzy set theory, namely, fuzzy sets, fuzzy aggregation operators, and fuzzy similarity measures.

Fuzzy Sets

Let X be the universe of discourse, i.e., the set of all considered objects. In the case of an ordinary or crisp set A in X, we have either x ∈ A or x ∉ A for each


element x from X. Hence, a crisp set can be represented by a characteristic X → {0, 1} mapping. To avoid notational clutter, we reuse the name of a crisp set for its characteristic mapping. For instance, ∅ denotes the empty set as well as the mapping from the universe X to [0, 1] given by ∅(x) = 0, for all x ∈ X. We use the notation P(X) for the class of crisp sets in X, and we write PF(X) for the set of all finite crisp sets in X.

Now, the concept of a crisp set can be generalized as follows:

Definition 1. A fuzzy set A in a universe X is an X → [0, 1] mapping that associates with each element x from X a degree of membership A(x).

We use the notation F(X) for the class of fuzzy sets in X. For two fuzzy sets A and B in X, we write A ⊆ B if A(x) ≤ B(x) for all x ∈ X, and A = B iff A ⊆ B ∧ B ⊆ A.

The classical set-theoretic operations intersection and union can be generalized to fuzzy sets by means of a conjunctor and a disjunctor.

Definition 2. A conjunctor C is an increasing [0, 1]² → [0, 1] mapping that satisfies C(0, 0) = C(0, 1) = C(1, 0) = 0 and C(1, 1) = 1.

Definition 3. A disjunctor D is an increasing [0, 1]² → [0, 1] mapping that satisfies D(1, 1) = D(0, 1) = D(1, 0) = 1 and D(0, 0) = 0.

Definition 4. Let C be a conjunctor. The C-intersection A ∩C B of two fuzzy sets A and B in X is the fuzzy set in X given by, for all x ∈ X:

(A ∩C B)(x) = C(A(x), B(x))   (1)

Definition 5. Let D be a disjunctor. The D-union A ∪D B of two fuzzy sets A and B in X is the fuzzy set in X given by, for all x ∈ X:

(A ∪D B)(x) = D(A(x), B(x))   (2)

To conclude this section, we define the concepts support and sigma count:

Definition 6. The support supp A of a fuzzy set A in X is given by:

supp A = {x ∈ X | A(x) > 0}   (3)

Definition 7. The sigma count |A| of a fuzzy set A in X with finite support is given by:

|A| = ∑_{x∈X} A(x)   (4)

It is not hard to see that the sigma count is a generalization of the crisp concept of cardinality to fuzzy sets. As stated in its definition, this generalization is only defined for fuzzy sets with finite support. We call such fuzzy sets finite, and we use the notation FF(X) for the class of finite fuzzy sets in X. Obviously, all fuzzy sets in a finite universe X are finite. In the remainder of this chapter, X always denotes a finite universe.
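Since the universe is finite, these notions translate directly into code; the sketch below represents a finite fuzzy set as a dictionary from elements to membership degrees.

```python
def sigma_count(A):
    # Definition 7: the sum of all membership degrees.
    return sum(A.values())

def support(A):
    # Definition 6: the elements with strictly positive membership.
    return {x for x, degree in A.items() if degree > 0}

A = {"x1": 0.2, "x2": 1.0, "x3": 0.0}   # a finite fuzzy set
print(support(A))        # {'x1', 'x2'}
print(sigma_count(A))    # 1.2
```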


Fuzzy Aggregation Operators

Definition 8. A fuzzy aggregation operator H of arity n ∈ N \ {0} is an increasing [0, 1]ⁿ → [0, 1] mapping that satisfies H(0, 0, . . . , 0) = 0 and H(1, 1, . . . , 1) = 1.

Fuzzy aggregation operators of arity n are often said to be n-ary. Binary fuzzy aggregation operators are operators of arity 2. Also, note that we can naturally extend the usual order on R to a partial order on fuzzy aggregation operators. Namely, for two n-ary fuzzy aggregation operators H1 and H2, we write H1 ≤ H2 if H1(x1, x2, . . . , xn) ≤ H2(x1, x2, . . . , xn) holds for all x1, x2, . . . , xn ∈ [0, 1].

Triangular norms and conorms (t-norms and t-conorms for short) are well-known binary fuzzy aggregation operators.

Definition 9. An associative and commutative binary fuzzy aggregation operator T is called a t-norm if it satisfies T(x, 1) = x for all x ∈ [0, 1].

Definition 10. An associative and commutative binary fuzzy aggregation operator S is called a t-conorm if it satisfies S(x, 0) = x for all x ∈ [0, 1].

Each t-norm T is a conjunctor, and each t-conorm S is a disjunctor. Hence, they can be used to model the fuzzy intersection and union. More precisely, their pointwise extensions can be used for this:

Definition 11. The pointwise extension H of a binary fuzzy aggregation operator H is defined as, for all A, B ∈ F(X) and x ∈ X:

H(A, B)(x) = H(A(x), B(x))   (5)

i.e., A ∩T B = T(A, B) and A ∪S B = S(A, B) for all A, B ∈ F(X). Furthermore, note that t-norms and t-conorms, as a consequence of their associativity, can easily be generalized to arity n > 2 by recursive application. For arity n = 1, we let each t-norm and t-conorm correspond to the identity mapping.

The minimum TM is the largest t-norm, and the drastic product TD, which is given by

TD(x, y) = min(x, y) if max(x, y) = 1, and TD(x, y) = 0 otherwise,   (6)

for all x, y ∈ [0, 1], is the smallest t-norm, i.e., TD ≤ T ≤ TM for every t-norm T. Other common t-norms are the algebraic product TP and the Lukasiewicz t-norm TL: TP(x, y) = x · y and TL(x, y) = max(0, x + y − 1), for all x, y ∈ [0, 1]. It can be proven that TL ≤ TP. Hence, TD ≤ TL ≤ TP ≤ TM.

Definition 12. The dual H∗ of an n-ary fuzzy aggregation operator H is defined as, for all x1, x2, . . . , xn ∈ [0, 1]:

H∗(x1, x2, . . . , xn) = 1 − H(1 − x1, 1 − x2, . . . , 1 − xn)   (7)


The dual of a t-norm T is a t-conorm T∗, and vice versa. One can easily verify that T∗M(x, y) = max(x, y), T∗P(x, y) = x + y − x · y, T∗L(x, y) = min(1, x + y), and

T∗D(x, y) = max(x, y) if min(x, y) = 0, and T∗D(x, y) = 1 otherwise,   (8)

for all x, y ∈ [0, 1]. The ordering is as follows: T∗M ≤ T∗P ≤ T∗L ≤ T∗D.

Fuzzy Similarity Measures

Definition 13. A fuzzy comparison measure is a binary fuzzy relation on F(X), i.e., a fuzzy set in F(X) × F(X).

We consider the following properties of a fuzzy comparison measure M [11]:

M(A,B) = 1 ⇐= A = B (reflexive)
M(A,B) = 1 =⇒ A = B (coreflexive)
M(A,B) = 1 ⇐= A ⊆ B ∨ B ⊆ A (strong reflexive)
M(A,B) = 1 =⇒ A ⊆ B ∨ B ⊆ A (weak coreflexive)
M(A,B) = 1 ⇐= A ⊆ B (inclusive)
M(A,B) = 1 =⇒ A ⊆ B (coinclusive)
M(A,B) = 0 ⇐= A ∩T B = ∅ (∩T-exclusive)
M(A,B) = 0 =⇒ A ∩T B = ∅ (∩T-coexclusive)
M(A,B) = M(B,A) (symmetric)
M(A,B) = M(A/supp A, B/supp A) (left-restrictable)
M(A,B) = M(A/supp B, B/supp B) (right-restrictable)
M(A,B) ≤ M(A/supp A, B/supp A) (weak left-restrictable)
M(A,B) ≤ M(A/supp B, B/supp B) (weak right-restrictable)

for all A, B ∈ F(X), with T a t-norm and C/Y, for C ∈ F(X), the restriction of C to Y ⊆ X, i.e., C/Y is the Y → [0, 1] mapping that associates C(x) with each x ∈ Y.

Definition 14. We call a fuzzy comparison measure a fuzzy similarity measure if it is reflexive.

Definition 15. We call a fuzzy similarity measure a fuzzy inclusion measure if it is both inclusive and coinclusive.

Definition 16. We call a fuzzy similarity measure a fuzzy resemblance measure if it is symmetric.


4 A Triparametric Family of Fuzzy Similarity Measures

In [11], we introduced a triparametric family of cardinality-based fuzzy similarity measures. All measures in this family are instances of a general form that depends on three parameters:

Definition 17. Let Γ be a binary fuzzy aggregation operator, and let ϕ1 and ϕ2 be [0, 1]³ → R mappings that are increasing in their first and second argument. The general form M^Γ_{ϕ1,ϕ2} is given by:

M^Γ_{ϕ1,ϕ2}(A,B) = ϕ1(||Γ(A,A)||, ||Γ(B,B)||, ||Γ(A,B)||) / ϕ2(||Γ(A,A)||, ||Γ(B,B)||, ||Γ(A,B)||)    (9)

for all A,B ∈ F(X), with ||·|| the relative sigma count, i.e., ||A|| = |A|/|X| for each A ∈ F(X).
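When X is finite and fuzzy sets are stored as NumPy arrays of membership degrees, (9) can be transcribed directly; a sketch building on the earlier snippets, with helper names that are ours, not the authors':

import numpy as np

def rel_sigma_count(A):
    # Relative sigma count ||A|| = |A| / |X| on a finite universe X
    return A.sum() / A.size

def general_form(phi1, phi2, Gamma):
    # Build the measure M^Γ_{ϕ1,ϕ2} of Definition 17
    def M(A, B):
        x = rel_sigma_count(Gamma(A, A))
        y = rel_sigma_count(Gamma(B, B))
        z = rel_sigma_count(Gamma(A, B))
        return phi1(x, y, z) / phi2(x, y, z)
    return M

# Example: M^TP_4, with ϕ1(x, y, z) = z and ϕ2(x, y, z) = sqrt(x * y)
M_TP_4 = general_form(lambda x, y, z: z,
                      lambda x, y, z: (x * y) ** 0.5,
                      t_prod)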

We proved in the same paper that the following theorems hold:

Theorem 1. Let Γ be an arbitrary fuzzy aggregation operator, and let ϕ1 and ϕ2 be [0, 1]³ → R mappings that are increasing in their first and second argument. The following implications hold:²

(∀x, y, z ∈ [0, 1])(0 ≤ ϕ1(x, y, z) ≤ ϕ2(x, y, z)) ⇒ M^Γ_{ϕ1,ϕ2} is [0, 1]-valued    (10)

(∀x ∈ [0, 1])(ϕ1(x, x, x) = ϕ2(x, x, x)) ⇒ M^Γ_{ϕ1,ϕ2} is reflexive    (11)

Theorem 2. Let T be an arbitrary t-norm, and let ϕ1 and ϕ2 be [0, 1]³ → R mappings that are increasing in their first and second argument. The following implications hold:

(∀x, y, z ∈ [0, 1])(min(x, y) ≤ z ≤ max(x, y) ⇒ ϕ1(x, y, z) = ϕ2(x, y, z)) ⇒ M^T_{ϕ1,ϕ2} is strong reflexive    (12)

(∀x, y, z ∈ [0, 1])(x ≤ z ≤ y ⇒ ϕ1(x, y, z) = ϕ2(x, y, z)) ⇒ M^T_{ϕ1,ϕ2} is inclusive    (13)

(∀x, y ∈ [0, 1])(ϕ1(x, y, 0) = 0) ⇒ M^T_{ϕ1,ϕ2} is ∩T-exclusive    (14)

(∀x, y, z ∈ [0, 1])(ϕ1(x, y, z) = ϕ1(y, x, z) ∧ ϕ2(x, y, z) = ϕ2(y, x, z)) ⇒ M^T_{ϕ1,ϕ2} is symmetric    (15)

(∀x, z ∈ [0, 1])(∀u, v ∈ [0, 1])(ϕ1(x, u, z) = ϕ1(x, v, z) ∧ ϕ2(x, u, z) = ϕ2(x, v, z)) ⇒ M^T_{ϕ1,ϕ2} is left-restrictable    (16)

(∀y, z ∈ [0, 1])(∀u, v ∈ [0, 1])(ϕ1(u, y, z) = ϕ1(v, y, z) ∧ ϕ2(u, y, z) = ϕ2(v, y, z)) ⇒ M^T_{ϕ1,ϕ2} is right-restrictable    (17)

(∀x, z ∈ [0, 1])(∀u, v ∈ [0, 1])(ϕ1(x, u, z) = ϕ1(x, v, z)) ⇒ M^T_{ϕ1,ϕ2} is weak left-restrictable    (18)

(∀y, z ∈ [0, 1])(∀u, v ∈ [0, 1])(ϕ1(u, y, z) = ϕ1(v, y, z)) ⇒ M^T_{ϕ1,ϕ2} is weak right-restrictable    (19)

² A mapping f from a set D to R is [0, 1]-valued if 0 ≤ f(d) ≤ 1 for all d ∈ D.

Theorem 3. Let ϕ1 and ϕ2 be [0, 1]³ → R mappings that are increasing in their first and second argument. The following implications hold:

(∀x, y, z ∈ [0, 1])(z ≤ √(x · y) ⇒ 0 ≤ ϕ1(x, y, z) ≤ ϕ2(x, y, z)) ⇒ M^TP_{ϕ1,ϕ2} is [0, 1]-valued    (20)

(∀x, y, z ∈ ]0, 1])(z ≤ √(x · y) ⇒ ϕ1(x, y, z) > 0) ⇒ M^TP_{ϕ1,ϕ2} is ∩TP-coexclusive    (21)

Theorem 4. Let ϕ1 and ϕ2 be [0, 1]³ → R mappings that are increasing in their first and second argument. The following implications hold:

(∀x, y, z ∈ [0, 1])(z ≤ min(x, y) ⇒ 0 ≤ ϕ1(x, y, z) ≤ ϕ2(x, y, z)) ⇒ M^TM_{ϕ1,ϕ2} is [0, 1]-valued    (22)

(∀x, y, z ∈ [0, 1])(z < max(x, y) ∧ z ≤ min(x, y) ⇒ ϕ1(x, y, z) < ϕ2(x, y, z)) ⇒ M^TM_{ϕ1,ϕ2} is coreflexive    (23)

(∀x, y ∈ [0, 1])(ϕ1(x, y, min(x, y)) = ϕ2(x, y, min(x, y))) ⇒ M^TM_{ϕ1,ϕ2} is strong reflexive    (24)

(∀x, y, z ∈ [0, 1])(z < min(x, y) ⇒ ϕ1(x, y, z) < ϕ2(x, y, z)) ⇒ M^TM_{ϕ1,ϕ2} is weak coreflexive    (25)

(∀x, y ∈ [0, 1])(x ≤ y ⇒ ϕ1(x, y, x) = ϕ2(x, y, x)) ⇒ M^TM_{ϕ1,ϕ2} is inclusive    (26)

(∀x, y, z ∈ [0, 1])(z < x ∧ z ≤ y ⇒ ϕ1(x, y, z) < ϕ2(x, y, z)) ⇒ M^TM_{ϕ1,ϕ2} is coinclusive    (27)

(∀x, y, z ∈ ]0, 1])(z ≤ min(x, y) ⇒ ϕ1(x, y, z) > 0) ⇒ M^TM_{ϕ1,ϕ2} is ∩TM-coexclusive    (28)

For this chapter, we restrict ourselves to the fuzzy similarity measures listed in Table 1. As indicated in the second and third column, all of these measures are members of the above-mentioned family. It is not hard to see that the antecedent of (20) is not satisfied for the parameters of M^TP_1, M^TP_2, M^TP_3 and M^TP_11. Therefore, we omitted the expressions of these measures. Furthermore, note that we used the equality |A ∪T∗M B| = |A| + |B| − |A ∩TM B| to shorten some of the expressions.

Using Theorems 1–4, we can prove properties of the considered fuzzy similarity measures. Table 2 indicates which properties can be proven in this way.


Table 1. The considered cardinality-based fuzzy similarity measures

          ϕ1(x, y, z)   ϕ2(x, y, z)   Γ = TM                          Γ = TP
M^Γ_1     z             x             |A ∩TM B| / |A|                 n/a
M^Γ_2     z             y             |A ∩TM B| / |B|                 n/a
M^Γ_3     z             min(x, y)     |A ∩TM B| / min(|A|, |B|)       n/a
M^Γ_4     z             √(x · y)      |A ∩TM B| / √(|A| · |B|)        |A ∩TP B| / √(|A ∩TP A| · |B ∩TP B|)
M^Γ_5     z             (x + y)/2     2|A ∩TM B| / (|A| + |B|)        2|A ∩TP B| / (|A ∩TP A| + |B ∩TP B|)
M^Γ_6     z             max(x, y)     |A ∩TM B| / max(|A|, |B|)       |A ∩TP B| / max(|A ∩TP A|, |B ∩TP B|)
M^Γ_7     z             x + y − z     |A ∩TM B| / |A ∪T∗M B|          |A ∩TP B| / (|A ∩TP A| + |B ∩TP B| − |A ∩TP B|)
M^Γ_8     min(x, y)     x + y − z     min(|A|, |B|) / |A ∪T∗M B|      min(|A ∩TP A|, |B ∩TP B|) / (|A ∩TP A| + |B ∩TP B| − |A ∩TP B|)
M^Γ_9     √(x · y)      x + y − z     √(|A| · |B|) / |A ∪T∗M B|       √(|A ∩TP A| · |B ∩TP B|) / (|A ∩TP A| + |B ∩TP B| − |A ∩TP B|)
M^Γ_10    (x + y)/2     x + y − z     (|A| + |B|) / (2|A ∪T∗M B|)     (|A ∩TP A| + |B ∩TP B|) / (2(|A ∩TP A| + |B ∩TP B| − |A ∩TP B|))
M^Γ_11    max(x, y)     x + y − z     max(|A|, |B|) / |A ∪T∗M B|      n/a
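Several of the measures in Table 1 follow immediately from the general_form sketch given earlier, for instance (again with our hypothetical helpers, not the authors' code):

phi1 = lambda x, y, z: z
M_TM_4 = general_form(phi1, lambda x, y, z: (x * y) ** 0.5, t_min)
M_TM_5 = general_form(phi1, lambda x, y, z: (x + y) / 2.0, t_min)
M_TM_7 = general_form(phi1, lambda x, y, z: x + y - z, t_min)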

We refer to [11] for some example proofs. The advantage of the (weak) restrictable fuzzy similarity measures will be explained further in this chapter. For the other properties, we do not elaborate on their practical use. However, it is easy to see that, depending on the intended application, these remaining properties can be important as well.

5 Fuzzy Audio Similarity Measures

5.1 In General

Henceforth, let F denote the set of all possible audio fragments.

Definition 18. An audio similarity measure is an F × F → R mapping that associates with each pair of audio fragments a real number that represents the similarity between these fragments.


Table 2. Properties of the considered fuzzy similarity measures that can be proven using the presented theorems. In the original table, check marks indicate which of the following properties hold for each of the measures M^TM_1, M^TM_2, M^TM_3, M^TM_4, M^TP_4, M^TM_5, M^TP_5, M^TM_6, M^TP_6, M^TM_7, M^TP_7, M^TM_8, M^TP_8, M^TM_9, M^TP_9, M^TM_10, M^TP_10 and M^TM_11. Legend: (UV) [0, 1]-valued; (RE) reflexive; (CR) coreflexive; (SR) strong reflexive; (WC) weak coreflexive; (IN) inclusive; (CI) coinclusive; (EX) ∩Γ-exclusive; (CE) ∩Γ-coexclusive; (SY) symmetric; (LR) left-restrictable; (RR) right-restrictable; (WL) weak left-restrictable; (WR) weak right-restrictable.


We use the notation M for the set of all audio similarity measures. As explained previously in this chapter, an audio similarity measure usually consists of two stages. First, an F → R^d mapping is used to extract a d-dimensional feature vector, with d ∈ N \ {0}, from each audio fragment, and then the similarity between the two feature vectors is computed by means of an R^d × R^d → R mapping.

Definition 19. A fuzzy audio similarity measure is a binary fuzzy relation on F, i.e., an F × F → [0, 1] mapping, that associates with each pair of audio fragments a degree of similarity.

Thus, F(F × F) is the set of all fuzzy audio similarity measures. Obviously, we have F(F × F) ⊂ M.

5.2 Based on SHs and FPs

Recall that a fuzzy similarity measure is an F(X) × F(X) → [0, 1] mapping. Hence, if we can identify the feature vectors with fuzzy sets, then a fuzzy similarity measure can be used to implement the similarity measurement stage of a fuzzy audio similarity measure. We use this approach to construct fuzzy audio similarity measures based on SHs and FPs. More precisely, we consider the fuzzy audio similarity measures that compare normalized SHs and FPs using one of the fuzzy similarity measures listed in Table 1.

Normalization

Since SHs and FPs consist of values from [0,+∞[, they can be converted to fuzzy sets by means of normalization, i.e., dividing each value by the maximum value. In practice, normalization is not always required. Namely, one can easily verify that normalization is not necessary if the fuzzy similarity measure M satisfies

M(A,B) = M(a ∗ A, b ∗ B) (29)

for all A,B ∈ F(X) and a, b ∈ ]0,+∞[, with c ∗ C, for (c, C) ∈ ]0,+∞[ × F(X), the X → [0,+∞[ mapping defined by (c ∗ C)(x) = c · C(x), for all x ∈ X. It can easily be proven that (29) holds for M^TP_4. All other considered fuzzy similarity measures do not satisfy (29). However, if the feature vectors have the same maximum value, then it is sufficient that the fuzzy similarity measure M satisfies

M(A,B) = M(a ∗ A, a ∗ B)    (30)

for all A,B ∈ F(X) and a ∈ ]0,+∞[. Most of the considered fuzzy similarity measures satisfy (30), but unfortunately it is not often the case that the maximum values of the feature vectors are equal in practice. In particular, this is generally not true for SHs and FPs.
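The invariance (29) of M^TP_4 is easy to check numerically; a small sketch reusing the M_TP_4 construction from earlier (our code, not the authors'):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random(1800)   # e.g. an SH or FP flattened to 1,800 values
B = rng.random(1800)
print(M_TP_4(A, B))              # some value in [0, 1]
print(M_TP_4(3.0 * A, 0.5 * B))  # the same value, up to floating-point error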


Restricting Computation Time

In Fig. 1a, white and black depict zero and the maximum value, respectively. Hence, we identify this SH with a fuzzy set A by interpreting black as one and white as zero. Since a large portion of the figure is white, supp A will contain considerably fewer elements than X. This will be the case for most SHs, because the higher loudness levels are rarely reached. When restrictable fuzzy similarity measures are used for comparing such fuzzy sets, we can restrict the computation time. For instance, we would normally calculate

(Σ_{x∈X} min(A(x), B(x))) / (Σ_{x∈X} A(x))    (31)

to determine the value of M^TM_1 for A,B ∈ F(X). However, since M^TM_1 is left-restrictable, we obtain the same value by calculating

(Σ_{x∈supp A} min(A(x), B(x))) / (Σ_{x∈supp A} A(x))    (32)

The latter form requires |supp A| comparisons and 2 · (|supp A| − 1) additions, while the former form needs |X| + 2 · (|X| − 1) calculations. Hence, the latter form can be calculated substantially faster when supp A contains considerably fewer elements than X.
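With fuzzy sets stored as (sparse) NumPy arrays, the saving looks as follows; a sketch under the naming of the earlier snippets:

import numpy as np

def m_tm_1(A, B):
    # Full computation (31), summing over all of X
    return np.minimum(A, B).sum() / A.sum()

def m_tm_1_restricted(A, B):
    # Left-restricted computation (32): only the elements of supp A
    s = A > 0
    return np.minimum(A[s], B[s]).sum() / A[s].sum()

# Both return the same value, but the second touches only |supp A| elements.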

Weak restrictable fuzzy similarity measures can also reduce the computation time in practical applications. For instance, when searching for audio fragments that are very similar to a reference fragment by comparing SHs with the weak left-restrictable measure M^TM_7, we can first calculate the upper bound M^TM_7(A/supp A, B/supp A). Since we need to find high similarities in this case, there is no need to do the extra computations required to determine M^TM_7(A,B) when the upper bound M^TM_7(A/supp A, B/supp A) is small. More concretely, we only need to calculate the right term in the numerator and denominator of

(Σ_{x∈supp A} min(A(x), B(x)) + Σ_{x∈X\supp A} min(A(x), B(x))) / (Σ_{x∈supp A} max(A(x), B(x)) + Σ_{x∈X\supp A} max(A(x), B(x)))    (33)

if the quotient of the left terms is large enough. In this way, the computation time can be reduced substantially when there are a lot of audio fragments that are only slightly similar to the reference fragment.
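An early-exit sketch of this idea for M^TM_7 (the threshold and the function name are ours; the restricted quotient is a valid upper bound by weak left-restrictability):

import numpy as np

def m_tm_7_cutoff(A, B, threshold):
    # Assumes supp A is nonempty
    s = A > 0
    num = np.minimum(A[s], B[s]).sum()  # left terms of (33)
    den = np.maximum(A[s], B[s]).sum()
    upper = num / den                   # M^TM_7(A/supp A, B/supp A)
    if upper < threshold:
        return upper                    # B cannot be very similar to A: stop here
    # Right terms of (33); the minimum term vanishes since A is 0 outside supp A
    den += B[~s].sum()
    return num / den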


6 Experimental Results and Discussion

6.1 Evaluation

We evaluate the performance of a given audio similarity measure by examining the ordering generated by it when we use it to arrange the audio fragments of a test collection according to their similarity with a reference fragment. Formally, we define a test collection as follows:

Definition 20. A test collection is a couple (F,∼), with F ∈ PF(F) and ∼ an equivalence relation on F modelling "is very similar to".

We use the notation T for the set of all possible test collections. Now, suppose that (F,∼) ∈ T. For a reference fragment a ∈ F and an audio similarity measure M, we can then use the normalized average rank (NAR) [12] to evaluate the ordering of the elements of F according to their similarity with a, generated by M:

Definition 21. The normalized average rank is the T × M × F → [0, 1] mapping NAR given by:

NAR((F,∼),M, a) = eval(ranks((F,∼),M, a)) / |F|    (34)

for all ((F,∼),M, a) ∈ T × M × F, with eval the PF(N \ {0}) → R mapping such that, for all N ∈ PF(N \ {0}),

eval(N) = (1/|N|) [ (Σ_{n∈N} n) − (Σ_{n=1}^{|N|} n) ]    (35)

and ranks the T × M × F → PF(N \ {0}) mapping given by:

ranks((F,∼),M, a) = {rank_{F,M,a}(b) | b ∈ F ∧ a ∼ b}    (36)

for all ((F,∼),M, a) ∈ T × M × F, where rank_{F,M,a} is the F → N \ {0} mapping that associates with each fragment in F its rank number in the ordering according to the similarity with a, generated by M.

The NAR is 0 for perfect performance, and approaches 1 as the performance worsens. For instance, suppose that F = {a1, a2, b1, b2} is a set of audio fragments such that {a1, a2} and {b1, b2} are the equivalence classes of very similar fragments, i.e., ∼ is the equivalence relation on F that satisfies a1 ∼ a2 and b1 ∼ b2. Now, let M be a fuzzy audio similarity measure that generates the following values:

M     a1    a2    b1    b2
a1    1     0.9   0.3   0.5
a2    0.9   1     0.4   0.8
b1    0.3   0.4   1     0.7
b2    0.5   0.8   0.7   1


We then obtain the sequence (a1, a2, b2, b1) if we order the elements of F according to their similarity with a1, i.e., according to the values that M generates for {a1} × F = {(a1, a1), (a1, a2), (a1, b1), (a1, b2)}. Hence,

ranks((F,∼),M, a1) = {1, 2}    (37)

and thus

NAR((F,∼),M, a1) = ((1 + 2) − (1 + 2)) / (4 · 2) = 0    (38)

The NAR is 0 in this case because the obtained ordering is perfect, i.e., all fragments that are very similar to a1 are placed up front. Similarly, we have NAR((F,∼),M, a2) = NAR((F,∼),M, b1) = 0. For b2, however, we get

ranks((F,∼),M, b2) = {1, 3}    (39)

and hence

NAR((F,∼),M, b2) = ((1 + 3) − (1 + 2)) / (4 · 2) = 0.125    (40)

In this case, the NAR is larger than 0 since a2 is placed before b1 when M is used to order the elements of F according to their similarity with b2.

Since the NAR can vary a lot for different reference audio fragments, we calculate the global NAR (GNAR), which is the arithmetic mean of all NARs:

Definition 22. The global normalized average rank is the T × M → [0, 1] mapping GNAR given by:

GNAR((F,∼),M) = (1/|F|) Σ_{a∈F} NAR((F,∼),M, a)    (41)

for all ((F,∼),M) ∈ T × M.

The smaller the GNAR, the better the performance. For example, the GNAR for the F, ∼ and M considered in the above-mentioned example is equal to (0 + 0 + 0 + 0.125)/4 = 0.03125. This indicates that, for the audio fragments in F, the performance of M is very good, but not perfect.
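A direct transcription of (34)–(36) and (41) reproduces the worked example above (our code; sim holds the table of similarity values and labels the equivalence classes):

import numpy as np

def nar(sim, labels, a):
    # Normalized average rank (34) of reference fragment a
    order = np.argsort(-sim[a])  # fragments ordered by similarity with a
    ranks = [i + 1 for i, b in enumerate(order) if labels[b] == labels[a]]
    n = len(ranks)
    return (sum(ranks) - n * (n + 1) // 2) / (n * len(labels))  # eval(N) / |F|

sim = np.array([[1.0, 0.9, 0.3, 0.5], [0.9, 1.0, 0.4, 0.8],
                [0.3, 0.4, 1.0, 0.7], [0.5, 0.8, 0.7, 1.0]])
labels = [0, 0, 1, 1]  # equivalence classes {a1, a2} and {b1, b2}
print(np.mean([nar(sim, labels, a) for a in range(4)]))  # GNAR: 0.03125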

6.2 Test Collection

We used the BEPOP test collection for this chapter.3 This collection consistsof samples of 128 songs that recently appeared in a Belgian hitlist. We ex-tracted three fragments of nine seconds from each sample. Fragments of thesame sample (and hence the same song) are considered very similar, i.e., a ∼ bholds for two audio fragments a and b if a and b are fragments from the samesample.3 http://users.ugent.be/∼klbostee/bepop.


6.3 Results

Figure 3 shows the results of our experiments. We compared the considered fuzzy audio similarity measures with the Euclidean distance d between the SHs or FPs, interpreted as 1,800-dimensional vectors. Moreover, we also evaluated the performance of the Euclidean distance between normalized SHs or FPs. This normalized Euclidean distance is denoted by d′. The difference between the performance of d and d′ turns out to be very small. Namely, d′ performs slightly worse. Hence, we do not gain anything by normalizing the SHs or FPs before taking the Euclidean distance. However, normalized SHs or FPs clearly lead to better results when we compare them with M^TP_4, and the performance of some of the remaining fuzzy similarity measures is similar to the performance of d.

Overall, we see that FPs tend to perform better than SHs. A possible explanation for this observation is that SHs contain less information, since the higher loudness levels are rarely reached. Also, rhythm might be more useful than timbre to discriminate the songs in the BEPOP collection. Concerning the choice of scales, there does not appear to be an overall tendency. For M^TP_4, however, it is clear that the Mel spectrogram leads to slightly better performance than the sonogram.

To conclude this section, we explain why using TP instead of TM appears to magnify the performance, i.e., TP leads to better performance when TM performs well, and even worse performance when TM performs badly. This observation can be attributed to the fact that TM is noninteractive. For instance, consider the fuzzy sets shown in Fig. 2. In this figure, A, B and C are normalized FPs. A and B were both derived from a fragment of "You're beautiful" by James Blunt, while C corresponds with a fragment of "Rudebox" by Robbie Williams. Hence, A and B are more similar than A and C. Because of the noninteractivity of TM, however, we have |A ∩TM B| ≈ |A ∩TM C|, and thus M^TM_4(A,B) < M^TM_4(A,C) since |A| = |A| and |B| > |C|. Hence, M^TM_4 gives counterintuitive results in this case because there is practically no difference between A ∩TM B and A ∩TM C, as a consequence of the noninteractivity of TM.

Fig. 2. Example that illustrates the difference between M^TM_4 and M^TP_4. White depicts 0, and black represents 1


Fig. 3. The GNAR for each considered fuzzy similarity measure, calculated for the BEPOP test collection. The chart plots GNAR values between 0 and 0.25 for d, d′ and the measures of Table 1, separately for Mel spectrogram-based SHs, Mel spectrogram-based FPs, sonogram-based SHs and sonogram-based FPs


For TP, however, we can quite clearly notice a difference between A ∩TP B and A ∩TP C in Fig. 2. In fact, we have |A ∩TP B| > |A ∩TP C|. This compensates |B ∩TP B| > |C ∩TP C|, so that M^TP_4(A,B) > M^TP_4(A,C).

7 Conclusion

The BEPOP test collection is quite small, and hence we have to be careful when we base conclusions on it. Nevertheless, our experiments do indicate that fuzzy similarity measures can perform as well as, or even better than, the Euclidean distance for comparing SHs or FPs. In particular, we noticed that M^TP_4 is very suitable for this task. Moreover, this measure does not require normalization, and its computation time can be restricted in certain practical applications since it is weak left- and right-restrictable.

Actually, it is not that surprising that M^TP_4 performs well. After all, one can easily verify that it corresponds with the cosine similarity measure, which has already been used successfully for comparing other types of feature vectors (e.g. [2]), apart from the fuzzy framework. However, we explained in the previous section that M^TP_4 can be regarded as an improved version of M^TM_4, and that other fuzzy similarity measures can be improved in the same way. This general insight can be considered to be more important than the absolute performance of the constructed audio similarity measures.

8 Future Work

We have only scratched the surface of the extensive range of possibilities that arise when audio feature vectors are identified with fuzzy sets. Obviously, investigating the use of other feature vectors and other fuzzy similarity measures is a possible direction of future research. Furthermore, it would be very interesting to examine the influence of the properties of the fuzzy similarity measures on the performance of the corresponding audio similarity measures. In any case, it should be worthwhile to conduct more elaborate experiments to analyse the performance of the obtained fuzzy audio similarity measures.

References

1. Aucouturier J J, Pachet F (2002) Music similarity measures: What's the use? In: Proceedings of the ISMIR International Conference on Music Information Retrieval
2. Cooper M, Foote J (2002) Automatic music summarization via similarity analysis. In: Proceedings of the ISMIR International Conference on Music Information Retrieval
3. Pampalk E (2006) Computational models of music similarity and their application in music information retrieval. PhD thesis, Vienna University of Technology
4. Pampalk E, Dixon S, Widmer G (2003) Exploring music collections by browsing different views. In: Proceedings of the ISMIR International Conference on Music Information Retrieval
5. Pampalk E, Rauber A, Merkl D (2002) Content-based organization and visualization of music archives. In: Proceedings of the ACM International Conference on Multimedia, 570–579
6. Rauber A, Pampalk E, Merkl D (2003) Journal of New Music Research 32:193–210
7. Logan B, Salomon A (2001) A music similarity function based on signal analysis. In: Proceedings of the International Conference on Multimedia and Expo, 745–748
8. Mandel M, Ellis D (2005) Song-level features and support vector machines for music classification. In: Proceedings of the ISMIR International Conference on Music Information Retrieval
9. Pampalk E, Dixon S, Widmer G (2003) On the evaluation of perceptual similarity measures for music. In: Proceedings of the International Conference on Digital Audio Effects, 7–12
10. Pampalk E (2003) A Matlab toolbox to compute music similarity from audio. In: Proceedings of the ISMIR International Conference on Music Information Retrieval
11. Bosteels K, Kerre E E (2007) Fuzzy Sets and Systems 158(22):2466–2479
12. Muller H, Muller W, McG Squire D, Marchand-Maillet S, Pun T (2001) Pattern Recognition Letters 22:593–601


Fuzzy Techniques for Text Localisation in Images

Przemysław Gorecki¹, Laura Caponetti², and Ciro Castiello²

¹ Department of Mathematics and Computer Science, University of Warmia and Mazury, ul. Oczapowskiego 2, 10-719 Olsztyn, Poland, [email protected]
² Department of Computer Science, University of Bari, via E. Orabona, 4-70125 Bari, Italy, [email protected], [email protected]

Summary. Text information extraction represents a fundamental issue in the context of digital image processing. Inside this wide area of research, a number of specific tasks can be identified, ranging from text detection to text recognition. In this chapter, we deal with the particular problem of text localisation, which aims at determining the exact location where the text is situated inside a document image. The strict connection between text localisation and image segmentation is highlighted in the chapter and a review of methods for image segmentation is proposed. Particularly, the benefits coming from the employment of fuzzy and neuro-fuzzy techniques in this field are assessed, thus indicating a way to combine Computational Intelligence methods and document image analysis. Three peculiar methods based on image segmentation are presented to show different applications of fuzzy and neuro-fuzzy techniques in the context of text localisation.

1 Introduction

Text information represents a very important component among the contents of a digital image. This kind of information is related to the category usually referred to as semantic content. By contrast with perceptual content, related to low-level characteristics including colour, intensity or texture, semantic content involves recognition of components, such as text, objects or graphics inside a document image [1–3]. The importance of achieving text information by means of image analysis is straightforward. In fact, text can be used to describe the content of a document image, can be converted into electronic formats (for memorisation and archiving purposes), and can be exploited to ultimately understand documents, thus enabling a plethora of applications ranging from document indexing to information extraction and automatic annotation of documents [4–6]. Additionally, with the increasing use of web documents, a lot of multimedia content is available having different page representation forms, which do not lend easily to automatic analysis.

Text stands as the most appropriate medium for allowing a suitable analysis of such contexts, with additional benefits deriving from possible conversions into other multimedia modalities (such as voice signal), or representations in natural language of the web page contents. The recognition of text in images is a step towards achieving such a representation [7].

The presence of text inside a digital image can be characterised by different properties: text size, alignment, spacing and colour. In particular, text can exhibit varying size, since text dimension is information that cannot be assumed a priori. Also, text alignment and text spacing are relevant properties that can variegate a document appearance in several ways, and presumptions about horizontal alignment of text can be made only when specific contexts are investigated. Usually text characters tend to have the same (or similar) colours inside an image; however, the chromatic visualisation may represent a fundamental property, especially when contrasting colours are employed to enhance text among other image regions.

Automatic methods for text information extraction have been investigated in a comprehensive way, in order to define different mechanisms that, starting from a digital image, could ultimately derive plain text to be memorised or processed. By loosely referring to [8], we can define the following steps corresponding to the sequential sub-problems which characterise the general text information extraction task:

• Text Detection and Localisation. In some circumstances there is no certainty about the presence of text in a digital image, therefore the text detection step is devoted to the process of determining whether a text region is present or not inside the image under analysis. In this phase no proper text information is derived, but only a boolean response to a detection query. This is common when no a priori knowledge about the characteristics of an image is available. Once the presence of the text inside an image has been assessed, the next step is devoted to determining the exact location where the text is situated. This phase is often combined with different techniques purposely related to the problem of image segmentation, thus configuring text regions as specific components to be isolated in digital images.

• Text Tracking. Text tracking represents a support activity correlated to the previously described step of text localisation whenever the task of text information extraction is performed over motion images (such as videos). Even if this kind of process has been frequently overlooked in literature, it could prove its usefulness also to verify the results of the text detection and localisation steps or to shorten their processing times.

• Text Recognition and Understanding. Text recognition represents the ultimate step when analysing a digital image with the aim of deriving plain text to be stored or processed. This phase is commonly carried out by means of specific Optical Character Recognition (OCR) technologies. Moreover, text understanding aims to classify text in logical elements, such as headings, paragraphs, and so on.


In this chapter, we are going to address the localisation step; the interested reader can be referred to a number of papers directly devoted to the analysis of the other sub-problems [9–15]. Particularly, the additional contribution of this chapter consists in introducing novel text localisation approaches, based on fuzzy segmentation techniques.

When dealing with text localisation we are particularly involved with the problem of digital image segmentation. The amount and complexity of information in the images, together with the process of the image digitalisation, lead to a large amount of uncertainty in the image segmentation process. The adoption of the fuzzy paradigm is desirable in image processing because of the uncertainty and imprecision present in images, due to noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with the imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows to represent the knowledge about the given problem in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information.

The rest of the chapter is organised as follows. Section 2 is devoted to the presentation of a brief review of methods for image segmentation, proposing different lines of categorisation. Section 3 introduces some concepts related to fuzzy and neuro-fuzzy techniques, discussing their usefulness in the field of digital image processing. Specifically, the particular model of a neuro-fuzzy system is illustrated: its formalisation is useful for the subsequent presentation carried on in Sect. 4, where three peculiar text localisation approaches are reported for the sake of illustration. In Sect. 5 the outcomes of experimental results are reported and discussed. Section 6 closes the chapter with some conclusive remarks.

2 A Categorisation of Image Segmentation Approaches

Image segmentation is widely acknowledged to play a crucial role in many computer vision applications and its relevance in the context of the text localisation process has been already mentioned. In this section we are going to discuss this peculiar technique in the general field of document image analysis.

Image segmentation represents the first step of document image analysis, with the objective of partitioning a document image into some regions of interest. Generally, in this context, image segmentation is also referred to as page segmentation. High level computer vision tasks, related with text information extraction, often utilise information about regions extracted from document pages. In this sense, the final purpose of page segmentation is to classify different regions in order to discriminate among text and non-text areas.¹ Moreover, image segmentation is critical, because segmentation results will affect all subsequent steps of image analysis. In recent years image segmentation techniques have been variously applied for the analysis of different types of documents, with the aim of text information extraction [16–21].

¹ Non-text regions may be distinguished as graphics, pictures, background, and so on (in accordance with the requirements of the specific problem context).

Closely related to image segmentation is the problem of feature extraction. The goal is to extract the most salient characteristics of an image for the purpose of its segmentation: an effective set of features is one of the requirements for successful image segmentation. Information in the image, coded directly in pixel intensities, is highly redundant: the major problem here is the number of variables involved. Direct transformation of an image f(x, y) of size M × N to a point in an (M · N)-dimensional space is impractical, due to the number of dimensions involved. To solve this problem, the image representation must be simplified by minimising the number of dimensions needed to describe the image or some part of it. Therefore, a set of features is extracted from a region of interest in the image. It is common in literature to distinguish between natural features, defined by the visual appearance of the image (i.e. intensity of a region), and artificial features, such as intensity histograms, frequency spectra, or co-occurrence matrices [22]. Moreover, first-order statistical features, second-order statistics, and higher-order statistics can be distinguished, depending on the number of points defining the local feature [23, 24]. In the first case, features convey information about intensity distributions, while in the second case, information about pixel pairs is exploited in order to take into account spatial information of the distribution. In the third case, more than two pixels are considered. The second-order and higher-order features are especially useful in describing texture, because they can capture relations in the repeating patterns that define the visual appearance of a texture.

There is no single segmentation method that provides acceptable results for every type of image. General methods exist, but those which are designed for particular images often achieve better performance by utilising a priori knowledge about the problem. For our purposes, we are going to discuss peculiar segmentation methods by considering two distinct lines of classification (a diagram of the proposed categorisation is reported in Fig. 1).

Fig. 1. The categorisation of the image segmentation approaches (a diagram relating the top-down and bottom-up working mechanisms to the region-based, edge-based and texture-based methods)


On the one hand, by referring to the working mechanism of the segmentation approaches, it is possible to distinguish three classes: top-down approaches, bottom-up approaches and hybrid approaches. Top-down algorithms start from the whole document image and iteratively subdivide it into smaller regions (blocks). The subdivision is based on a homogeneity criterion: the splitting procedure stops when the criterion is met and blocks obtained at this stage constitute the final segmentation result. Some examples of top-down algorithms are reported in [25, 26]. Bottom-up algorithms start from document image pixels and cluster the pixels into connected components (such as characters). The procedure can be iterated, giving rise to a growing process which adjoins unconnected adjacent components, in order to cluster higher-order components (such as words, lines, document zones). Typical bottom-up algorithms can be found in [27–30]. Hybrid algorithms can be regarded as a mix of the previous approaches, thus configuring a procedure which involves both splitting and merging phases. Hybrid algorithms have been proposed in [31–33].

The second line of classification to categorise segmentation approaches relies on the features utilised during the process. Methods can be categorised into region-based methods, edge-based methods and texture-based methods. In the first case, properties such as intensity or colour are used to derive a set of features describing regions. Edge-based and texture-based methods, instead, derive a set of local features, concerning not only the analysis of a single pixel, but also its neighbourhood. In particular, the observation that image text regions have textural properties different from background or graphics represents the foundation of texture-based methods. In the following sections we discuss in more detail the above reported segmentation methods.

2.1 Region-Based Methods

Region-based methods for image segmentation use the colour or grey scale properties in a region; when text regions are to be detected, their differences with the corresponding properties of the background can be highlighted for the purpose of text localisation.

The key for region-based segmentation consists in firstly devising suitable methods for partitioning an image in a number of connected components, according to some specific homogeneity criteria to be applied during the image feature analysis. Once the initial subdivision of the image into a grid of connected regions is obtained, an iterative grouping process of similar regions is started in order to update the partition of the image. In this way, it is possible to create a final segmentation of regions which are meant to be purposely classified. It should be observed that the term "grouping" is used here in a loose sense. We intend to address a process which could originate an incremental or decremental assemblage of regions, with reference to region growing (bottom-up) methods, region splitting (top-down) methods and split-and-merge (hybrid) methods.


The analysis of the image features can be performed on the basis of different techniques: among them, thresholding represents one of the simplest methods for segmentation. In some images, an object can be easily separated from the background if the intensity levels of the object fall outside the range of intensity levels of the background. This represents a perfect case for applying a thresholding approach. Each pixel of the input image f(x, y) is compared with the threshold t in order to produce the segmented image l(x, y):

l(x, y) = 1 if f(x, y) > t (object), and 0 if f(x, y) ≤ t (background).    (1)

The selection of an appropriate threshold value is essential in this technique. Many authors have proposed to find the threshold value by means of an image histogram shape analysis [34–37]. Global thresholding techniques use a fixed threshold for all pixels in the image and therefore work well only if the intensity histograms of the object and background are well separated. Hence, these kinds of techniques cannot deal with images containing, for example, a strong illumination gradient. On the other hand, local adaptive thresholding selects an individual threshold for each pixel based on the range of intensity values in its local neighbourhood. This allows for thresholding of an image whose global intensity histogram does not contain distinctive peaks [38]. The thresholding approach has been successfully applied in many image segmentation problems with the goal of text localisation [39–41].
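As a concrete illustration, a minimal sketch of global thresholding as in (1), together with a simple local adaptive variant that thresholds each pixel against its local mean (the window size and offset are arbitrary choices of ours):

import numpy as np
from scipy.ndimage import uniform_filter

def global_threshold(f, t):
    # Eq. (1): 1 for object pixels, 0 for background
    return (f > t).astype(np.uint8)

def local_adaptive_threshold(f, window=15, offset=0.0):
    # Per-pixel threshold: the mean of a window x window neighbourhood
    local_mean = uniform_filter(f.astype(float), size=window)
    return (f > local_mean + offset).astype(np.uint8)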

Clustering can be seen as a generalisation of the thresholding technique. In fact, it allows for partitioning data into more than two clusters, dealing with a space of higher dimensionality than thresholding, where data are one-dimensional. Similarly to thresholding, clustering is performed in the image feature space, and it aims at finding structures in the collection of data, so that data can be classified into different groups (clusters). More precisely, data are partitioned into different subsets and data in each subset are similar in some way. During the clustering process, structures in data are discovered without any a priori knowledge and without providing an explanation or interpretation of why they exist [42]. Clustering techniques for image segmentation have been adopted for the purpose of text localisation [43–45].

2.2 Edge-Based Methods

Edge-based techniques, rather than finding regions by adopting a grouping process, aim at identifying explicit or implicit boundaries between regions.

Edge-based methods represent the earliest segmentation approaches and rely on the process of edge detection. The goal of edge detection is to localise the points in the image where abrupt changes in intensity take place. In document images, edges may appear on discontinuity points between the text and the background. The simplest mechanism to detect edges is the differential detection approach. As the images are two-dimensional, the gradient ∇ is calculated from the partial derivatives of the image f(x, y):


∇f(x, y) = ( ∂f(x, y)/∂x, ∂f(x, y)/∂y ).    (2)

The computation of the partial derivatives is usually realised by convolving the image with a given filter, which estimates the gradient. The maps of edge points obtained at the end of this process can be successively utilised by an edge tracking technique, so that the contours of different regions may be highlighted inside the image. Generally, the Canny operator, one of the most powerful edge filters, can be applied to detect edge points in document images [46].
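For instance, the gradient (2) can be estimated by convolving the image with Sobel filters; a short sketch assuming SciPy is available:

import numpy as np
from scipy.ndimage import sobel

def gradient_magnitude(f):
    gx = sobel(f.astype(float), axis=1)  # estimate of ∂f/∂x
    gy = sobel(f.astype(float), axis=0)  # estimate of ∂f/∂y
    return np.hypot(gx, gy)              # |∇f|, large at text/background edges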

In the case of text localisation, edge-based methods aim at exploiting the high contrast between the text and the background. The edges of the text boundary are identified and merged, and then several heuristics are used to filter out the non-text regions [47–49].

2.3 Texture-Based Methods

Texture-based methods consider a document image as a composite of textures of different classes. With this approach, various texture segmentation and classification techniques can be used directly or with some modifications. Some texture segmentation approaches apply splitting and merging or clustering methods to the feature vectors computed for the image and describing its texture information. When a document image is considered as texture, text regions are assumed to have texture features different from the non-text ones. Text regions are modelled as regular periodic textures, because they contain text lines with the same orientation. Also their interline spacings are approximately the same. Instead, non-text regions correspond to irregular textures. Generally, the problem is how to separate two or more different texture classes. Techniques based on Gabor filters, Wavelets, FFT, and spatial variance can be used to detect the textural properties of an image text region [50–52]. In the following, we describe two fundamental approaches: Gabor filtering and multi-scale techniques.

Gabor Filtering

Gabor filtering is a classical approach to describe textural properties of an image. A two-dimensional Gabor filter is a complex sinusoid (with a wavelength λ and a phase offset ψ) modulated by a two-dimensional Gaussian function (with an aspect ratio of γ). The Gabor filter, which has an orientation θ, is defined as follows:

G(x, y) = exp( −(x′² + γ² y′²) / (2σ²) ) cos( 2π x′/λ + ψ ),    (3)

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ.
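A direct transcription of (3) into a sampled filter kernel; the parameter values below are purely illustrative:

import numpy as np

def gabor_kernel(size, sigma, theta, lam, psi, gamma):
    # Sample the 2-D Gabor filter (3) on a size x size grid
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xt = x * np.cos(theta) + y * np.sin(theta)    # x'
    yt = -x * np.sin(theta) + y * np.cos(theta)   # y'
    return np.exp(-(xt**2 + gamma**2 * yt**2) / (2 * sigma**2)) * \
           np.cos(2 * np.pi * xt / lam + psi)

# An orientation-selective bank, e.g. four orientations:
bank = [gabor_kernel(31, sigma=4.0, theta=t, lam=8.0, psi=0.0, gamma=0.5)
        for t in np.linspace(0, np.pi, 4, endpoint=False)]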


In the context of text extraction, a filter bank consisting of several orientation-selective 2-D Gabor filters can be used to detect texture features of text and non-text components. As an illustrative example, in [53] the Gabor transform with m different spatial frequencies and p different orientations is applied to the input image, producing mp filtered images. A texture feature is computed as the mean value in small overlapping windows centred at each pixel. The values of each pixel in the n feature images form an n-dimensional feature vector. These vectors are grouped into K clusters using a squared-error clustering algorithm.

Multi-Scale Techniques

One problem associated with document texture based approaches is due to both large intra-class and inter-class variations in textural features. To solve this problem, multi-scale analysis and feature extraction at different scales have been introduced by some authors [54, 55]. In [56], Wavelet decomposition is used to define local energy variations in the image at several scales. The binary image, which is obtained by thresholding the local energy variation, is analysed by connected component-based filtering using geometric attributes such as size and aspect ratio. All the text regions, which are detected at several scales, are merged to give the final result.

Wavelet packet analysis is an important generalisation of Wavelet analysis [57, 58]. Wavelet packet functions are localisable in space, such as Wavelet functions, but offer more flexibility in the decomposition of signals. Wavelet packet approximators are based on translated and scaled Wavelet packet functions W_{j,b,k}, which are generated from the base function [59], according to the following equation:

W_{j,b,k}(t) = 2^{j/2} W_b(2^{−j}(t − k)),    (4)

where j is the resolution level, W_b is the Wavelet packet function generated by scaling and translating a mother Wavelet function, b is the number of oscillations (zero crossings) of W_b and k is the translation shift. In Wavelet packet analysis, a signal x(t) is represented as a sum of orthogonal Wavelet packet functions W_{j,b,k}(t) at different scales, oscillations and locations:

x(t) = Σ_j Σ_b Σ_k w_{j,b,k} W_{j,b,k}(t),    (5)

where each w_{j,b,k} is a Wavelet packet coefficient. To compute the Wavelet packet coefficients a fast splitting algorithm [60] is used, which is an adaptation of the pyramid algorithm [61] for the discrete Wavelet transform. The splitting algorithm differs from the pyramid algorithm by the fact that both low-pass (L) and high-pass (H) filters are applied to the detailed coefficients, in addition to the approximation coefficients, at each stage of the algorithm. Moreover, the splitting algorithm retains all the coefficients, including those at intermediate filtering stages.
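One stage of the splitting for a two-dimensional node can be sketched with Haar filters (a simplified illustration under our naming, not the exact implementation of [60]): both the low-pass and high-pass filters are applied along the rows and then along the columns, yielding the LL, LH, HL and HH children of a quadtree node.

import numpy as np

def split_1d(a, axis):
    # Pairwise Haar filtering with downsampling along `axis` (even length assumed)
    a = np.moveaxis(a, axis, -1)
    even, odd = a[..., 0::2], a[..., 1::2]
    low = (even + odd) / np.sqrt(2)   # low-pass (L)
    high = (even - odd) / np.sqrt(2)  # high-pass (H)
    return np.moveaxis(low, -1, axis), np.moveaxis(high, -1, axis)

def haar_split(node):
    row_lo, row_hi = split_1d(node, axis=1)  # filter the rows
    ll, lh = split_1d(row_lo, axis=0)        # then the columns
    hl, hh = split_1d(row_hi, axis=0)
    return {'LL': ll, 'LH': lh, 'HL': hl, 'HH': hh}

Applying haar_split recursively to all four children, and retaining every intermediate result, produces the quadtree of subbands discussed next.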


The Wavelet packet decomposition process can be represented with a quadtree in which the root node is assigned to the highest scale coefficients, that are the original image itself, while the leaves represent outputs of the LL, LH, HL and HH filters. Assuming that similar regions of an image have similar frequency characteristics, we infer that these characteristics are captured by some nodes of the quadtree. As a consequence, the proper selection of quadtree nodes should allow for localisation of similar regions in the image. Learning based methods are proposed for the automatic selection of nodes describing text or background, as we will illustrate in Sect. 4.3.

3 Fuzzy Techniques in Image Segmentation

In the previous section, we have discussed different techniques for image segmentation. Some of the feature extraction methods and most of the algorithms are based on crisp relations, comparisons and thresholding. Such constraints are not well suited to cope with the ambiguity and imprecision present in the images, which are very often degraded by noise coming from various sources such as imperfect capturing devices, image digitalisation and sampling. Fuzzy techniques provide a mathematical tool to deal with such imprecision and ambiguities in an elegant and efficient way, allowing to eliminate some of the drawbacks of classical segmentation algorithms. Additionally, the hybrid approach based on the integration of fuzzy logic and neural networks proved to be very fruitful. This hybridisation strategy allows to combine the benefits of both methods while eliminating their drawbacks. Neuro-fuzzy networks can be trained in a similar fashion as classical neural networks, but they are also capable of explaining the decision process by representing the knowledge in terms of fuzzy rules. Moreover, the rules can be discovered automatically from data and their parameters can be easily fine-tuned in order to maximise the classification accuracy of the system.

Neuro-fuzzy hybridisation belongs to the research field of Computational Intelligence, which is an emerging area in the field of intelligent systems development. This novel paradigm results from a partnership of different methodologies: Neural Computation, Fuzzy Logic, Evolutionary Programming. Such a consortium is employed to cope with the imprecision of real world applications, allowing the achievement of robustness, low solution cost and a better rapport with reality [62, 63]. In this section, we introduce the basics of fuzzy theory and neuro-fuzzy hybridisation, while discussing their relevance and application in the context of image analysis.

3.1 General Theory of Fuzzy Sets

The incentive for the development of fuzzy logic originates from observing that people do not require precise, numerical information in order to describe events or facts, but rather they do it by using imprecise and fuzzy linguistic terms.


Yet, they are able to draw the right conclusions from fuzzy information. The theory of fuzzy sets, underpinning the mechanisms of fuzzy logic, was introduced to deal mathematically with imprecise or vague information that is present in everyday life [64].

In bi-valued logic, any relation can be either true or false, as defined by crisp criteria of membership. For example, it is easy to determine precisely whether a variable x is greater than a certain number. On the other hand, evaluating whether x is much greater than a certain number is ambiguous. In the same way, when looking at a digital document image, we can say that the background is bright and the letters are dark. We are able to identify the above classes, despite the lack of precise definitions for the words "bright" and "dark": this question relies on the assumption that many objects do not have clear criteria of membership. Fuzzy logic allows to handle such situations, by introducing continuous intermediate states between true and false. This also allows to represent numerical variables in terms of linguistic labels. Actually, the means for dealing with such linguistic imprecision is the concept of fuzzy set, which permits a gradual degree of membership of an object in relation to a set.

Let X denote a universe of discourse, or space of points, with its elements denoted as x. A fuzzy set A is defined as a set of ordered pairs:

A = {(x, µA(x)) | x ∈ X}, (6)

where µA(x) is the membership function of A:

µA : X → [0, 1], (7)

representing the degree of membership of x in A. A single pair (x, µ(x)) is called a fuzzy singleton, thus a fuzzy set can be defined in terms of the union of its singletons. Based on the above definitions, an ordinary set can be derived by imposing the crisp membership condition µA(x) ∈ {0, 1}. Graphical examples of crisp and fuzzy sets are shown in Fig. 2.

Analogously, it is possible to extend operators of ordinary sets to their fuzzy counterparts, giving rise to fuzzy extensions of relations, definitions and so on [65, 66].

Fig. 2. An example of a crisp set and a fuzzy set with Gaussian membership function


In the following, we shall review different fuzzy image features, which are employed in the field of digital image processing. Moreover, we are interested in dealing with the peculiar aspects of fuzzy clustering and the definition of fuzzy and neuro-fuzzy systems.

3.2 Fuzzy Image Features

An M × N image f(x, y) can be represented as an array of fuzzy singletons, denoting pixel grey level intensities. However, due to the imprecise image formation process, it is more convenient to treat the pixel intensity (or some other image feature, such as edge intensity) as a fuzzy number, having a non-singleton membership function, rather than a crisp number (corresponding to the fuzzy singleton).

A fuzzy number is a fuzzy set defining a fuzzy interval for a real number, with a membership function that is piecewise continuous. One way of expressing fuzzy numbers is by means of triangular fuzzy sets. A triangular fuzzy number is defined as A = (a1, a2, a3), where a1 ≤ a2 ≤ a3 are the numbers describing the shape of a triangular membership function:

µA(x) = 0 for x < a1,
µA(x) = (x − a1)/(a2 − a1) for a1 ≤ x < a2,
µA(x) = 1 for x = a2,
µA(x) = (a3 − x)/(a3 − a2) for a2 < x ≤ a3,
µA(x) = 0 for x > a3.    (8)
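A compact vectorised transcription of (8):

import numpy as np

def triangular(x, a1, a2, a3):
    # Membership function (8) of the triangular fuzzy number A = (a1, a2, a3)
    x = np.asarray(x, dtype=float)
    rising = (x - a1) / (a2 - a1)
    falling = (a3 - x) / (a3 - a2)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

print(triangular([100, 128, 150], a1=96, a2=128, a3=160))  # [0.125 1. 0.3125]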

Fuzzy numbers can be applied to incorporate imprecision into image statistics (i.e. histograms). This allows to improve the noise invariance of this kind of features, which is especially important in some situations where the image statistics are derived from small regions, so that the number of observations is small.

Fuzzy Histogram

A crisp histogram represents the distribution of pixel intensities in the image with a certain number of bins, hence it reports the probability of observing a pixel with a given intensity. In order to obtain the histogram, the intensity value of each pixel in the image is accumulated in the bin corresponding to this value. In this way, for an image containing n pixels, a histogram representation H = {h(1), h(2), . . . , h(b)} can be obtained, comprising a number of b bins. Therefore h(i) = ni/n denotes the probability that a pixel belongs to the i-th intensity bin, where ni is the number of pixels in the i-th bin. However, as the measurements of the intensities are imprecise, each accumulated intensity should also affect the nearby bins, introducing a fuzziness in the histogram. The value of each bin in a fuzzy histogram represents a typicality of the pixel within the image rather than its probability.


The fuzzy histogram can be defined as FH = {fh(1), . . . , fh(b)}, where fh(i) is expressed as follows:

fh(i) = Σ_{j=1}^{n} µ_j(i),    i = 1, . . . , b,    (9)

where b is the number of bins (corresponding to the number of intensity levels), n is the number of pixels in the image and µ_j(i) is the membership degree of the intensity level of the j-th pixel with respect to the i-th bin. Therefore, µ_j(i) denotes the membership function of a fuzzy number, related to the value of the pixel intensity. The value fh(i) can be expressed as the linear convolution between the conventional histogram and the filtering kernel provided by the function µ_j(i). This approach is possible if all fuzzy numbers have the membership function of the same shape. Hence, the membership function µ_l of a fuzzy number corresponding to a crisp intensity level l can be expressed as µ_l(x) = µ(x − l), where µ denotes the general membership function, common to all fuzzy numbers accumulated in the histogram. By representing µ as a convolution kernel, the fuzzy histogram FH = {fh(1), . . . , fh(b)} is smoothed as follows:

fh(i) = (h ∗ µ)(i) = Σ_l h(i + l) µ(l),    i = 1, . . . , b,    (10)

where h(i) denotes the i-th bin of a crisp histogram. In [67] such a smoothing-based approach, where the influence from neighbouring bins is expressed by triangular membership functions, has been used to extract fuzzy histograms of grey images.
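A sketch of the smoothing (10) with a triangular kernel; the bin count and kernel values are arbitrary choices of ours:

import numpy as np

def fuzzy_histogram(image, bins=256):
    # Crisp histogram convolved with a triangular membership kernel, as in (10)
    h, _ = np.histogram(image.ravel(), bins=bins, range=(0, bins))
    mu = np.array([0.25, 0.5, 1.0, 0.5, 0.25])  # triangular kernel for µ
    return np.convolve(h.astype(float), mu, mode='same')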

Fuzzy Co-occurrence Matrix

Fuzzy co-occurrence matrix is another example of fuzzifying the crisp fea-ture measure. Similarly to the second-order statistic, it is often employed formeasuring the texture features of the images. The idea of the classical co-occurrence matrix is to accumulate in the matrix C the co-occurrences ofthe intensity values i = f(xi, yi) and j = f(xj , yj) of the pixels (xi, yi) and(xj , yj), given the spatial offset (δx, δy) separating the pixels. Therefore, thespatial co-occurrence of the intensities i and j will be accumulated in the binC(i, j) of the matrix, by increasing the value of the bin by one.

In the case of the fuzzy co-occurrence matrix F, the intensity values of the pixels (xi, yi) and (xj, yj) are represented by fuzzy numbers having the membership functions µi(x) and µj(x). Thus, not only the bin (i, j) should be incremented, but also its neighbouring bins. However, the amount of the increment ∆F(k, l) for the bin F(k, l) should depend on the fulfilment degrees of the membership functions µi(k) and µj(l), and the increment is calculated as follows:

$$
\Delta F(k, l) = \mu_i(k)\,\mu_j(l). \qquad (11)
$$


Similarly to the fuzzy histogram, a fuzzy co-occurrence matrix can be obtained from a crisp co-occurrence matrix by means of the convolution operator. However, as the matrix is two-dimensional, the convolution is performed first along its rows, and then along its columns.
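Since the smoothing is separable, the fuzzy co-occurrence matrix can be sketched as a crisp accumulation followed by two one-dimensional convolutions. The offset handling and the kernel below are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import convolve1d

def fuzzy_cooccurrence(image, offset=(0, 1), levels=256):
    """Fuzzy co-occurrence matrix as a smoothed crisp matrix: accumulate
    crisp co-occurrences for the given spatial offset, then convolve along
    rows and columns with a triangular kernel (an illustrative choice)."""
    f = image.astype(int)
    dy, dx = offset                      # non-negative offsets assumed
    h, w = f.shape
    C = np.zeros((levels, levels))
    for y in range(h - dy):
        for x in range(w - dx):
            C[f[y, x], f[y + dy, x + dx]] += 1
    mu = np.array([0.25, 0.5, 0.25])     # triangular membership kernel
    F = convolve1d(C, mu, axis=0)        # smooth along the first index
    F = convolve1d(F, mu, axis=1)        # then along the second index
    return F
```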

3.3 Fuzzy Systems

Fuzzy systems are designed to cope with imprecision of the input and output variables by defining fuzzy numbers and fuzzy sets that can be expressed by linguistic variables. The working scheme of a fuzzy system is based on a particular inference mechanism where the involved variables are characterised by a number of fuzzy sets with meaningful labels. For example, a pixel grey value can be described using the {"bright", "grey", "dark"} fuzzy sets, an edge can be characterised by the {"weak", "strong"} fuzzy sets, and so on.

In detail, each fuzzy system is designed to tackle a decision problem by means of a set of N fuzzy rules, called the fuzzy rule base R. The rules incorporate a number of fuzzy sets whose membership functions are usually designed by experts in the field of the problem at hand. The j-th fuzzy rule in a fuzzy rule base R has the general form:

Rj: If x1 is A^j_1 and x2 is A^j_2 and . . . and xn is A^j_n then y is B^j, j = 1, 2, . . . , N, (12)

where x = (x1, x2, . . . , xn) is an input vector, y is an output value, and A^j_i and B^j are fuzzy sets.

The overall process of fuzzy inference is articulated in consecutive steps [68]. At first, in order to infer the output from a crisp input, a fuzzification of the input values is needed; this is achieved by evaluating a degree of membership in each of the fuzzy sets describing the variable. In this way, an expression for the relation of the j-th rule can be found. By interpreting the rule implication by a conjunction-based representation², it is possible to express the relation of the j-th rule as follows:

$$
\mu_{R_j}(x_1, x_2, \ldots, x_n, y) = \mu_{A_1^j}(x_1) \wedge \mu_{A_2^j}(x_2) \wedge \ldots \wedge \mu_{A_n^j}(x_n) \wedge \mu_{B^j}(y), \qquad (13)
$$

where ∧ denotes the operator generalising the fuzzy “AND” connective. Theaggregation of all fuzzy rules in the rule base is achieved by:

$$
\mu_{R'}(x_1, x_2, \ldots, x_n, y) = \bigvee_{j=1}^{N} \mu_{R_j}(x_1, x_2, \ldots, x_n, y), \qquad (14)
$$

where ∨ is the operator generalising the fuzzy "OR" connective, and µR′ is a membership function characterising the fuzzy output variable.

² This kind of interpretation for an IF-THEN rule assimilates the rule with the Cartesian product of the input/output variable space. Such an interpretation is commonly adopted, like in the cases of the Mamdani [69] and Takagi-Sugeno-Kang (TSK) [70] models, but it does not represent the only kind of semantics for fuzzy rules [71].

The last step of the process is defuzzification, which assigns an appropriate crisp value to the fuzzy set R′ described by the membership function (14), such that an output crisp value is provided at the end of the inference process. For selecting this value, different defuzzification operators can be employed [72], among them the centre of area (evaluating the centroid of the fuzzy output membership function) and the mean (or smallest, or largest) of maxima, evaluating the mean (or smallest, or largest) of all maximum points of the membership function.

No standard techniques are applicable for transforming the human knowledge into a set of rules and membership functions. Usually, the first step is to identify and name the system inputs and outputs. Then, their value ranges should be specified and a fuzzy partition of each input and output should be made. The final step is the construction of the rule base and the specification of the membership functions for the fuzzy sets.

As an illustrative example, we show how fuzzy systems can be employed to obtain a simple process of text information extraction. Let us consider a decision task based on the classification of small image blocks as text or background. By examining the blocks extracted from the image, it can be observed that the background is usually bright, with little or no variation in grey-scale. On the other hand, text contains high variations in grey-scale, as a block either contains black text pixels and white background pixels, or it is black with a small grey-scale variance (in the case of larger heading fonts). The above observations allow us to formulate a set of rules, containing linguistic variables, with the employment of such features as the mean and the standard deviation of the pixel values:

• R1: IF mean is dark AND std. dev. is low THEN background is low.
• R2: IF mean is dark AND std. dev. is high THEN background is low.
• R3: IF mean is grey AND std. dev. is high THEN background is low.
• R4: IF mean is white AND std. dev. is low THEN background is high.

The foregoing simple rules allow us to infer the membership degree bi of the i-th block to the background class, while the membership degree ti to the text class can be obtained simply as ti = 1 − bi.
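A minimal sketch of this Mamdani-style inference is given below; the membership function shapes are hypothetical stand-ins for those of Fig. 3, so the numerical output will only be indicative of the behaviour, not identical to the system described next:

```python
import numpy as np

def trapmf(x, a, b, c, d):
    """Trapezoidal membership with feet a, d and shoulders b, c."""
    x = np.asarray(x, dtype=float)
    rise = np.clip((x - a) / (b - a), 0.0, 1.0)
    fall = np.clip((d - x) / (d - c), 0.0, 1.0)
    return np.minimum(rise, fall)

def classify_block(mean, std):
    """Evaluate rules R1-R4: min for AND, max for aggregation,
    centre-of-area defuzzification of the 'background' output variable."""
    dark  = float(trapmf(mean, -1, 0, 64, 128))    # hypothetical partitions
    grey  = float(trapmf(mean, 64, 127, 129, 192))
    white = float(trapmf(mean, 128, 192, 255, 256))
    low   = float(trapmf(std, -1, 0, 16, 48))
    high  = float(trapmf(std, 16, 48, 128, 129))
    act_low  = max(min(dark, low), min(dark, high), min(grey, high))  # R1-R3
    act_high = min(white, low)                                        # R4
    y = np.linspace(0.0, 1.0, 101)       # universe of the output variable
    out = np.maximum(np.minimum(trapmf(y, -0.1, 0.0, 0.2, 0.5), act_low),
                     np.minimum(trapmf(y, 0.5, 0.8, 1.0, 1.1), act_high))
    b = float((y * out).sum() / out.sum()) if out.sum() > 0 else 0.5
    return b, 1.0 - b                    # background / text degrees

b, t = classify_block(mean=193, std=32)  # illustrative crisp inputs
```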

In order to obtain the segmentation of a document image, the image should be partitioned into a regular grid of small blocks (e.g. of size 4×4 or 8×8 pixels, depending on the size of the image). Successively, the fuzzy rules are evaluated based on the features of each block. Figure 3 illustrates the sets of membership functions defined for the input values. Figure 4 illustrates the inference process for a sample input value: each row corresponds to one of the rules in the rule base previously described, with two input membership functions and one output membership function. Degrees of membership (vertical lines) are calculated based on illustrative crisp inputs (mean = 193, std. dev. = 32). The activation function of each rule is calculated by adopting the min function, according to (13). Finally, all activation functions are aggregated using the max function, according to (14). The crisp value (equal to 0.714, as shown in Fig. 4) is calculated by defuzzifying the output, employing the centre of area method. Results obtained by employing this approach on a sample document image are presented in Fig. 5.

Fig. 3. Membership functions of the variables mean (a) and std. dev. (b) employed for segmentation of document images

Fig. 4. Fuzzy inference process performed over illustrative input values

3.4 Fuzzy C-Means Clustering

Traditional clustering approaches generate partitions where each pattern belongs to one and only one cluster. Fuzzy clustering extends this notion using the concept of membership function; in this way, the output of this kind of fuzzy algorithm is a fuzzy clustering rather than a crisp partition. The Fuzzy C-Means method of clustering was developed by Dunn [73] and improved by Bezdek [74], and it is frequently used in data clustering problems. The Fuzzy C-Means (FCM) is a partitional method derived from K-Means clustering [75]. The main difference between FCM and K-Means is that the former allows a piece of data to belong to many clusters with certain membership degrees. In other words, the partitioning of the data is fuzzy rather than crisp.

Fig. 5. Document image segmentation with the employment of a fuzzy system. Original document image (a), obtained segmentation (b)

Given the number of clusters m, the distance metric d(x, y) and an objective function J, the goal is to assign the samples {x_i}_{i=1}^{k} to clusters.

In particular, the Fuzzy C-Means algorithm is based on minimisation of the following objective function:

$$
J_s = \sum_{j=1}^{m} \sum_{i=1}^{k} (u_{ij})^s \, d(x_i, c_j)^2, \qquad 1 < s < \infty, \qquad (15)
$$

where the distance metric d(x, y) is represented by any norm expressing the similarity between the measured data and the cluster centres (most frequently, the Euclidean distance); s is the parameter determining the fuzziness of the clustering; m is the number of clusters; k is the number of observations; and uij is the membership degree of observation xi belonging to cluster cj, calculated as follows:

$$
u_{ij} = \frac{1}{\sum_{l=1}^{m} \left( \dfrac{d(x_i, c_j)}{d(x_i, c_l)} \right)^{\frac{2}{s-1}}}. \qquad (16)
$$

The values of the membership degrees are constrained to be positive and satisfy $\sum_{j=1}^{m} u_{ij} = 1$.
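Equations (15)-(16) lead to the usual alternating update scheme between cluster centres and membership degrees. The following sketch (initialisation, iteration count and stopping rule are our choices) implements it with NumPy:

```python
import numpy as np

def fuzzy_c_means(X, m=2, s=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch following (15)-(16). X is (k, d);
    returns cluster centres (m, d) and the membership matrix U (k, m)."""
    rng = np.random.default_rng(seed)
    k = X.shape[0]
    U = rng.random((k, m))
    U /= U.sum(axis=1, keepdims=True)          # rows sum to 1
    for _ in range(n_iter):
        W = U ** s
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Pairwise Euclidean distances d(x_i, c_j)
        D = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        D = np.fmax(D, 1e-12)                  # avoid division by zero
        # Membership update, eq. (16)
        ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (s - 1.0))
        U = 1.0 / ratio.sum(axis=2)
    return centres, U
```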

It should be observed that the Fuzzy C-Means does not incorporate any spatial dependencies between the observations, which may degrade the overall segmentation results, because the obtained homogeneous regions are likely to be disjoint, irregular and noisy. However, it is possible to penalise the objective function (15) in order to constrain the membership functions in FCM to be spatially smooth. This penalty is used to discourage spatially undesirable configurations of membership values, i.e. high membership values surrounded by low membership values of the same cluster, or adjacent high membership values of different clusters. Examples of such penalised objective functions were proposed in [76].

The Fuzzy C-Means method has been applied in a variety of image segmentation problems, such as medical imaging [77] or remote sensing [78].

3.5 Neuro-Fuzzy Systems

Integration of fuzzy logic and neural networks boasts a consolidated presence in the scientific literature [79–83]. The motivations behind the success of this kind of combination can be easily assessed by referring to the issues introduced in the previous section. In fact, by means of fuzzy logic it is possible to facilitate the understanding of decision processes and to provide a natural way for the interpretation of linguistic rules. On the other hand, rules in fuzzy systems cannot be acquired automatically. The design process of rules and membership functions is always human-driven and proves to be difficult, especially in the case of complex systems. Additionally, tuning the fuzzy membership functions representing linguistic labels is a very time-consuming process, but it is essential if accuracy is a matter of concern [84].

Neural networks are characterised by somewhat opposite properties. They have the ability to generalise and to learn from data, obtaining knowledge to deal with previously unseen patterns. The learning process is relatively slow for large sets of training data, and additional information about the problem cannot be integrated into the learning procedure in order to simplify it and speed up the computation. A trained neural network can classify patterns accurately, but the decision process is obscure for the user. In fact, the information is encoded in the connections between the neurons, therefore the extraction of structural knowledge from the neural network is very difficult.

Neuro-fuzzy systems make it possible to extract fuzzy rules from data during the knowledge discovery process. Moreover, the membership functions inside each rule can be easily tuned, based on information embedded in the data. In order to perform both tasks, the expert intervention can be avoided by resorting to neural learning, and a training set T of t samples is required. In particular, the i-th sample in the training set is a pair of input/output vectors (xi, yi), therefore T = {(x1, y1), . . . , (xt, yt)}. In case of classification problems, the input vector xi is an m-dimensional vector containing the m measurements of the input features, while the output vector yi is an n-dimensional binary vector, codifying the membership of xi for each of the n classes (i.e., yi is one of the linearly independent basis vectors spanning the R^n space).

In the following, we introduce the peculiar scheme of a neuro-fuzzy model, whose application in text localisation problems will be detailed in the next section.


A Peculiar Scheme for a Neuro-Fuzzy System

The fuzzy component of the neuro-fuzzy system is represented by a particular fuzzy inference mechanism whose general scheme is comparable to the Takagi-Sugeno-Kang (TSK) fuzzy inference method [70]. The fuzzy rule base is composed of K fuzzy rules, where the k-th rule is expressed in the form:

Rk: If x1 is A^(k)_1 and . . . and xm is A^(k)_m then y1 is b^(k)_1 and . . . and yn is b^(k)_n, (17)

where x = (x1, . . . , xm) is the input vector, y = (y1, . . . , yn) is the output vector, (A^(k)_1, . . . , A^(k)_m) are fuzzy sets defined over the elements of the input vector x, and (b^(k)_1, . . . , b^(k)_n) are fuzzy singletons defined over the elements of the output vector y. Each of the fuzzy sets A^(k)_i is defined in terms of a Gaussian membership function µ^(k)_i:

$$
\mu_i^{(k)}(x_i) = \exp\left(-\frac{\bigl(x_i - c_i^{(k)}\bigr)^2}{2\,\bigl(\sigma_i^{(k)}\bigr)^2}\right), \qquad (18)
$$

where c^(k)_i is the centre and σ^(k)_i is the width of the Gaussian function. The rule fulfilment degree of the k-th rule is evaluated using the formula:

$$
\mu^{(k)}(x) = \prod_{i=1}^{m} \mu_i^{(k)}(x_i), \qquad (19)
$$

where the product function is employed to interpret the AND connective. The final output of the fuzzy model can be expressed as:

$$
y_j = \frac{\sum_{k=1}^{K} \mu^{(k)}(x)\, b_j^{(k)}}{\sum_{k=1}^{K} \mu^{(k)}(x)}, \qquad j = 1, \ldots, n. \qquad (20)
$$

In classification tasks, the elements of the output vector y express in the range [0, 1] the membership degrees of the input pattern for each of the classes. In order to obtain a binary output vector y′ = {y′_j}_{j=1}^{n}, the defuzzification of the output vector y is performed as follows:

$$
y'_j =
\begin{cases}
1 & \text{if } y_j = \max(y),\\
0 & \text{otherwise}.
\end{cases}
\qquad (21)
$$

By means of (21), the input pattern is classified according to the highest membership degree.
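For concreteness, the forward pass defined by (18)-(21) can be sketched as follows; the parameter names and array shapes are our own conventions, not part of the original formulation:

```python
import numpy as np

def neuro_fuzzy_forward(x, centres, widths, singletons):
    """Forward pass of the described neuro-fuzzy model, eqs. (18)-(21).
    centres, widths: (K, m) Gaussian parameters of the K rules;
    singletons: (K, n) consequent values b_j^(k). Returns the fuzzy
    output y and its binary defuzzification y'."""
    # Eq. (18): Gaussian membership of each input to each rule's fuzzy sets
    mu = np.exp(-((x[None, :] - centres) ** 2) / (2.0 * widths ** 2))  # (K, m)
    # Eq. (19): rule fulfilment degrees (product as AND connective)
    act = mu.prod(axis=1)                                              # (K,)
    # Eq. (20): normalised weighted average of the singletons
    y = (act @ singletons) / act.sum()                                 # (n,)
    # Eq. (21): winner-takes-all defuzzification
    y_bin = (y == y.max()).astype(int)
    return y, y_bin
```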

The neural component of the neuro-fuzzy system is represented by a particular neural network which reflects in its topology the structure of the previously presented fuzzy inference system. The network is composed of four layers with the following characteristics:


Layer 1 provides the crisp input vector x = (x1, . . . , xm) to the network. This layer does not perform any calculation and the input vector values are simply passed to the second layer.

Layer 2 realises a fuzzification of the input variables. Units in this layer are organised into K distinctive groups. Each group is associated with one of the fuzzy rules, and it is composed of m units, corresponding to the m fuzzy sets in the fuzzy rule. The i-th unit in the k-th group, connected with the i-th neuron in layer 1, evaluates the Gaussian membership degree of the fuzzy set A^(k)_i, according to (18).

Layer 3 is composed of K units. Each of them performs the precondition matching of one of the rules and reports its fulfilment degree, in accordance with (19). The i-th unit in this layer is connected with all units in the i-th group of layer 2.

Layer 4 supplies the final output vector y and is composed of n units. The i-th unit in this layer evaluates the element yi, according to (20). In particular, the fulfilment degrees of the rules are weighted by the fuzzy singletons, which are encoded as the weights of the connections between layer 3 and layer 4.

Figure 6 depicts the structure of the above described neuro-fuzzy network, with reference to a neuro-fuzzy system with two inputs, three rules and two outputs.

Fig. 6. Structure of the neuro-fuzzy network coupled with a neuro-fuzzy system exhibiting two inputs, three rules and two outputs (m = 2, K = 3, n = 2)


Concerning the learning procedure of the neuro-fuzzy network, two distinctive steps are involved. The first one is devoted to discovering the initial structure of the neuro-fuzzy network. Successively, the parameters of the fuzzy rules are refined, so that the overall classification accuracy is improved. During the first step, a clustering of the input data is performed by an unsupervised learning process of the neuro-fuzzy network: each cluster corresponds to one of the nodes in the rule layer of the neuro-fuzzy network. The clustering process is able to derive the proper number of clusters. In fact, a rival penalised mechanism is employed to adaptively determine the suitable structure of the network and therefore the number of fuzzy rules (starting from a guessed number). In this way, an initial knowledge is extracted from data and expressed in the form of a rule base. The obtained knowledge is successively refined during the second step, where a supervised learning process of the neuro-fuzzy network is accomplished (based on a gradient descent technique), in order to attune the parameters of the fuzzy rule base to the numerical data. For the sake of conciseness, we omit further mathematical details concerning the learning algorithms, addressing the reader to [85].

4 Text Localisation: Illustrative Applications

As previously stated, the different techniques for image segmentation present some drawbacks. Classical top-down approaches, based on run-length encoding and projection profiles, are sensitive to skewed text and perform well only with highly structured page layouts. On the contrary, bottom-up approaches are sensitive to font size, scanning resolution, interline and inter-character spacing.

To overcome these problems, the employment of Computational Intelligence methods can be beneficial. Here we detail some of our experiments with the employment of fuzzy and neuro-fuzzy techniques. With reference to the classification directions proposed in this chapter, the first approach we are going to introduce can be classified as a region-based approach, which stands as a preliminary, naive formulation of our research activity [86]. The involved image regions are classified as text or graphic regions, on the basis of their appearance (regularity) and shape. The classification process is realised by employing the peculiar neuro-fuzzy model described in Sect. 3.5.

The second approach proposed is somewhat more involved and it is related to a multi-resolution segmentation scheme, belonging to the category of edge-based bottom-up approaches [87]. Here pixels are classified as text, graphics, or background, in accordance with their grey-level intensity and edge strength values, extracted from different resolution levels. In order to improve the segmentation results obtained from the initial pixel level classification phase, a region level analysis phase is performed. Both steps, namely pixel level analysis and region level analysis, are realised by the employment of the already mentioned neuro-fuzzy methodology.


The third approach, representing an example of texture-based bottom-up approach, is based on a more sophisticated tool for multi-resolution analysis, the Discrete Wavelet Packet Transform [88]. To discriminate between text and non-text regions, the image is transformed into a Wavelet packet analysis tree. Successively, the feature image, exploited for the segmentation of text and non-text regions, is obtained from some of the nodes selected from the quadtree. The most discriminative nodes are derived using an optimality criterion and a genetic algorithm. Finally, the obtained feature image is segmented by means of a Fuzzy C-Means clustering.

All the proposed segmentation approaches have been evaluated using the Document Image Database available from the University of Oulu [89]. This database includes 233 images of articles, scanned from magazines and newspapers, books and manuals. The images vary both in quality and contents: some of them contain text paragraphs only (with Latin and Cyrillic fonts of different sizes), while others contain mixtures of text, pictures, photos, graphs and charts. Moreover, not all the documents are characterised by a regular (Manhattan) page layout.

4.1 Text Region Classification by a Neuro-Fuzzy Approach

The idea at this stage is to exploit a neuro-fuzzy classifier to label the different regions composing a document image. The work assumes that a database of segmented images is available, from which it is possible to extract a set of numerical features.

The first step is a feature extraction process and consists in detecting the skew angle φ of each region as the dominant orientation of the straight lines passing through that region. Inside text regions, which are composed of characters and words, the direction of the text lines is highly regular. This regularity can be captured by means of the Hough transform [22, 90–92]. In particular, the skew angle is detected as the angle for which the Hough transform of a specific region has the maximum value.

The retrieved skew angle φ is used to obtain the projection profile of the document region. The profile is calculated by accumulating the pixel values in the region along its skew angle, so that a one-dimensional projection vector vp is obtained. The elements of vp codify information about the spatial structure of the analysed region. For a text region, vp should have a regular, high-frequency, sinusoidal-like shape, with peaks and valleys corresponding to the text lines and the interline spacings, respectively. In contrast, such regularities cannot be observed when a graphics region is considered. To measure the regularity of the vp vector, a Power Spectral Density (PSD) [22] analysis is performed. Actually, for large paragraphs of text, the PSD coefficients show a significant peak around the frequency value corresponding approximately to the number of text lines in the region. For graphic regions, instead, the spectrum presents only a few peaks (one or two) around the lowest frequency values. A vector vpsd of PSD coefficients is calculated as follows:

$$
v_{psd} = |FT(v_p)|^2, \qquad (22)
$$

where FT(·) denotes the Fourier Transform [93]. An illustrative projection profile and its PSD spectrum for a sample text region are presented in Fig. 7.

Fig. 7. A region of a document image (a), its projection profile calculated for a skew angle of 90 degrees (b) and the PSD spectrum of the profile (c)

Generally, the number of components of the PSD spectrum vector vpsd is too large for it to be used directly as a feature vector for classification tasks. In order to reduce the dimensionality of vpsd, it can be divided into a number of intervals. In particular, we considered intervals of different lengths, corresponding to the Fibonacci sequence scaled by a multiplying factor of two (i.e., 2, 4, 6, 10, 16, 26, 42, . . .). In this way, we are able to preserve and exploit most of the information accumulated in the first part of the PSD spectrum. For each interval, the maximum value of vpsd is derived, and the obtained maxima (normalised with respect to the highest one) represent the first seven components of the feature vector vf, which is employed in the successive region classification stage. To increase classification accuracy, statistical information concerning the connectivity of the analysed region is extracted, thus extending the number of features. At the end of the overall feature extraction process, every region of the segmented document image is represented as a feature vector vf with ten elements, which is used for classification purposes.
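A possible implementation of this feature extraction step is sketched below; de-skewing of the region is assumed already done, and discarding the DC component of the spectrum is our own simplification, not stated in the text:

```python
import numpy as np

def psd_features(region):
    """PSD-based features of a (de-skewed) binary region, following (22).
    The interval bounds follow the doubled Fibonacci lengths mentioned
    in the text: 2, 4, 6, 10, 16, 26, 42."""
    vp = region.sum(axis=1).astype(float)        # projection profile
    vpsd = np.abs(np.fft.fft(vp)) ** 2           # eq. (22)
    vpsd = vpsd[1:len(vpsd) // 2]                # keep one half, drop DC
    bounds = np.cumsum([2, 4, 6, 10, 16, 26, 42])
    starts = np.concatenate(([0], bounds[:-1]))
    maxima = np.array([vpsd[s:e].max()
                       for s, e in zip(starts, bounds) if e <= len(vpsd)])
    return maxima / maxima.max()                 # normalise to highest peak
```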

The final step is the classification of the regions described in terms of the feature vector vf. Such a classification process is performed by means of the neuro-fuzzy system introduced in Sect. 3.5. In the experimental session concerning image region classification, the input vector x involved in the fuzzy inference model corresponds to the ten-dimensional feature vector vf, derived during the feature extraction process. The output vector y is related to the classes of the classification task (i.e., textual and graphical regions).

The overall algorithm can be summarised as follows. For each region:

1. Calculate the skew angle φ by means of the Hough transform.
2. Obtain the projection profile vp of the region along φ.
3. Calculate vpsd from vp.
4. Obtain vf by dividing vpsd into intervals.
5. Classify the region as text or graphics on the basis of vf, by means of the neuro-fuzzy inference.

4.2 Text Localisation by a Neuro-Fuzzy Segmentation

The idea at this stage consists in exploiting a neuro-fuzzy classifier for achieving both the segmentation of a document image and the final labelling of the derived regions. The described work is related to an edge-based approach for document segmentation, aiming at the identification of text, graphics and background regions. The overall methodology is based on the execution of two successive steps, working at different levels, configuring a bottom-up approach. In particular, an edge-based pre-processing step concerns a pixel level analysis, devoted to a preliminary classification of each image pixel into one of the previously described general classes. From the results of this phase, coherent regions are obtained by a merging procedure. To refine the obtained segmentation, an additional post-processing is performed at region level, on the basis of shape regularity and skew angle analysis. This post-processing phase is beneficial for obtaining a final accurate segmentation of the document image. The peculiarity of the proposed approach relies on the employment of the neuro-fuzzy system both in the pre-processing pixel level analysis and in the post-processing region level refinement.

Low-Level Pixel Analysis

The aim of the low-level pixel analysis is to classify each pixel of a document image f(x, y) into the text, background or graphics category, according to its grey level and edge strength values. When extracting features from image data, the type of information that can be obtained may be strongly dependent on the scales at which the feature detectors are applied [94]. This can be perceptually verified with ease: when an image is viewed from near to far, the edge strength of a pixel decreases in general, but the relative decreasing rates for contour, regular and texture points are different. Moving from this kind of observation, we followed a multi-scale analysis of the image: assuming that an image f(x, y) is given, let R be the number of scale representations considered for our analysis. In this way, a set of images {f^(1)(x, y), . . . , f^(R)(x, y)} is involved, and an edge map e(x, y) can be obtained from each image by means of the Sobel operator [22]. We have represented the images f(x, y) and e(x, y) as Gaussian pyramids with R different resolution levels. In the pyramid, the image at level r + 1 is generated from the image at level r by down-sampling by a factor of 2. Therefore, a set of edge maps {e^(1)(x, y), . . . , e^(R)(x, y)} is generated during the creation of the pyramids and associated with the set of multi-scaled images.

By examining the luminosity and edge strength information of the image at different resolution levels, it is possible to formulate a set of rules that enables the pixel classification. In this way, a pixel (x, y) is characterised by a feature vector of length 2R, containing information about intensity and edge strength at different resolution levels. Such a feature vector vxy can be formalised as:

$$
v_{xy} = \bigl( f^{(1)}(x, y),\, f^{(2)}(x/2, y/2),\, \ldots,\, f^{(R)}(x/2^{R-1}, y/2^{R-1}),
$$
$$
e^{(1)}(x, y),\, e^{(2)}(x/2, y/2),\, \ldots,\, e^{(R)}(x/2^{R-1}, y/2^{R-1}) \bigr). \qquad (23)
$$

In order to derive a set of applicable rules encoding accurate information, we exploited the neuro-fuzzy system introduced in Sect. 3.5, which automatically derives a fuzzy rule base from a training set of manually labelled pixels. In this case, the neuro-fuzzy network has 2R inputs (corresponding to the elements of the vector vxy), while the three output classes correspond to the recognised pixel categories (text, background, graphics).

The obtained fuzzy rule base is applied to perform the pixel classification process, which ultimately produces three binary images: btex(x, y), bgra(x, y) and bbac(x, y). The images are composed of the pixel candidates of text, graphics and background regions, respectively. In order to obtain more coherent regions, a merging procedure is applied to each of the binary images, on the basis of a set of predefined morphological operations (including well-known image processing techniques, such as erosion, dilation and hole filling [95]).
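The pixel-level feature vectors (23) can be assembled as follows; we use a simple Gaussian smoothing followed by down-sampling in place of a full Gaussian pyramid implementation, and nearest-neighbour up-sampling to align the levels, which are our own simplifications:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel, zoom

def pixel_features(image, R=3):
    """Assemble the 2R-dimensional pixel feature vectors of (23): grey
    level and Sobel edge strength at R pyramid levels. Smoothing before
    down-sampling approximates the Gaussian pyramid construction."""
    levels = [image.astype(float)]
    for _ in range(R - 1):
        levels.append(gaussian_filter(levels[-1], sigma=1.0)[::2, ::2])
    edges = [np.hypot(sobel(f, axis=1), sobel(f, axis=0)) for f in levels]
    h, w = image.shape
    # Up-sample each map back to the original grid, so that every pixel
    # (x, y) receives its own feature vector v_xy
    maps = [zoom(m, (h / m.shape[0], w / m.shape[1]), order=0)
            for m in levels + edges]
    return np.stack(maps, axis=-1)               # shape (h, w, 2R)
```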

High-Level Region Analysis

The high-level region analysis is intended to provide a refinement of the text information extraction process. In other words, this step aims at detecting and correcting text regions misclassified during the previous analysis. To do that, the shape properties of every text region are analysed as follows. By examining the image btex, containing the text regions, we can firstly extract a number of connected components {E_t}_{t=1}^{T} representing the text regions to be analysed. In particular, we are interested in processing the images composed of the pixels representing the perimeter of each region Et. Each of them is mapped by the Hough transform from the spatial coordinates of Et(x, y) to the polar coordinates of Ht(d, θ), where d denotes the distance from a line to the origin, and θ ∈ [0, π) is the angle between the line and the x axis. The one-dimensional function

$$
h(\theta) = \max_{d} H_t(d, \theta), \qquad (24)
$$

(which is evaluated for each value of θ) contains information about the angles of the most dominant lines in the region Et. In general, for a rectangular region with a skew angle of α degrees, the plot of h(θ) has two significant maxima, located at

$$
\theta_1 = \alpha \ \text{degrees}, \qquad \theta_2 = \alpha + 90 \ \text{degrees}, \qquad (25)
$$

corresponding to the principal axes of the region. The presence or absence of such maxima is exploited to classify each text region as rectangular or non-rectangular, respectively.
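The function h(θ) of (24) is simply the column-wise maximum of the Hough accumulator. The sketch below relies on scikit-image's hough_line; the normalisation anticipates the size-invariance requirement discussed next:

```python
import numpy as np
from skimage.transform import hough_line

def h_of_theta(perimeter):
    """h(theta) of (24): the maximum of the Hough accumulator over the
    distance axis, for a binary image of a region's perimeter pixels."""
    theta = np.deg2rad(np.arange(0.0, 180.0))    # angles in [0, pi)
    accumulator, angles, dists = hough_line(perimeter, theta=theta)
    h = accumulator.max(axis=0).astype(float)    # max over d, per angle
    return h / h.max(), np.rad2deg(angles)

# For a rectangular region, h(theta) exhibits two dominant peaks, at the
# skew angle alpha and at alpha + 90 degrees, as predicted by (25).
```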

To obtain a set of linguistic rules suitable for this novel classification task, the neuro-fuzzy model adopted for classifying the image pixels is employed once again. In this case, the input vector x is defined in terms of 20 elements, which synthetically describe the information content of h(θ). In particular, the normalised values of h(θ) have been divided into 20 intervals of equal length, and the elements of x represent the mean values of h(θ) in each interval. The number of intervals has been empirically selected as a compromise between the length of the input vector (thus, the complexity of the neuro-fuzzy network structure) and the amount of information required for the classification task (i.e., its accuracy). Moreover, h(θ) has been normalised, as the amplitude of the function carries information about the size of the region, which is irrelevant in this particular case and would hamper the classification process. The region Et under analysis can be ultimately classified into one of two possible output classes: non-rectangular shape (in this case Et is definitively labelled as a graphics region) and rectangular shape. This latter case opens the way for an analysis performed over the skew angle value. In particular, the skew angle αt of a region Et is chosen as the minimum angle value θ1t (see (25)), while the overall skew angle φ of the document is chosen as the most frequently occurring skew angle among all rectangular regions. Successively, a simple thresholding is applied: if |αt − φ| is greater than some small angle β, then the rectangular region Et is re-classified as a graphics region; otherwise, Et retains its original text classification. Finally, graphics regions are recursively enlarged by bounding boxes surrounding them, which are aligned according to φ.

The overall proposed algorithm can be summarised as follows:

For an input document image f(x, y):

1. Create a Gaussian pyramid {f^(1)(x, y), . . . , f^(R)(x, y)}.
2. For each level f^(i)(x, y) of the pyramid, apply the Sobel operator to calculate its edge image e^(i)(x, y).
3. Classify each pixel of the image as text, graphics or background according to the values of luminosity and edge strength in the pyramid. Create the three binary images btex(x, y), bgra(x, y) and bbac(x, y) according to the classification results.
4. Process btex(x, y) and bgra(x, y): apply a median filter, apply dilation, remove small holes from the regions, apply erosion.
5. For each connected component Et in btex, obtain its perimeter (by removing interior pixels) and calculate its skew angle αt. Additionally, classify Et as rectangular or non-rectangular.
6. Calculate a histogram containing the skew angles of the connected components classified as rectangular. The most frequently occurring value is chosen as the overall skew angle φ.
7. For each connected component Et: if it is non-rectangular or it is not aligned with the overall skew angle, then reclassify it as a graphics region: btex(x, y) = btex(x, y) ∧ ¬Et(x, y), bgra(x, y) = bgra(x, y) ∨ Et(x, y).
8. Enlarge the graphics regions in bgra with bounding boxes aligned to φ.
9. Set the binary image of the background as bbac(x, y) = ¬(btex(x, y) ∨ bgra(x, y)).

4.3 Text Localisation by Wavelet Packet Segmentation

In this section we propose our methodology for document page segmentation into text and non-text regions based on the Discrete Wavelet Packet Transform. This approach represents an extension of the work presented in Sect. 4.2, which is based on Gaussian image pyramids. In fact, two-dimensional Wavelet analysis is a more sophisticated tool for multi-resolution analysis, if compared to image pyramids.

The main concern of the methodology is the automatic selection of the Wavelet packet coefficients describing text or background regions. The Wavelet packet decomposition acts as a set of band-pass filters, allowing frequencies in the image to be localised much better than with the standard Wavelet decomposition. The goal of the proposed feature extraction process is to obtain a basis for the Wavelet sub-bands that exhibits the highest discrimination power between text and non-text regions. This stage is realised by analysing the quadtree obtained by applying the Wavelet packet transform to a given image. In particular, the most discriminative nodes are selected among all the nodes {c_i}_{i=1}^{|τ|} in the quadtree τ, where |τ| = Σ_{j=0}^{d−1} 2^{2j} is the total number of nodes in a quadtree of depth d. This process is based on ground truth segmentation data.

Coefficient Extraction

Given an image f(x, y), the initial step consists in decomposing it using the Wavelet packet transform, so that the quadtree τ of Wavelet coefficients is obtained. An example of the decomposition is depicted in Fig. 8, where the coefficients of the nodes at each decomposition level are displayed as sub-images. By visually analysing the figure, it can be observed that some of the sub-images appear to be more discriminating between text and non-text areas.

Fig. 8. DWPT decomposition of the image (a) at levels 1–2 (b–c). Each subimage in (b–c) is a different node of the DWPT tree

To quantitatively evaluate the effectiveness of a node ci ∈ τ (associated with a matrix of Wavelet coefficients) in discriminating between text and non-text, the following procedure is performed. At first, the Wavelet coefficients ci are represented in terms of their absolute values |ci|, because the discrimination power does not depend on the coefficient signs. Then, the coefficients are divided into the sets Ti (text coefficients) and Ni (non-text coefficients), on the basis of the known ground truth segmentation of the image f(x, y).


For each set Ti and Ni, the mean and variance values are calculated, denoted as µ_i^T and σ_i^T for text and µ_i^N and σ_i^N for non-text, respectively. After that, the discrimination power Fi of the node ci is evaluated using the following optimality criterion, based on Fisher's criterion [96]:

$$
F_i = \frac{\bigl(\mu_i^T - \mu_i^N\bigr)^2}{\sigma_i^T + \sigma_i^N}. \qquad (26)
$$

To a certain extent, Fi measures the signal-to-noise ratio between the text and non-text classes. The nodes with maximum inter-class distance and minimum intra-class variance have the highest discrimination power.

The simplest approach to obtain the best set of nodes, denoted as υ ⊂ τ, is to select the smallest number of nodes which have the highest discrimination power. Then, a feature image f′(x, y) can be obtained from the selected nodes υ. In particular, the Wavelet coefficients of the set υ are rescaled to the size of the image f(x, y) and then added together:

$$
f'(x, y) = \sum_{c_i \in \upsilon} c'_i(x, y), \qquad (27)
$$

where c′_i(x, y) denotes the |ci| values rescaled to match the size of the original image f(x, y). Even if this approach for obtaining υ is fast and simple, it is not an optimal technique for maximising the signal-to-noise ratio between the text and non-text classes. Moreover, the optimal number of nodes to be chosen for υ is unknown and must be selected manually.

The problem of selecting the best nodes among all the available ones is a combinatorial problem, producing an exponential explosion of possible solutions. We propose to solve this problem by employing a genetic algorithm [97, 98]. In particular, each node ci ∈ τ is associated with a binary weight wi ∈ {0, 1}, so that the tree τ is associated with a vector of weights W = [w1, . . . , wi, . . . , w|τ|]. Consequently, the subset of the best nodes is defined as υ = {ci ∈ τ : wi = 1}.

Given a weight vector W for the nodes, the feature image f′ is calculated as follows:

$$
f'(x, y, W) = \sum_{i=1}^{|\tau|} w_i\, c'_i(x, y). \qquad (28)
$$

The discrimination power F of the subset υ can be computed by extending (26), evaluating the mean values µ^T, µ^N and the deviation values σ^T, σ^N of the values in the feature image f′ corresponding to text regions (T superscript) and non-text regions (N superscript):

$$
F = \frac{\bigl(\mu^T - \mu^N\bigr)^2}{\sigma^T + \sigma^N}. \qquad (29)
$$

To find the optimal subset υ by means of (28), a genetic algorithm is applied in order to maximise the cost function F. Initially, a random population of K weight vectors {Wi : i = 1, . . . , K}, represented as binary strings, is created. Successively, for each weight vector the feature image is calculated and its cost function is evaluated using (29). The best individuals are subject to crossover and mutation operators in order to produce the next generation of weight vectors. Finally, the optimal subset υ is found from the best individuals in the evolved vector population.

Finally, the feature image f′(x, y) is obtained by merging the set of coefficients in the nodes υ, as described in (27) or (28).
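This node selection can be sketched as a plain generational genetic algorithm over binary strings. The defaults below match the experimental settings reported in Sect. 5 (population of 20, 50 generations, 80% crossover, 20% mutation); the selection, crossover and mutation mechanisms themselves are our assumptions, since the text does not specify them:

```python
import numpy as np

def fisher_cost(w, node_images, text_mask):
    """Eq. (29) for a binary weight vector w: build the feature image (28)
    and measure the separation between text and non-text pixel values."""
    f = np.tensordot(w, node_images, axes=1)     # weighted sum of the nodes
    t, n = f[text_mask], f[~text_mask]
    return (t.mean() - n.mean()) ** 2 / (t.std() + n.std() + 1e-12)

def select_nodes(node_images, text_mask, pop=20, gens=50,
                 p_cross=0.8, p_mut=0.2, seed=0):
    """Generational GA over binary node-selection strings.
    node_images: (n_nodes, h, w) rescaled |c_i| maps; text_mask: (h, w)."""
    rng = np.random.default_rng(seed)
    n = node_images.shape[0]
    P = rng.integers(0, 2, size=(pop, n))
    for _ in range(gens):
        fit = np.array([fisher_cost(w, node_images, text_mask) for w in P])
        def tournament():
            i, j = rng.integers(0, pop, size=2)
            return P[i].copy() if fit[i] >= fit[j] else P[j].copy()
        children = []
        while len(children) < pop:
            p1, p2 = tournament(), tournament()
            if rng.random() < p_cross:           # single-point crossover
                cut = rng.integers(1, n)
                p1[cut:], p2[cut:] = p2[cut:].copy(), p1[cut:].copy()
            for c in (p1, p2):
                if rng.random() < p_mut:         # bit-flip mutation
                    c[rng.integers(0, n)] ^= 1
                children.append(c)
        P = np.array(children[:pop])
    fit = np.array([fisher_cost(w, node_images, text_mask) for w in P])
    return P[fit.argmax()]                       # best weight vector found
```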

5 Experimental Results and Discussion

To test the effectiveness of the presented methodology, we have employed a publicly available document image database [89]. In particular, the preliminary region-based approach presented first has been tested on 306 graphic regions and 894 text regions, which were extracted from the database and automatically labelled. The extracted feature vectors were divided into a training set composed of 900 samples and a testing set composed of the remaining 300 observations. The proportions between text and graphics regions were preserved in both datasets.

A set of 12 fuzzy rules has been extracted from the training set by means of the unsupervised neuro-fuzzy learning procedure previously detailed. Successively, the rules have been refined using the gradient-descent technique of back-propagation. Table 1 reports the classification accuracy over the training and testing sets produced by the neuro-fuzzy system, both for the initial and the refined rule base. The classification results are satisfactory in terms of accuracy. However, the most common error is the misclassification of short text regions (one or two lines of text), as can also be observed in Fig. 9. The main reason is the insufficient regularity of the projection profiles of such regions. Nevertheless, the strong points of the proposed method are its ability to process skewed documents and its invariance to font shape and size.

Table 1. Overall classification accuracy of the document regions

                          Number of rules   Training set (%)   Test set (%)
Initial fuzzy rule base   12                95.71              93.53
Refined fuzzy rule base   12                95.80              93.60

Fig. 9. Classification results obtained for two sample images. Dark regions have been classified as text, while light regions have been classified as graphics

The second approach proposed has been tested using 40 images related to magazines and newspapers, drawn from the Oulu document image database. For the purpose of pixel classification, a three-level Gaussian pyramid was built from the original image. From the knowledge extraction process performed by the neuro-fuzzy system over a pre-compiled training set, a fuzzy rule base comprising 12 rules has been obtained. Table 2 reports the accuracy of the pixel classification process (considering both a training and a testing set); the classification results for an illustrative image from the database are presented in Fig. 10.

Table 2. Pixel level classification accuracy

Data set   Text (%)   Graphics (%)   Background (%)
Training   91.54      85.42          93.33
Testing    91.54      86.05          95.66

The further application of the neuro-fuzzy system, during the high-level analysis, was performed over a pre-compiled training set including the feature vector information related to 150 regions. The obtained rule base comprises 10 fuzzy rules, and its classification accuracy is reported in Table 3, considering both training and testing sets. The final segmentation results for the previously considered sample image are presented in Fig. 11.

The accuracy of the method can be quantitatively measured using the ground truth knowledge deriving from the correct segmentation of the 40 images employed. The effectiveness of the overall process is expressed by a measure of segmentation accuracy Pa, defined as:

$$
P_a = \frac{\text{Number of correctly segmented pixels}}{\text{Number of all pixels}} \times 100\%. \qquad (30)
$$

Fig. 10. Classification of the pixels of an input image (a) into text (b), graphics (c) and background (d) classes

Table 3. Region level classification accuracy

Data set   Rectangular (%)   Non-rectangular (%)
Training   97.43             92.85
Testing    94.11             93.93

Table 4 reports the mean values of segmentation accuracy obtained over the entire set of analysed images, distinguishing among the different methodology steps. The apparently poor results obtained at the end of the pixel classification step are due to the improper identification of text regions (only the pixels corresponding to the words are classified as text). The effectiveness of the initial stage of pixel classification is demonstrated by the rapid increase of the accuracy values achieved in the subsequent merging process.

Fig. 11. Final segmentation of a sample image (a) into text (b), graphics (c) and background (d) regions

Table 4. Overall segmentation accuracy expressed in terms of Pa. "PC" and "MO" stand for Pixel Classification and Morphological Operation, respectively

          Text (%)   Graphics (%)   Bckgr (%)   Image (%)
PC        59.92      88.32          52.93       50.59
PC + MO   96.65      90.63          93.26       90.27
Final     98.19      96.36          97.99       97.51

The quantitative measure of the segmentation accuracy allows for a comparison with other existing techniques. As an example, we can compare the results illustrated in Table 4 with those reported in [17], where a polynomial spline Wavelet approach has been proposed and the same kind of measure has been employed to quantify the overall accuracy. In particular, the best results in [17] achieved an accuracy of 98.29%. Although our methodology produced slightly lower accuracy results, it should be observed that we analysed a total number of 40 images, instead of the 6 images considered in [17]. Finally, it can be noted that our approach may be extended to colour documents using the HSV system [22]. In this case, the Gaussian pyramid could be evaluated for the H and S components, and the edge information for the V component.

The texture-based approach presented last has been tested on 40 images extracted from the Oulu database. In order to obtain the feature images, each image has been decomposed by Daubechies db2 Wavelet functions [59] into three levels of coefficients. One of these document images has been manually segmented to create the ground truth segmentation data. The best nodes have been selected by means of a basic genetic algorithm [97, 98] with an initial population of 20 weight vectors. New generations of the vector population have been produced by crossover (80%) and mutation (20%) operators. After 50 generations, the best subset of nodes has been obtained, containing 39 out of all 85 nodes. Additionally, it should be noted that more than one image can be combined into one larger image for the purpose of the node selection.

Using the selected nodes, the feature images f′(x, y) have been evaluated for each considered image. Then, we applied the Fuzzy C-Means algorithm [74] to each image f′(x, y), in order to group its pixels into two clusters, corresponding to text and non-text regions. The final segmented image has been obtained by replacing each pixel of f′(x, y) with its cluster label. As the clustering is not performed in the image space but in the feature space, an additional post-processing is necessary to refine the segmentation. In particular, a median filter is applied to remove small noisy regions, while preserving the edges of larger regions. Successively, a morphological closing is applied on the filtered image, in order to merge nearby text regions (i.e. letters, words, text lines) into larger ones (i.e. paragraphs, columns). Figure 12 shows an example of a feature image, obtained from a document page, and its final segmentation.

The percentage of segmentation accuracy has been evaluated by the measure Pa previously described. For this purpose, the ground truth segmentation of each image has been obtained automatically, according to the additional information in the database. Moreover, to test the robustness of the method against page skew, some of the images have been randomly rotated. The obtained segmentation accuracy has an average value of 92.63%, with a highest value of 97.18% and a lowest value of 84.37%. Some results are shown in Fig. 13. The results are comparable with other state-of-the-art document image segmentation techniques. Once again, we report as an example that the best result obtained in [17] is 98.29% (over only 6 images considered). Anyway, the approach proves to be robust against page skew and provides good results when dealing with images presenting different font sizes and styles.

Fig. 12. Document image (a), its corresponding feature image (b) and segmentation result (c)


Fig. 13. Segmentation results: segmentation of the document image (a), invariance to page skew (b) and invariance to font changes (c)

6 Conclusions

Text information represents a very important component among the contents of a digital image. The importance of achieving text information by means of image analysis is straightforward. In fact, text can be variously used to describe the content of a document image, and it can be converted into electronic format (for memorisation and archiving purposes). In particular, different steps can be isolated, corresponding to the sequential sub-problems which characterise the overall text information extraction task. In this chapter, we addressed the specific problem of text localisation.

The peculiarity of the present work consists in discussing text localisation methods based on the employment of fuzzy techniques. When dealing with text localisation, we are particularly involved with the problem of digital image segmentation, and the adoption of the fuzzy paradigm is desirable in such a research field. That is due to the uncertainty and imprecision present in images, deriving from noise, image sampling, lighting variations and so on. Fuzzy theory provides a mathematical tool to deal with this imprecision and ambiguity in an elegant and efficient way. Fuzzy techniques can be applied to different phases of the segmentation process; additionally, fuzzy logic allows the knowledge about the given problem to be represented in terms of linguistic rules with meaningful variables, which is the most natural way to express and interpret information.

After reviewing a number of classical image segmentation methods, we provided a presentation of fuzzy techniques which commonly find application in the context of digital image processing. In particular, we showed the benefits coming from the fruitful integration of fuzzy logic and neural computation, and we introduced a particular model for a neuro-fuzzy system. By doing so, we indicated a way to combine Computational Intelligence methods and document image analysis. Actually, a number of research works of ours have been illustrated as examples of applications of fuzzy and neuro-fuzzy techniques for text localisation in images.

The presentation of these research works is intended to focus the interest of the reader on the possibilities of these innovative methods, which are by no means exhausted by the hints provided in this chapter. In fact, a number of future research lines can be addressed, ranging from the analysis of different image features (such as colour) to the direct application of Computational Intelligence mechanisms to deal with the large amount of web image contents.

References

1. Colombo C, Del Bimbo A, Pala P (1999) IEEE Multimedia 6(3):38–53
2. Long F, Zhang H, Feng D (2003) Fundamentals of content-based image retrieval, in: Feng D, Siu WC, Zhang HJ (eds.) Multimedia information retrieval and management – technological fundamentals and applications. Springer, Berlin Heidelberg New York
3. Yang M, Kriegman D, Ahuja N (2002) IEEE Trans Pattern Anal Mach Intell 24(1):34–58
4. Dingli A, Ciravegna F, Wilks Y (2003) Automatic semantic annotation using unsupervised information extraction and integration, in: Proceedings of SemAnnot workshop
5. Djioua B, Flores JG, Blais A, Descles JP, Guibert G, Jackiewicz A, Priol FL, Nait-Baha L, Sauzay B (2006) EXCOM: An automatic annotation engine for semantic information, in: Proceedings of FLAIRS conference, pp. 285–290
6. Orasan C (2005) Automatic annotation of corpora for text summarisation: A comparative study, in: Computational linguistics and intelligent text processing, volume 3406/2005. Springer, Berlin Heidelberg New York
7. Karatzas D, Antonacopoulos A (2003) Two approaches for text segmentation in web images, in: Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR 2003), IEEE Computer Society Press, Cambridge, UK, pp. 131–136
8. Jung K, Kim K, Jain A (2004) Pattern Recognit 37:977–997
9. Chen D, Odobez J, Bourlard H (2002) Text segmentation and recognition in complex background based on Markov random field, in: Proceedings of International Conference on Pattern Recognition, pp. 227–230
10. Li H, Doermann D, Kia O (2000) IEEE Trans Image Process 9(1):147–156
11. Li H, Doermann D (2000) Superresolution-based enhancement of text in digital video, in: Proceedings of International Conference on Pattern Recognition, pp. 847–850
12. Li H, Kia O, Doermann D (1999) Text enhancement in digital video, in: Proceedings of SPIE, Document Recognition IV, pp. 1–8
13. Sato T, Kanade T, Hughes E, Smith M (1998) Video OCR for digital news archive, in: Proceedings of IEEE Workshop on Content-based Access of Image and Video Databases, pp. 52–60
14. Zhou J, Lopresti D, Lei Z (1997) OCR for world wide web images, in: Proceedings of SPIE on Document Recognition IV, pp. 58–66
15. Zhou J, Lopresti D, Tasdizen T (1998) Finding text in color images, in: Proceedings of SPIE on Document Recognition V, pp. 130–140
16. Ching-Yu Y, Tsai WH (2000) Signal Process: Image Commun 15(9):781–797
17. Deng S, Latifi S, Regentova E (2001) Document segmentation using polynomial spline wavelets, Pattern Recognit 34:2533–2545
18. Lu Y, Shridhar M (1996) Character segmentation in handwritten words, Pattern Recognit 29(1):77–96
19. Mital D, Leng GW (1995) J Microcomput Appl 18(4):375–392
20. Rossant F (2002) Pattern Recognit Lett 23(10):1129–1141
21. Xiao Y, Yan H (2003) Text extraction in document images based on Delaunay triangulation, Pattern Recognit 36(3):799–809
22. Pratt W (2001) Digital image processing (3rd edition). Wiley, New York, NY
23. Haralick R (1979) Proc IEEE 67:786–804
24. Haralick R, Shanmugam K, Dinstein I (1973) Textural features for image classification, IEEE Trans Syst Man Cybern 3:610–621
25. Baird H, Jones S, Fortune S (1990) Image segmentation by shape-directed covers, in: Proceedings of International Conference on Pattern Recognition, pp. 820–825
26. Nagy G, Seth S, Viswanathan M (1992) Method of searching and extracting text information from drawings, Computer 25:10–22
27. O'Gorman L (1993) IEEE Trans Pattern Anal Mach Intell 15:1162–1173
28. Kose K, Sato A, Iwata M (1998) Comput Vis Image Underst 70:370–382
29. Wahl F, Wong K, Casey R (1982) Graph Models Image Process 20:375–390
30. Jain A, Yu B (1998) IEEE Trans Pattern Anal Mach Intell 20:294–308
31. Pavlidis T, Zhou J (1992) Graph Models Image Process 54:484–496
32. Hadjar K, Hitz O, Ingold R (2001) Newspaper page decomposition using a split and merge approach, in: Proceedings of Sixth International Conference on Document Analysis and Recognition
33. Jiming L, Tang Y, Suen C (1997) Pattern Recognit 30(8):1265–1278
34. Rosenfeld A, la Torre PD (1983) IEEE Trans Syst Man Cybern SMC-13:231–235
35. Sahasrabudhe S, Gupta K (1992) Comput Vis Image Underst 56:55–65
36. Sezan M (1985) Graph Models Image Process 29:47–59
37. Yanni M, Horne E (1994) A new approach to dynamic thresholding, in: Proceedings of EUSIPCO'94: 9th European Conference on Signal Processing, volume 1, pp. 34–44
38. Sezgin M, Sankur B (2004) J Electron Imaging 13(1):146–165
39. Kamel M, Zhao A (1993) Graph Models Image Process 55(3):203–217
40. Solihin Y, Leedham C (1999) Integral ratio: A new class of global thresholding techniques for handwriting images, IEEE Trans Pattern Anal Mach Intell 21:761–768
41. Trier O, Jain A (1995) Goal-directed evaluation of binarization methods, IEEE Trans Pattern Anal Mach Intell 17:1191–1201
42. Bow ST (2002) Pattern Recognition and Image Preprocessing (2nd edition). Dekker, New York, NY
43. Jung K, Han J (2004) Pattern Recognit Lett 25(6):679–699
44. Ohya J, Shio A, Akamatsu S (1994) IEEE Trans Pattern Anal Mach Intell 16(2):214–224
45. Wu S, Amin A (2003) Proceedings of Seventh International Conference on Document Analysis and Recognition, volume 1, pp. 493–497
46. Canny J (1986) IEEE Trans Pattern Anal Mach Intell 8(6):679–698
47. Chen D, Shearer K, Bourlard H (2001) Text enhancement with asymmetric filter for video OCR, in: Proceedings of International Conference on Image Analysis and Processing, pp. 192–197
48. Hasan Y, Karam L (2000) IEEE Trans Image Process 9(11):1978–1983
49. Lee SW, Lee DJ, Park HS (1996) IEEE Trans Pattern Anal Mach Intell 18(10):1045–1050
50. Grigorescu SE, Petkov N, Kruizinga P (2002) IEEE Trans Image Process 11(10):1160–1167
51. Livens S, Scheunders P, van de Wouwer G, Van Dyck D (1997) Wavelets for texture analysis, an overview, in: Proceedings of the Sixth International Conference on Image Processing and Its Applications, pp. 581–585
52. Tuceryan M, Jain AK (1998) Texture analysis, in: Chen CH, Pau LF, Wang PSP (eds.) The Handbook of Pattern Recognition and Computer Vision (2nd edition), World Scientific Publishing, River Edge, NJ, pp. 207–248
53. Jain A, Bhattacharjee S (1992) Mach Vision Appl 5:169–184
54. Acharyya M, Kundu M (2002) IEEE Trans Circ Syst Video Technol 12(12):1117–1127
55. Etemad K, Doermann D, Chellappa R (1997) IEEE Trans Pattern Anal Mach Intell 19(1):92–96
56. Mao W, Chung F, Lanm K, Siu W (2002) Hybrid Chinese/English text detection in images and video frames, in: Proceedings of International Conference on Pattern Recognition, volume 3, pp. 1015–1018
57. Coifman R, Wickerhauser V (1992) IEEE Trans Inf Theory 38(2):713–718
58. Coifman RR (1990) Wavelet analysis and signal processing, in: Auslander L, Kailath T, Mitter SK (eds.) Signal Processing, Part I: Signal Processing Theory, Springer, Berlin Heidelberg New York, pp. 59–68. URL citeseer.ist.psu.edu/coifman92wavelet.html
59. Daubechies I (1992) Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics), Soc for Industrial & Applied Math
60. Bruce A, Gao HY (1996) Applied Wavelet Analysis with S-Plus, Springer, Berlin Heidelberg New York

61. Mallat SG (1989) IEEE Trans Pattern Anal Mach Intell 11(7):674–69362. Engelbrecht A (2003) Computational Intelligence: An Introduction, WileyNew

York, NY63. Sincak P, Vascak J (eds.) (2000) Quo vadis computational intelligence?, Physica-

Verlag64. Zadeh L (1965) Inform Control 8:338–35365. Klir G, Yuan B (eds.) (1996) Fuzzy sets, fuzzy logic, and fuzzy systems: selected

papers by Lotfi A. Zadeh, World Scientific Publishing, River Edge, NJ66. Pham T, Chen G (eds.) (2000) Introduction to Fuzzy Sets, Fuzzy Logic, and

Fuzzy Control Systems, CRC , Boca Raton, FL67. Jawahar C, Ray A (1996) IEEE Signal Process Lett 3(8):225–22768. Jin Y (2003) Advanced Fuzzy Systems Design and Applications, Physica/

Springer, Heidelberg69. Mamdani E, Assilian S (1975) Int J Man-Mach Studies 7(1):1–1370. Sugeno M, Kang G (1988) Structure identification of fuzzy model, Fuzzy Sets

Syst 28:15–3371. Dubois D, Prade H (1996) Fuzzy Sets Syst 84:169–185

Page 278: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

270 P. Gorecki et al.

72. Leekwijck W, Kerre E (1999) Fuzzy Sets Syst 108(2):159–17873. Dunn J (1974) J Cybern 3:32–5774. Bezdek J (1981) Pattern Recognition with Fuzzy Objective Function Algorithms

(Advanced Applications in Pattern Recognition), Springer, Berlin HeidelbergNew York URL http://www.amazon.co.uk/exec/obidos/ASIN/0306406713/

citeulike-2175. Macqueen J (1967) Some methods of classification and analysis of multivariate

observations, in: Proceedings of the Fifth Berkeley Symposium on MathemticalStatistics and Probability, pp. 281–297

76. Pham D (2001) Comput Vis Image Underst 84:285–29777. Bezdek J, Hall L, Clarke L (1993) Med Phys 20:1033–104878. Rignot E, Chellappa R, Dubois P (1992) IEEE Trans Geosci Remote Sensing

30(4):697–70579. Jang JS, Sun C (1995) Proc of the IEEE 83:378–40680. Kosko B (1991) Neural networks and fuzzy systems: a dynamical systems ap-

proach to machinhe intelligence, Prentice Hall, Englewood Cliffs, NJ81. Lin C, Lee C (1996) Neural fuzzy systems: a neural fuzzy synergism to intelligent

systems, Prentice-Hall, Englewood Cliffs, NJ82. Mitra S, Hayashi Y (2000) IEEE Trans Neural Netw 11(3):748–76883. Nauck D (1997) Neuro-Fuzzy Systems: Review and Prospects, in: Proc. Fifth

European Congress on Intelligent Techniques and Soft Computing (EUFIT’97),pp. 1044–1053

84. Fuller R (2000) Introduction to Neuro-Fuzzy Systems, Springer, Berlin Heidel-berg New York

85. Castellano G, Castiello C, Fanelli A, Mencar C (2005) Fuzzy Sets Syst149(1):187–207

86. Castiello C, Gorecki P, Caponetti L (2005) Neuro-Fuzzy Analysis of Docu-ment Images by the KERNEL System, Lecture Notes in Artificial Intelligence3849:369–374

87. Caponetti L, Castiello C, Gorecki P (2007) Document Page Segmentation usingNeuro-Fuzzy Approach, to appear in Applied Soft Computing Journal

88. Gorecki P, Caponetti L, Castiello C (2006) Multiscale Page Segmentation us-ing Wavelet Packet Analysis, in: Abstracts of VII Congress Italian Society forApplied and Industrial Mathematics (SIMAI 2006), p. 210

89. of Oulu Finland U, Document Image Database, http://www.ee.oulu.fi/research/imag/document/

90. Hinds S, Fisher J, D’Amato D (1990) A document skew detection method usingrun-length encoding and Hough transform, in: Proc. of the 10th Int. Conferenceon Pattern Recognition (ICPR), pp. 464–468

91. Hough P (1959) Machine Analysis of Bubble Chamber Pictures, in: InternationalConference on High Energy Accelerators and Instrumentation, CERN

92. Srihari S, Govindaraju V (1989) Mach Vision Appl 2:141–15393. Gonzalez R, Woods R (2007) Digital Image Processing 3rd edition, Prentice

Hall94. Lindeberg T (1994) Scale-space theory in computer vision, Kluwer, Boston95. Watt A, Policarpo F (1998) The Computer Image, ACM, Addison-Wesley96. Sammon J (1970) IEEE Trans Comput C-19:826–82997. Holland J (1992) Adaptation in Natural and Artificial Systems reprint edition,

MIT, Cambridge, MA,98. Mitchell M (1996) An Introduction to Genetic Algorithms, MIT, iSBN:0-262-

13316-4

Page 279: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Soft-Labeling Image Scheme Using Fuzzy Support Vector Machine

Kui Wu and Kim-Hui Yap

School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798

Summary. In relevance feedback of content-based image retrieval (CBIR) systems, the number of training samples is usually small, since image labeling is a time-consuming task and users are often unwilling to label too many images during the feedback process. This results in the small sample problem, where the performance of relevance feedback is constrained by the small number of training samples. In view of this, we propose a soft-labeling technique that investigates the use of unlabeled data in order to enlarge the training data set. The contribution of this book chapter is the development of a soft-labeling framework that strives to address the small sample problem in CBIR systems. By studying the characteristics of labeled images, we propose to utilize an unsupervised clustering algorithm to select unlabeled images, which we call soft-labeled images. The relevance of the soft-labeled images is estimated using a fuzzy membership function, and integrated into the fuzzy support vector machine (FSVM) for effective learning. Experimental results based on a database of 10,000 images demonstrate the effectiveness of the proposed method.

1 Introduction

1.1 Background

The recent explosion in the volume of image data has driven the demand for efficient techniques to index and access image collections. These include applications such as online image libraries, e-commerce, biomedicine, military and education, among others. Content-based image retrieval (CBIR) has been developed as a scheme for managing, searching, filtering, and retrieving image collections. CBIR is a process of retrieving a set of desired images from the database on the basis of visual contents such as color, texture, shape, and spatial relationships that are present in the images. Traditional text-based image retrieval uses keywords to annotate images.



This involves a significant amount of human labor in the manual annotation of large-scale image databases. In view of this, CBIR has been proposed as an alternative to text-based image retrieval. Many research and commercial CBIR systems have been developed, such as QBIC [6], MARS [19], Virage [1], Photobook [18], VisualSEEk [23], PicToSeek [7] and PicHunter [5].

One of the most challenging problems in building a successful image retrieval system lies in bridging the semantic gap. CBIR systems interpret the user information needs based on a set of low-level visual features (color, texture, shape) extracted from the images. However, these features may not correspond to the user interpretation and understanding of image contents. Thus, a semantic gap exists between the high-level concepts and the low-level features in CBIR. In view of this, relevance feedback has been introduced to address these problems [2, 5, 7, 8, 10, 12, 13, 17, 19–21, 25, 27, 29–34]. The main idea is that the user is incorporated into the retrieval system to provide his/her evaluation of the retrieval results. This enables the system to learn from the feedback in order to retrieve a new set of images that better satisfy the user information requirement. Many relevance feedback algorithms have been adopted in CBIR systems and have demonstrated considerable performance improvement [2, 5, 7, 8, 10, 12, 13, 17, 19–21, 25, 27, 29–34]. Some well-known methods include query refinement [19], feature re-weighting [10, 20], statistical learning [5, 25, 29], neural networks [12, 13, 17, 33, 34], and the support vector machine (SVM) [2, 8, 27, 30, 31].

Query refinement and feature re-weighting are two widely used relevance feedback methods in CBIR. Query refinement tries to reach an optimal query point by moving it towards relevant images and away from the irrelevant ones. This technique has been implemented in many CBIR systems; the best-known implementation is the multimedia analysis and retrieval system (MARS) [19]. The re-weighting technique updates the weights of the feature vectors so as to emphasize the feature components that help to retrieve relevant images, while de-emphasizing those that hinder this process. It uses a heuristic formulation to adjust the weight parameters empirically. Statistical learning has been developed by modeling the probability distribution of images in the database [5, 29]. A Bayesian classifier has been proposed that treats positive and negative feedback samples with different strategies [25]. Positive examples are used to estimate a Gaussian distribution that represents the desired images for a given query, while the negative examples are used to modify the ranking of the retrieved candidates. Neural networks have been adopted in interactive image retrieval in view of their learning capability and generalization power [12, 13, 17, 33, 34]. A fuzzy radial basis function network (FRBFN) has been proposed to learn the users' fuzzy perception of visual contents using fuzzy relevance feedback [33, 34]. It provides a natural way to model the user interpretation of image similarity. Another popular relevance feedback method in CBIR is centered on the SVM [2, 8, 27, 30, 31]. The SVM is a powerful learning machine: it finds an optimal separating hyperplane that maximizes the margin between two classes in a kernel-induced feature space.


SVM-based active learning has been proposed to carefully select the samples shown to the users for labeling, in order to achieve maximal information gain in decision-making [27]. It chooses the unseen images that are closest to the SVM decision hyperplane as the most informative images for feedback.

1.2 Related Work

Despite the previous works on relevance feedback for CBIR systems, it is still a challenging task to develop effective and efficient interactive mechanisms that yield satisfactory retrieval performance. One key difficulty associated with relevance feedback is the lack of sufficient labeled images, since users usually do not have the patience to label a large number of images. Therefore, the performance of relevance feedback methods is often constrained by the limited number of training samples. To deal with this problem, some works have incorporated unlabeled data to improve the learning performance. The Discriminant Expectation Maximization (D-EM) algorithm has been introduced to incorporate the unlabeled samples in estimating the underlying probability distribution [32]. The results are promising, but the computational complexity can be significant for large databases. The transductive support vector machine (TSVM) for text classification has been proposed to tackle the problem by incorporating the unlabeled data [11]. It has also been applied to image retrieval [30]. The method incorporates unlabeled images to train an initial SVM, followed by standard active learning. It is, however, observed that the performance of this method may be unstable in some cases. Incorporating prior knowledge into the SVM has also been introduced to resolve the small sample problem [31]. All these methods show some promising outcomes; however, few can learn from the labeled and unlabeled data effectively.

To solve the small sample problem faced by current relevance feedback methods, we develop a soft-labeling framework in this chapter that integrates the advantages of soft-labeling and the fuzzy support vector machine (FSVM). It exploits inexpensive unlabeled data to augment the small set of labeled data, and hence potentially improves the retrieval performance. This is in contrast to most existing relevance feedback approaches in CBIR systems, which are concerned with the use of labeled data only. The useful unlabeled images are identified by exploiting the characteristics of the labeled images. Soft-labels of "relevant" or "irrelevant" are then automatically propagated to the selected unlabeled images by a label propagation process. As these images are not labeled explicitly by the users, there is a potential imprecision embedded in their class information. In view of this, a fuzzy membership function is employed to estimate the class membership of the soft-labeled images. The fuzzy information is then integrated into the FSVM for active learning.

The organization of the rest of this chapter is outlined as follows. Section 2 presents an overview of the proposed soft-labeling framework.


In Sect. 3, we describe the FSVM and discuss the soft-label estimation scheme in detail. We further explore the fuzzy membership function, which is developed to determine the implicit class membership of the soft-labeled images. Experimental results using the proposed method are discussed in Sect. 4. Finally, concluding remarks are given in Sect. 5.

2 Overview of the Proposed Soft-Labeling Framework

2.1 Overview of the System

The proposed soft-labeling framework is a unified framework that incorporates soft-labeling into the FSVM in the context of CBIR. A general overview of the framework is summarized in Fig. 1.

The main processing of the system involves the offline and online stages. Offline processing includes feature extraction, representation, and organization. Online processing is the interaction between the user and the system. The user first submits his/her query to the system through query-by-example (QBE). The system performs a K-nearest neighbor (K-NN) search using the Euclidean distance for similarity matching. The top l0 most similar images are shown to the user for feedback, and the user labels each of the l0 images as either relevant or irrelevant. Based on the l0 labeled images, an initial SVM classifier is trained. SVM active learning is then employed by selecting the l unlabeled images that are closest to the current SVM decision boundary for the user to label. The l newly labeled images are added to the previously labeled training set. Next, a two-stage clustering is performed separately on the labeled relevant and irrelevant images. The formed clusters are used for unlabeled image selection and soft-label assignment. A fuzzy membership function is further developed to estimate the class membership of the soft-labeled images. An FSVM is then trained by emphasizing the labeled images over the soft-labeled images during training. A new ranked list of images which better approximates the user's preferences is obtained and presented to the user. If the user is unsatisfied with the retrieval results, SVM active learning is utilized to present another set of l unlabeled images that are the most informative for the user to label. This feedback process repeats until the user is satisfied with the retrieval results. A minimal sketch of this online loop is given below.
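To make the flow concrete, here is a minimal Python sketch of the online loop, written under stated assumptions: the callables ask_user, train_svm, two_stage_cluster and soft_label_and_train_fsvm are hypothetical stand-ins for components detailed later in the chapter, and the classifier is assumed to expose a scikit-learn-style decision_function. It illustrates the workflow only; the authors' implementation was written in Matlab.

```python
import numpy as np

def feedback_loop(query, feats, ask_user, train_svm, two_stage_cluster,
                  soft_label_and_train_fsvm, l0=5, l=5, rounds=5):
    # Initial K-NN search under the Euclidean distance.
    d = np.linalg.norm(feats - query, axis=1)
    shown = np.argsort(d)[:l0]
    labeled, labels = list(shown), list(ask_user(shown))   # user feedback
    clf = train_svm(feats[labeled], labels)                # initial SVM
    for _ in range(rounds):
        # Active learning: query the l unlabeled images nearest the boundary.
        pool = np.setdiff1d(np.arange(len(feats)), labeled)
        near = pool[np.argsort(np.abs(clf.decision_function(feats[pool])))[:l]]
        labeled += list(near)
        labels += list(ask_user(near))
        # Two-stage clustering, soft-labeling and FSVM training (Sect. 3).
        clusters = two_stage_cluster(feats[labeled], labels)
        clf = soft_label_and_train_fsvm(feats, labeled, labels, clusters)
    return clf
```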

Fig. 1. General overview of the proposed soft-labeling framework

2.2 Feature Extraction

Feature extraction and representation is a fundamental process in CBIR systems. Color, texture, and shape are the most frequently used visual features in current CBIR systems. Each feature may have several representations; no single best representation exists for a given feature due to human perceptual subjectivity, and different representations characterize different aspects of the feature. The general guideline for the selection of low-level features when designing a CBIR system should obey the following criteria: perceptual similarity, efficiency, stability, and scalability. Based on this guideline and a literature survey of different features, color and texture are employed in this work. Color histogram [26], color moments [16] and color auto-correlogram [10] are chosen as the color feature representation, while wavelet moments [24] and Gabor wavelet [15] are selected as the texture feature representation.


The color histogram, representing the first-order color distribution in an image, is one of the most widely used color descriptors. It is easy to compute and invariant to rotation, translation, and viewing axis. We implement the color histogram by first converting the RGB representation of each image into its HSV equivalent. Then, the H, S and V components are uniformly quantized into 8, 2 and 2 bins respectively, to obtain a 32-dimensional feature vector. A sketch of this computation follows.
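The following Python sketch computes such a 32-bin HSV histogram; the PIL/numpy helpers and the uniform binning arithmetic are our assumptions, since the chapter does not prescribe an implementation:

```python
import numpy as np
from PIL import Image

def hsv_histogram(path, bins=(8, 2, 2)):
    # Convert RGB -> HSV, then quantise H, S, V uniformly into 8, 2, 2 bins.
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=np.int64)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]      # each in 0..255
    idx = ((h * bins[0] // 256) * bins[1] * bins[2]
           + (s * bins[1] // 256) * bins[2]
           + (v * bins[2] // 256))                       # joint bin index
    hist = np.bincount(idx.ravel(), minlength=bins[0] * bins[1] * bins[2])
    return hist / hist.sum()                             # 32-dim, normalised
```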

Color moments have been proposed to overcome the quantization effects of the color histogram. They characterize the color distribution of an image by its moments (mean, variance and skewness). In this study, the first two moments (mean and variance) of the R, G and B color channels are extracted as the color feature representation to form a six-dimensional feature vector, as in the sketch below.
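A corresponding sketch for the six-dimensional colour-moment feature; again an illustrative assumption rather than the authors' code:

```python
import numpy as np

def color_moments(rgb_image):
    # rgb_image: (H, W, 3) array; mean and variance of each R, G, B channel.
    pix = np.asarray(rgb_image, dtype=np.float64).reshape(-1, 3)
    return np.concatenate([pix.mean(axis=0), pix.var(axis=0)])  # 6-dim
```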

The color auto-correlogram is a two-dimensional spatial extension of the color histogram. The color histogram does not provide any spatial information, therefore images with similar histograms may have different appearances. The color correlogram integrates spatial information with the color histogram by constructing a color co-occurrence matrix indexed by color pairs and distance, with each entry (i, j) representing the probability of finding a pixel of color j at a distance k from a pixel of color i. The storage requirement for a co-occurrence matrix is significant, therefore only its main diagonal is computed and stored, which is known as the color auto-correlogram. The auto-correlogram of the image I for color $C_i$ is given as:

$$\gamma_{C_i}^{(k)}(I) = \Pr\left[\,|p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i}\,\right], \qquad (1)$$

where $p_1$ is a pixel of color $C_i$ in the image $I$ and $p_2$ is another pixel of the same color $C_i$ at a distance of $k$ away from $p_1$. The D8 distance (chessboard distance) is chosen as the distance measure: $D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$, i.e. the greater of the distances in the x- and y-directions.
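A naive estimate of (1) can be computed as below. The prior quantisation of pixels to integer colour indices and the per-colour normalisation are our assumptions:

```python
import numpy as np

def auto_correlogram(q, n_colors, k):
    """Estimate gamma^(k)_c of equation (1) for every colour c, on an image q
    whose pixels are integer colour indices 0..n_colors-1.  Under the D8
    (chessboard) metric, the pixels at distance exactly k from p1 form the
    border of the (2k+1)x(2k+1) square centred on p1."""
    H, W = q.shape
    hits = np.zeros(n_colors)
    total = np.zeros(n_colors)
    border = [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
              if max(abs(dy), abs(dx)) == k]
    for dy, dx in border:
        # p1 pixels whose neighbour p1 + (dy, dx) stays inside the image.
        a = q[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
        b = q[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)]
        same = a == b
        hits += np.bincount(a[same], minlength=n_colors)
        total += np.bincount(a.ravel(), minlength=n_colors)
    return hits / np.maximum(total, 1)   # fraction of same-colour neighbours
```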

Wavelet moments describe the global texture properties of images using the energy of discrete wavelet transform subbands. This simple wavelet-transform feature of an image is constructed using the mean and standard deviation of the energy distribution at each decomposition level. This in turn corresponds to the distribution of "edges" in the horizontal, vertical, and diagonal directions at different resolutions. In this study, we employ the Daubechies wavelet transform with a three-level decomposition. The mean and standard deviation of the transform coefficients are used to compose a 20-dimensional feature vector.

The Gabor wavelet is widely adopted to extract texture features, and has been shown to be very efficient. Basically, Gabor filters are a group of wavelets, with each wavelet capturing energy at a specific frequency and a specific direction. Expanding a signal using this basis provides a localized frequency description, therefore capturing local features/energy of the signal. A 2D Gabor function $g(x, y)$ is defined as:

$$g(x, y) = \left(\frac{1}{2\pi\sigma_x\sigma_y}\right) \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi j W x\right]. \qquad (2)$$


The self-similar functions are obtained by appropriate dilations and rotations of $g(x, y)$ through the generating function:

$$g_{mn}(x, y) = a^{-m} g(x', y'), \quad x' = a^{-m}(x \cos\theta_n + y \sin\theta_n), \quad y' = a^{-m}(-x \sin\theta_n + y \cos\theta_n), \qquad (3)$$

where $a > 1$, $m$ and $n$ specify the scale and orientation of the wavelet respectively, and $W$ is the modulation frequency. The half-peak radial bandwidth is chosen to be one octave, which determines $\sigma_x$ and $\sigma_y$. In this study, Gabor wavelet filters spanning four scales (0.05, 0.1, 0.2 and 0.4) and six orientations ($\theta_0 = 0$, $\theta_{n+1} = \theta_n + \pi/6$) are used. For a given image $I(x, y)$, its Gabor wavelet transform is defined by:

$$W_{mn}(x, y) = \int I(x_1, y_1)\, g_{mn}^{*}(x - x_1, y - y_1)\, dx_1\, dy_1, \qquad (4)$$

where $*$ denotes complex conjugation. The mean and standard deviation of the transform coefficient magnitudes are used to form a 48-dimensional feature vector. A sketch of this filter bank is given below.
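The sketch below samples one complex Gabor kernel per frequency/orientation pair and keeps the mean and standard deviation of the response magnitudes (4 scales × 6 orientations × 2 statistics = 48 features). The kernel size and the one-octave bandwidth ratio sigma = 0.56/W are assumed values, since the chapter only states that the half-peak radial bandwidth determines σx and σy:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_features(img, freqs=(0.05, 0.1, 0.2, 0.4), n_orient=6,
                   size=31, bw_ratio=0.56):
    # img: 2D float grayscale array.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    feats = []
    for W in freqs:
        sx = sy = bw_ratio / W                     # assumed octave bandwidth
        for n in range(n_orient):
            th = n * np.pi / 6.0                   # theta_0 = 0, step pi/6
            xr = x * np.cos(th) + y * np.sin(th)   # rotated coordinates, cf. (3)
            yr = -x * np.sin(th) + y * np.cos(th)
            g = (np.exp(-0.5 * (xr ** 2 / sx ** 2 + yr ** 2 / sy ** 2))
                 * np.exp(2j * np.pi * W * xr) / (2 * np.pi * sx * sy))  # (2)
            mag = np.abs(fftconvolve(img, g, mode="same"))  # |W_mn|, cf. (4)
            feats += [mag.mean(), mag.std()]
    return np.asarray(feats)                       # 4 * 6 * 2 = 48 features
```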

After all the color and texture features have been extracted offline, we concatenate the feature elements from all the individual features into an overall feature vector with a dimension of 170. Since different components within a feature vector may have different physical quantities, their magnitudes can be inconsistent, thereby biasing the similarity measure. We perform a Gaussian normalization on all the feature vectors to ensure that equal emphasis is put on each component within a feature vector [20], as sketched below.
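A common realisation of the Gaussian normalisation of [20] is the 3-sigma variant sketched here; the clipping to [-1, 1] belongs to that variant and is not stated explicitly in this chapter:

```python
import numpy as np

def gaussian_normalise(F):
    # F: feature matrix, rows = images, columns = feature components.
    mu, sigma = F.mean(axis=0), F.std(axis=0)
    Fn = (F - mu) / (3 * np.where(sigma > 0, sigma, 1))  # ~99% fall in [-1, 1]
    return np.clip(Fn, -1.0, 1.0)
```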

3 Soft-Labeling Fuzzy Support Vector Machine

3.1 Proposed Concept of Soft Labeling

Conventional relevance feedback in interactive CBIR systems uses only the labeled images for learning. However, the labeled images are available only in small quantities, since it is not user-friendly to ask the users to label too many images for feedback. This results in the small sample problem, where learning from such a small number of training samples may not produce good retrieval results, even for a powerful learning machine such as the SVM. Therefore, it is imperative to find solutions to the small sample problem faced by relevance feedback.

Considering that obtaining a large number of labeled images is labor intensive while unlabeled images are readily available and abundant, we propose to augment the available labeled images by making use of the potential role of unlabeled images. It is worth noting that unlabeled images can degrade the performance if used improperly. Consequently, they should be carefully chosen so that they will be beneficial to the retrieval performance.


Each selected unlabeled image is assigned a soft-label of either "relevant" or "irrelevant" based on an algorithm to be explained in Sect. 3.2. These soft-labeled images are fuzzy in nature since they are not explicitly labeled by the users; therefore the potential imprecision embedded in their class information should be taken into consideration. We employ a fuzzy membership function to determine the degree of uncertainty of each soft-labeled image, hence putting into context the relative importance of these images. These soft-labeled samples are then combined with the labeled images to train the FSVM.

3.2 Selection of Unlabeled Images and Label Propagation

In this work, we present a method to select the unlabeled images by studying the characteristics of the labeled images. The selection criterion is to determine certain informative samples among the unlabeled ones which are "similar" to the labeled images in terms of the visual features, for soft-labeling and fuzzy membership estimation. The enlarged hybrid data set, consisting of both soft-labeled and explicitly labeled samples, is then utilized to train the FSVM.

It is observed that the labeled images usually exhibit local characteristics of image similarity. To exploit this property, it is desirable to adopt a multi-cluster local modeling strategy. Taking into account the local multi-cluster nature of image similarity, we employ a two-stage clustering process to determine the local clusters. The labeled samples are clustered according to their types: relevant or irrelevant. K-means clustering is one of the most widely used clustering algorithms. It groups the samples into K clusters by using an iterative algorithm that minimizes the sum of distances from each sample to its respective cluster centroid over all the clusters. Notwithstanding its attractive features, K-means clustering requires a specified number of clusters in advance and is sensitive to the initial estimates of the clusters. To rectify this difficulty, we adopt a two-stage clustering strategy in this work. First, subtractive clustering is employed as a preprocessing step to estimate the number and structure of the clusters, as it is fast, efficient, and does not require the number of clusters to be specified a priori [3]. These estimates are then employed by K-means to perform clustering based on iterative optimization in the second stage.

Subtractive clustering treats each sample as a potential cluster center. It computes a potential field which determines the likelihood of a sample being a cluster center. Let $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^R$ be a set of R-dimensional samples to be clustered. The initial potential function $P_{k=1}(i)$ of the ith sample $x_i$, expressed in terms of the Euclidean distance to the other samples $x_j$, is defined as:

$$P_{k=1}(i) = \sum_{j=1}^{n} \exp\left(-\frac{\|x_i - x_j\|^2}{r_a^2}\right), \qquad i = 1, \ldots, n, \qquad (5)$$

where $r_a$ is a positive coefficient defining the range of the field.


The potential function has large values at densely populated neighborhoods, suggesting a strong likelihood that clusters exist in these regions. The subtractive clustering algorithm can be summarized as follows:

1. Compute $P_{k=1}(i)$ for $i = 1, \ldots, n$ and select the sample with the highest potential as the first cluster center. Let $x_1^*$ and $P_1^*$ denote the first cluster center and its potential, respectively.

2. For $k = 2, \ldots, K$, update the potential of each sample according to:

$$P_k(i) = P_{k-1}(i) - P_{k-1}^* \exp\left(-\frac{\|x_i - x_{k-1}^*\|^2}{r_b^2}\right), \qquad i = 1, \ldots, n, \qquad (6)$$

where $x_{k-1}^*$ and $P_{k-1}^*$ are the (k−1)th cluster center and its potential value, $r_b$ is a positive coefficient defining the neighborhood radius for potential reduction, and $K$ is the maximum number of potential clusters. Equation (6) serves to remove the residual potential of the (k−1)th cluster center from the current kth iteration field. The samples that are close to the (k−1)th cluster center will experience a greater reduction in potential, hence reducing their likelihood of being chosen as the next center. Let $x_k^*$ be the sample with the maximum potential $P_k^*$ in the current kth iteration; the following criteria are used to determine whether it should be selected as the current cluster center:

if $P_k^*/P_1^* > \varepsilon_A$, accept $x_k^*$ as the kth cluster center;
else if $P_k^*/P_1^* < \varepsilon_R$, reject $x_k^*$ and terminate the clustering process;
else if $P_k^*/P_1^* + d_{\min}/r_a \geq 1$, accept $x_k^*$ as a cluster center;
else reject $x_k^*$, set its potential to zero ($P_k^* \leftarrow 0$), and repeat the process with the sample having the next highest potential.

3. Repeat step 2 until the termination criterion is satisfied or the maximum number of iterations is reached.

In step 2, $\varepsilon_A$ is the acceptance ratio above which a sample will be accepted as a cluster center, $\varepsilon_R$ is the rejection ratio below which a sample will be rejected, and $d_{\min}$ is the shortest distance between $x_k^*$ and all previously found cluster centers. If the potential of the sample falls between the acceptance and rejection ratios, we accept it only if it achieves a good compromise between having a reasonable potential and being sufficiently far from all existing cluster centers. A direct transcription of this procedure is sketched below.
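In the sketch below, the cap K on candidate centres is an assumed safeguard and rb = 1.2 ra follows the setting reported in Sect. 4.1; the returned centres can then seed K-means in the second stage:

```python
import numpy as np

def subtractive_clustering(X, ra, eps_a=0.5, eps_r=0.2, K=25):
    # X: (n, R) sample matrix; returns the chosen cluster centres.
    rb = 1.2 * ra
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||xi - xj||^2
    P = np.exp(-d2 / ra ** 2).sum(axis=1)                 # equation (5)
    idx = [int(P.argmax())]                               # step 1
    P1 = P_prev = P[idx[0]]
    for _ in range(1, K):                                 # step 2
        P = P - P_prev * np.exp(-d2[idx[-1]] / rb ** 2)   # equation (6)
        while True:
            i = int(P.argmax())
            ratio = P[i] / P1
            if ratio > eps_a:                             # accept outright
                pass
            elif ratio < eps_r:                           # reject and stop
                return X[idx]
            elif ratio + np.sqrt(d2[i, idx].min()) / ra < 1:
                P[i] = 0                                  # reject this sample
                continue                                  # try next highest
            idx.append(i)
            P_prev = P[i]
            break
    return X[idx]
```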

After subtractive clustering, we obtain a set of cluster centers, which is used as the initial center estimates for K-means clustering. Two separate sets of clusters, relevant and irrelevant, are obtained after the two-stage clustering. Unlabeled image selection and soft-label assignment are then based on a similarity measure analogous to the K-NN technique; that is, samples close in distance will potentially have similar class labels.


For each cluster formed by the labeled images using the two-stage clustering scheme, the K nearest unlabeled neighbors are chosen based on their Euclidean distances to the center of the respective labeled cluster. The label (relevant or irrelevant) of each labeled cluster is then propagated to the unlabeled neighbors. This is referred to as the soft-labeling process. As the computational cost increases with the number of soft-labeled images, only the most "similar" neighbor of each cluster is selected in this work.

3.3 Soft Relevance Membership Estimation for Soft-Labeled Images

In consideration of the potential fuzziness associated with the soft-labeled images, our objective here is to determine a soft relevance membership function $g(x_k): \mathbb{R}^R \to [0, 1]$ that assesses each soft-labeled image $x_k$ and assigns it a proper relevance value between zero and one. The estimated relevance of the soft-labeled images is then used in FSVM training. In this study, $g(x_k)$ is determined by two measures, $f_C(x_k)$ and $f_A(x_k)$. First, since clustering has been performed on each positive (relevant) and negative (irrelevant) class separately to get multiple clusters per class, the obtained clusters in each class can be employed to generate the membership value of $x_k$, namely $f_C(x_k)$. Further, the agreement between the predicted label obtained in Sect. 3.2 and the predicted label obtained from the trained FSVM can also be utilized to assess the degree of relevance of the soft-labeled samples, namely $f_A(x_k)$. These two measures affecting the fuzzy membership are combined to produce the final soft relevance estimate, namely:

$$g(x_k) = f_C(x_k)\, f_A(x_k). \qquad (7)$$

Let $v_{Si}$ denote the center of the ith cluster with the same class label as the soft-labeled image $x_k$, and let $v_{Oj}$ denote the center of the jth cluster with the opposite class label to $x_k$. Then $\min_i (x_k - v_{Si})^{T}(x_k - v_{Si})$ and $\min_j (x_k - v_{Oj})^{T}(x_k - v_{Oj})$ represent the distances between $x_k$ and the nearest cluster centers with the same and opposite class labels, respectively. We then define the following expression:

$$Q(x_k) = \frac{\min_i\, (x_k - v_{Si})^{T}(x_k - v_{Si})}{\min_j\, (x_k - v_{Oj})^{T}(x_k - v_{Oj})}. \qquad (8)$$

Intuitively, the closer a soft-labeled image is to the nearest cluster of the same class label, the higher is its degree of relevance. In contrast, the closer a soft-labeled image is to the nearest cluster of the opposite class label, the lower is its degree of relevance. Based on this argument, an exponentially based fuzzy function is selected:

$$f_C(x_k) = \begin{cases} \exp(-a_1 Q(x_k)) & \text{if } Q(x_k) < 1 \\ 0 & \text{otherwise,} \end{cases} \qquad (9)$$


where $a_1 > 0$ is a scaling factor. This membership function is divided into two scenarios. If the distance ratio is smaller than 1, suggesting that the soft-labeled image is closer to the nearest cluster with the same class label, then we estimate its soft relevance. Otherwise, if the soft-labeled image is closer to the nearest cluster with the opposite class label, a zero value is assigned.

The second factor of the fuzzy function is chosen as a sigmoid function as follows:

$$f_A(x_k) = \begin{cases} \dfrac{1}{1 + \exp(-a_2 y)} & \text{if the soft-label is positive} \\[2mm] \dfrac{1}{1 + \exp(a_2 y)} & \text{otherwise,} \end{cases} \qquad (10)$$

where $a_2 > 0$ is a scaling factor and $y$ is the directed distance of the soft-labeled image $x_k$ to the FSVM boundary (the decision function output of the FSVM for the soft-labeled image $x_k$). We explain the rationale of the fuzzy expression in (10) by first considering that the soft-label of the selected image has been determined as positive in Sect. 3.2. In this case, the upper equation in (10) is used. If $y$ has a large positive value, this suggests that the image is most likely relevant. Since there is a strong agreement between the predicted soft-label from Sect. 3.2 and the predicted class label using the trained FSVM, its fuzzy membership value should be set to a large value close to unity. If $y$ has a large negative value, this suggests that the image is most likely irrelevant. Since there is a strong disagreement between the predicted soft-label from Sect. 3.2 and the predicted class label using the trained FSVM, its fuzzy membership value should be set to a small value close to zero. The same arguments apply when the soft-label of the selected image has been determined to be negative in Sect. 3.2. The estimation of $g(x_k)$ is sketched below.
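Equations (7)-(10) combine into a few lines; the array layout of the cluster centres and the boolean positive flag are our assumptions:

```python
import numpy as np

def soft_relevance(xk, same_centres, opp_centres, y, positive, a1=1.0, a2=3.0):
    # same_centres / opp_centres: (m, R) arrays of cluster centres whose label
    # agrees / disagrees with xk's soft-label; y: FSVM decision output for xk.
    dS = ((same_centres - xk) ** 2).sum(axis=1).min()    # nearest same-label
    dO = ((opp_centres - xk) ** 2).sum(axis=1).min()     # nearest opposite-label
    Q = dS / dO                                          # equation (8)
    fC = np.exp(-a1 * Q) if Q < 1 else 0.0               # equation (9)
    fA = 1.0 / (1.0 + np.exp(-a2 * y if positive else a2 * y))  # equation (10)
    return fC * fA                                       # equation (7)
```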

3.4 Support Vector Machine (SVM) and Active Learning

The SVM is an implementation of the method of structural risk minimization (SRM) [28]. This induction principle is based on the fact that the error rate of a learning machine on test data (i.e. the generalization error rate) is bounded by the sum of the training error rate and a term that depends on the Vapnik–Chervonenkis (VC) dimension. The basic idea of the SVM involves first transforming the data in the original input space to a higher dimensional feature space by utilizing the technique known as the "kernel trick". In doing so, nonlinearly separable data can be transformed into a linearly separable feature space. An optimal decision hyperplane can then be constructed in this high dimensional feature space by maximizing the margin of separation between positive and negative samples. A linear decision boundary constructed in the feature space corresponds to a nonlinear decision boundary in the input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the mapping into the feature space. The optimal hyperplane is determined by solving a quadratic programming (QP) problem, which can be converted to its dual problem by introducing Lagrangian multipliers.


The training data points that are nearest to the separating hyperplane are called support vectors. The optimal hyperplane is specified only by the support vectors.

Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of $n$ training samples, where $x_i \in \mathbb{R}^R$ is an R-dimensional sample in the input space, and $y_i \in \{-1, 1\}$ is the class label of $x_i$. The SVM first transforms the data in the original input space into a higher dimensional feature space through a mapping function $z = \varphi(x)$. It then finds the optimal separating hyperplane with minimal classification errors. The hyperplane can be represented as:

$$w \cdot z + b = 0, \qquad (11)$$

where $w$ is the normal vector of the hyperplane, and $b$ is the bias, which is a scalar. In particular, the set $S$ is said to be linearly separable if the following inequalities hold for all training data in $S$:

$$\begin{cases} w \cdot z_i + b \geq 1 & \text{if } y_i = 1 \\ w \cdot z_i + b \leq -1 & \text{if } y_i = -1, \end{cases} \qquad i = 1, \ldots, n. \qquad (12)$$

For the linearly separable case, the optimal hyperplane can be obtained by maximizing the margin of separation between the two classes. Maximizing the margin leads to solving the following constrained optimization problem:

$$\text{minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot z_i + b) \geq 1, \quad i = 1, \ldots, n. \qquad (13)$$

This optimization problem can be solved by QP. However, for the linearly nonseparable case, where the inequalities in (12) do not hold for some data points in $S$, a modification to the original SVM formulation can be made by introducing nonnegative variables $\{\xi_i\}_{i=1}^{n}$. In this case, the margin of separation is said to be soft. The constraint in (12) is modified to:

$$y_i(w \cdot z_i + b) \geq 1 - \xi_i, \qquad i = 1, \ldots, n. \qquad (14)$$

The $\{\xi_i\}_{i=1}^{n}$ are called slack variables. They measure the deviation of a data point from the ideal condition of pattern separability; misclassifications occur when $\xi_i > 1$. The optimal separating hyperplane is then found by solving the following constrained optimization problem:

$$\text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to } y_i(w \cdot z_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1, \ldots, n, \qquad (15)$$

where $C$ is the regularization parameter controlling the tradeoff between margin maximization and classification error. A larger value of $C$ produces a narrower-margin hyperplane with fewer misclassifications.


The optimization problem can be transformed into the following equivalent dual problem using Lagrange multipliers:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, z_i \cdot z_j \quad \text{subject to } \sum_{i=1}^{n} y_i \alpha_i = 0,\ 0 \leq \alpha_i \leq C,\ i = 1, \ldots, n, \qquad (16)$$

where $\alpha_i$ is the Lagrange multiplier associated with the constraints in (14). The data points with $\alpha_i > 0$ are called support vectors. The optimal solution for the weight vector $w$ is a linear combination of the training samples, given by:

$$w = \sum_{i=1}^{n} \alpha_i y_i z_i. \qquad (17)$$

The decision function of the SVM can then be obtained as:

$$f(x) = w \cdot z + b = \sum_{i=1}^{n} \alpha_i y_i\, z_i \cdot z + b = \sum_{i=1}^{n} \alpha_i y_i\, \varphi(x_i) \cdot \varphi(x) + b. \qquad (18)$$

It is noted that both the construction of the optimal hyperplane in (16) and the evaluation of the decision function in (18) only require the evaluation of the dot products $\varphi(x_i) \cdot \varphi(x_j)$ or $\varphi(x_i) \cdot \varphi(x)$. This implies that we do not necessarily need to know $\varphi$ in explicit form. Instead, a function $K(\cdot, \cdot)$, called the kernel function, is introduced that computes the inner product of two data points in the feature space, i.e. $K(x_i, x) = \varphi(x_i) \cdot \varphi(x)$. Three common types of kernels used in the SVM are the polynomial kernel, the radial basis function kernel, and the sigmoid kernel. Using this kernel trick, the dual optimization problem in (16) becomes:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } \sum_{i=1}^{n} y_i \alpha_i = 0,\ 0 \leq \alpha_i \leq C,\ i = 1, \ldots, n, \qquad (19)$$

and we can construct the optimal hyperplane in the feature space without having to know the mapping $\varphi$:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b. \qquad (20)$$

Active learning is designed to achieve maximal information gain or to minimize uncertainty in decision making. It selects the most informative samples to query the users for labeling. Among the various active learning techniques, SVM-based active learning is one of the most promising methods currently available [27]. It selects the samples that are closest to the current SVM decision boundary as the most informative points. Samples that are farthest away from the boundary and on the positive side are considered the most relevant images. The same selection strategy is adopted in this work, as in the sketch below.
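A sketch of this selection rule, assuming a trained classifier that exposes a scikit-learn-style decision_function returning the directed distances to the boundary:

```python
import numpy as np

def active_query(clf, feats, unlabeled_idx, l=5):
    # clf: any trained classifier with decision_function (e.g. sklearn SVC).
    d = clf.decision_function(feats[unlabeled_idx])     # directed distances
    query = unlabeled_idx[np.argsort(np.abs(d))[:l]]    # closest to boundary
    ranked = unlabeled_idx[np.argsort(d)[::-1]]         # most relevant first
    return query, ranked
```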


3.5 Fuzzy Support Vector Machine (FSVM)

Because of the nice properties of the SVM, it has been successfully utilized in many real-world applications. However, the SVM is still limited to crisp classification, where each training example belongs to either one or the other class with equal importance. There exist situations where the training samples do not fall neatly into discrete classes; they may belong to different classes with different degrees of membership. To solve this problem, the FSVM has been developed [14]. The FSVM is an extended version of the SVM that takes into consideration the different importance of training data. It exhibits the following properties that motivate us to adopt it in our framework: integration of fuzzy data, a strong theoretical foundation, and excellent generalization power.

In the FSVM, each training sample is associated with a fuzzy membership value $\{\mu_i\}_{i=1}^{n} \in [0, 1]$. The membership value $\mu_i$ reflects the fidelity of the data, or in other words, how confident we are about the actual class information of the data. The higher its value, the more confident we are about its class label. The optimization problem of the FSVM is formulated as follows [14]:

$$\text{minimize } \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \mu_i \xi_i \quad \text{subject to } y_i(w \cdot z_i + b) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1, \ldots, n. \qquad (21)$$

It is noted that the error term $\xi_i$ is scaled by the fuzzy membership value $\mu_i$. The fuzzy membership values are used to weigh the soft penalty term in the cost function of the SVM. The weighted soft penalty term reflects the relative fidelity of the training samples during training: important samples with larger membership values have more impact on the FSVM training than those with smaller values. The determination of the membership values $\{\mu_i\}_{i=1}^{n}$ has been described in Sect. 3.3, that is, $\mu_k = g(x_k)$.

Similar to the conventional SVM, the optimization problem of the FSVM can be transformed into its dual problem as follows:

$$\text{maximize } \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } \sum_{i=1}^{n} y_i \alpha_i = 0,\ 0 \leq \alpha_i \leq \mu_i C,\ i = 1, \ldots, n. \qquad (22)$$

Solving (22) leads to a decision function similar to (20), but with different support vectors and corresponding weights $\alpha_i$. A sketch of this training step follows.
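In practice, the $\mu_i C$ box constraints in (22) can be obtained from an off-the-shelf SVM by scaling the penalty parameter per sample. The sketch below uses scikit-learn's SVC, whose sample_weight argument multiplies C for each training point, together with the RBF kernel and parameters of Sect. 4.1; this is a practical stand-in, not the authors' Matlab implementation:

```python
import numpy as np
from sklearn.svm import SVC

def train_fsvm(X_lab, y_lab, X_soft, y_soft, mu, C=100.0, sigma=3.0):
    # Explicitly labeled images keep full weight 1; a soft-labeled image with
    # membership mu_i effectively gets the box constraint alpha_i <= mu_i * C.
    X = np.vstack([X_lab, X_soft])
    y = np.concatenate([y_lab, y_soft])                  # labels in {-1, +1}
    w = np.concatenate([np.ones(len(y_lab)), mu])        # fuzzy memberships
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
    return clf.fit(X, y, sample_weight=w)
```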

4 Experimental Results and Discussion

4.1 Image Database and User Interface

The framework is developed on a PC with the following specifications: Pentium 4 2.4-GHz processor, 512-MB RAM, Windows XP, and Matlab 6.5.


The performance of the framework is evaluated on an image database containing 10,000 natural images [4]. It contains 100 different semantic categories, which are predefined by the Corel Photo Gallery based on their semantic concepts, as shown in Fig. 2.

A general overview of the operation of the user interface in our retrieval system is shown in Fig. 3. Initially, the user can select a query image by browsing through the image database. The selected query is displayed at the top left corner. Next, the user can search the image database by pressing the "Search" button, and the ten most relevant images are ranked and displayed in descending order of relevance from left to right, and top to bottom.

Fig. 2. Selected sample images from the database

Fig. 3. Illustration of user interface


It is noted that under each displayed image, a pull-down menu is available which enables the user to select two possible choices of feedback, relevant and irrelevant, as illustrated in the figure. The user is simply asked to label each displayed image as either relevant or irrelevant according to his/her information need. The user can then submit his/her feedback by pressing the "Feedback" button. The system then learns from the feedback images, and presents a new ranked list of images to the user for further feedback. The process continues until the user is satisfied with the retrieved results.

The proposed soft-labeling framework can be implemented in practical applications such as image retrieval through bandwidth-limited, display-constrained devices, e.g. mobile phones with cameras, where only a small number of images is displayed to the user. For instance, a girl at the zoo sees a fox squirrel that interests her and would like to find more similar squirrel images. She takes a picture of the fox squirrel using her mobile phone and sends it as a query to the server. The server then performs a similarity comparison with the images in the database and retrieves a set of images. If the girl is unsatisfied with the retrieval results, she may provide feedback on the retrieved images displayed on the screen of her mobile phone. Conventional relevance feedback methods are not able to achieve improved performance with such a small number of feedback samples. In contrast, the proposed framework strives to utilize the unlabeled images to augment the available labeled images. In doing so, the girl can get satisfactory retrieval results within the first few iterations. Further, if the girl is cooperative and willing to provide more than one screen of feedback images before seeing the results, the proposed framework with active learning is of great value. After getting feedback for one or more screens of training images, the system can select the most informative samples to query the girl for labeling, achieving maximal information gain or minimized uncertainty in decision-making.

The proposed method is applied in our retrieval system. Subtractive clustering is utilized to determine the cluster centers of the relevant and irrelevant images. It uses the following parameters: $r_a$ is set to 0.075 and 0.25 for relevant and irrelevant samples, respectively, with $r_b = 1.2\, r_a$, $\varepsilon_A = 0.5$, and $\varepsilon_R = 0.2$. The RBF kernel, $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, is used for the SVM, where $\sigma = 3$ and the regularization parameter $C = 100$. The following parameters are used for the soft relevance membership estimation of the soft-labeled images: $a_1 = 1$, $a_2 = 3$.

4.2 Performance Evaluation

In our experiments, we use an objective measure to evaluate the performance of the proposed soft-labeling method using the FSVM. The objective measure is based on Corel's predefined ground truth; that is, the retrieved images are judged to be relevant if they come from the same category as the query. One hundred queries, one from each category, are selected for evaluation.


Fig. 4. The average precision-vs.-recall graphs after the first iteration of active learning

Retrieval performance is evaluated by ranking the database images according to their directed distances to the SVM boundary after each active learning iteration. Five iterations of feedback are recorded. The precision-vs.-recall curve is a standard performance measure in information retrieval, and is adopted in our experiments [22]. Precision is the number of retrieved relevant images over the total number of retrieved images. Recall is defined as the number of retrieved relevant images over the total number of relevant images in the collection. The precision and recall rates are averaged over all the queries (see the sketch below). The average precision-vs.-recall (APR) graph after the first iteration of active learning for five initial labeled images (l0 = 5) is shown in Fig. 4. It is observed that the precision rate decreases as recall increases. This means that when more relevant images are retrieved, a higher percentage of irrelevant images will probably be retrieved as well.
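For reference, precision and recall at a set of cutoff ranks can be computed as follows, where ranked is the list of database indices sorted by directed distance and relevant is the ground-truth set from the query's Corel category:

```python
import numpy as np

def precision_recall(ranked, relevant, cutoffs):
    hits = np.cumsum(np.isin(ranked, list(relevant)))    # relevant among top-k
    prec = [hits[c - 1] / c for c in cutoffs]            # relevant / retrieved
    rec = [hits[c - 1] / len(relevant) for c in cutoffs] # relevant / all relevant
    return prec, rec
```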

In addition, we have adopted another measure, called retrieval accuracy, to evaluate the retrieval system [9, 25]. The performance of the proposed method is given in Fig. 5 for the case of l0 = 10. The retrieval accuracy is averaged over the 100 queries. We observe that the retrieval accuracy of the proposed method increases quickly in the initial stage. This is a desirable feature, since the user can obtain satisfactory results quickly. It is worth emphasizing that the initial retrieval performance is very important, since users often expect quick results and are unwilling to provide much feedback. Hence, reducing the amount of user feedback while providing good retrieval results is of great interest for many CBIR systems. Further, the method reaches a high steady-state retrieval accuracy of 95% in about five feedback iterations, which is an improvement of 35% over its initial retrieval accuracy.


Fig. 5. Retrieval accuracy in top ten results

5 Conclusions

This chapter presents a soft-labeling framework that addresses the small sample problem in interactive CBIR systems. The technique incorporates soft-labeled images into the FSVM along with labeled images for effective retrieval. By exploiting the characteristics of the labeled images, soft-labeled images are selected through an unsupervised clustering algorithm. Further, the relevance of the soft-labeled images is estimated using the fuzzy membership function. FSVM-based active learning is then performed based on the hybrid of soft-labeled and explicitly labeled images. Experimental results confirm the effectiveness of our proposed method.

References

1. Gupta A, Jain R (1997) Visual information retrieval. Communications of the ACM 40(5):70–79
2. Chen Y, Zhou XS, Huang TS (2001) One-class SVM for learning in image retrieval. Proceedings of the IEEE International Conference on Image Processing, pp. 815–818
3. Chiu S (1994) Fuzzy model identification based on cluster estimation. Journal of Intelligent & Fuzzy Systems 2(3):267–278
4. Corel Gallery Magic 65000 (1999) http://www.corel.com
5. Cox IJ, Miller ML, Minka TP, Papathomas TV, Yianilos PN (2000) The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing 9(1):20–37
6. Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D, Yanker P (1995) Query by image and video content: The QBIC system. IEEE Computer 28(9):23–32
7. Gevers T, Smeulders AWM (2000) PicToSeek: Combining color and shape invariant features for image retrieval. IEEE Transactions on Image Processing 9:102–119
8. Guo GD, Jain AK, Ma WY, Zhang HJ (2002) Learning similarity measure for natural image retrieval with relevance feedback. IEEE Transactions on Neural Networks 13(4):811–820
9. He XF, King O, Ma WY, Li MJ, Zhang HJ (2003) Learning a semantic space from user's relevance feedback for image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 13:39–48
10. Huang J, Kumar SR, Mitra M (1997) Combining supervised learning with color correlograms for content-based image retrieval. Proceedings of ACM Multimedia, pp. 325–334
11. Joachims T (1999) Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, pp. 200–209
12. Laaksonen J, Koskela M, Oja E (2002) PicSOM – self-organizing image retrieval with MPEG-7 content descriptions. IEEE Transactions on Neural Networks 13(4):841–853
13. Lee HK, Yoo SI (2001) A neural network-based image retrieval using nonlinear combination of heterogeneous features. International Journal of Computational Intelligence and Applications 1(2):137–149
14. Lin CF, Wang SD (2002) Fuzzy support vector machines. IEEE Transactions on Neural Networks 13(2):464–471
15. Manjunath BS, Ma WY (1996) Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence 18:837–842
16. Stricker M, Orengo M (1995) Similarity of color images. Proceedings of SPIE Storage and Retrieval for Image and Video Databases
17. Muneesawang P, Guan L (2002) Automatic machine interactions for content-based image retrieval using a self-organizing tree map architecture. IEEE Transactions on Neural Networks 13(4):821–834
18. Pentland A, Picard R, Sclaroff S (1994) Photobook: Tools for content-based manipulation of image databases. Proceedings of SPIE 2185:34–47
19. Rui Y, Huang TS, Mehrotra S (1997) Content-based image retrieval with relevance feedback in MARS. IEEE International Conference on Image Processing, Washington DC, USA, pp. 815–818
20. Rui Y, Huang TS, Ortega M, Mehrotra S (1998) Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 8(5):644–655
21. Rui Y, Huang TS (2000) Optimizing learning in image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition 1:236–243
22. Salton G, McGill MJ (1982) Introduction to Modern Information Retrieval. McGraw-Hill, New York
23. Smith JR, Chang SF (1996) VisualSEEk: A fully automated content-based image query system. Proceedings of ACM Multimedia
24. Smith JR, Chang SF (1996) Automated binary texture feature sets for image retrieval. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Atlanta, GA
25. Su Z, Zhang HJ, Li S, Ma SP (2003) Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning. IEEE Transactions on Image Processing 12:924–937
26. Swain M, Ballard D (1991) Color indexing. International Journal of Computer Vision 7(1):11–32
27. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. Proceedings of the Ninth ACM Conference on Multimedia
28. Vapnik VN (1995) The Nature of Statistical Learning Theory. Springer-Verlag, New York
29. Vasconcelos N, Lippman A (1999) Learning from user feedback in image retrieval systems. Proceedings of Neural Information Processing Systems, Denver, Colorado
30. Wang L, Chan KL (2003) Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 629–634
31. Wang L, Chan KL (2004) Incorporating prior knowledge into SVM for image retrieval. Proceedings of the IEEE International Conference on Pattern Recognition, pp. 981–984
32. Wu Y, Tian Q, Huang TS (2000) Discriminant-EM algorithm with application to image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, South Carolina
33. Yap KH, Wu K (2005) Fuzzy relevance feedback in content-based image retrieval systems using radial basis function network. Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, pp. 177–180
34. Yap KH, Wu K (2005) A soft relevance framework in content-based image retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 15(12):1557–1568

Page 299: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Temporal Texture Characterization: A Review

Ashfaqur Rahman1 and Manzur Murshed2

1 Department of Computer Science, American International University Bangladesh, Dhaka, Bangladesh

2 Gippsland School of Information Technology, Monash University, Churchill, VIC, Australia

Summary. A large class of objects commonly experienced in real world scenarios exhibits characteristic motion with certain forms of regularity. The contemporary literature has coined the term "temporal texture"1 to identify image sequences of such motion patterns that exhibit spatiotemporal regularity. The study of temporal textures dates back to the early nineties, and many researchers in the computer vision community have formulated techniques to analyse them. This chapter aims to provide a comprehensive literature survey of the existing temporal texture characterization techniques.

1 Introduction

Temporal textures are textures with motion: real world image sequences of sea waves, smoke, fire, etc. that possess some stationary properties over space and time. The motion assembled by a flock of flying birds, water streams, fluttering leaves, and waving flags also illustrates such motion. Temporal texture characterization is of vital importance to computer vision, electronic entertainment, and content-based video coding research, with a number of potential applications in areas including recognition (automated surveillance and industrial monitoring), synthesis (animation and computer games), and segmentation (robot navigation and MPEG-4).

The phenomena commonly observed in temporal textures have prompted many researchers to formulate techniques to analyse these distinctive motion patterns. As the current literature shows, research is mostly devoted to developing features and models for characterizing temporal texture motion patterns. Our main focus here is on outlining the working principles of temporal texture characterization techniques. Besides characterization, there are also recent research works on the synthesis, coding, segmentation, and retrieval of temporal texture image sequences.

1 Some authors use the term "dynamic texture" [6] to identify similar motion patterns.

The purpose of characterization is to choose a set of characteristic features or define a mathematical model from the underlying texture so that image sequences with similar textures are classified into one class (group). Characterization of temporal textures is performed from the spatiotemporal distribution of dynamics over the image sequences by extracting characteristic spatiotemporal features. Extraction of features plays an important role in the accuracy of classification. As temporal textures exhibit spatiotemporal regularity of dynamics with indeterminate spatial and temporal extent, both the spatial and the temporal domain need to be explored exhaustively. Moreover, from the real time application point of view, the characterization process has to be quick enough for time sensitive applications. In this chapter we elaborate on the diverse features used by different characterization techniques and analyse their effectiveness in utilizing the time–space dynamics.

This chapter is organized as follows. Section 2 explains some background concepts and algorithms that are frequently used in discussing different temporal texture analysis techniques. The review is presented in Sect. 3, and Sect. 4 concludes the chapter.

2 Background

In this section we elaborate some basic concepts essential to comprehending the detailed working principles of different temporal texture characterization techniques. In Sect. 2.1, we define an image sequence. The motion estimation approaches most commonly used by contemporary characterization techniques are explained in Sect. 2.2. Many temporal texture characterization techniques operate on computed motion, and the resulting motion frame sequence is illustrated in Sect. 2.3. In Sect. 2.4, one of the most commonly used motion distribution statistics, namely the motion co-occurrence matrix, is defined. Section 2.5 describes two standard temporal texture datasets commonly used by researchers in experiments.

2.1 Image Sequence

A digital image is a collection of picture elements (pixels or pels), usually arranged in a rectangular grid. The number of pixels in each column and row of the grid constitutes the resolution (width × height) of the image, and a pixel is identified by its Cartesian coordinates in the grid. Various colour models are used to distinguish pixels numerically. Of these, the most commonly used RGB model uses three primary colour (red, green, blue) components, while the HSB model, the most intuitive to human perception, uses hue (colour), saturation (concentration of colour), and brightness (intensity) components to represent each pixel. The grayscale model uses just the intensity component; it is widely favoured by signal processing researchers to avoid the unnecessary complication of retaining colour information, especially for cases where the intensity information is sufficient, such as temporal texture classification.

Fig. 1. An 8-bit grayscale image of an eye captured in resolution 50 × 42 pixels and printed (a) dot-for-dot and (b) enlarged without altering the resolution

Resolution plays a significant role in the perceived quality of the image, especially in the context of its physical size, as evident in Fig. 1, where the same 50 × 42 pixel 8-bit grayscale image (the intensity value of each pixel is drawn from the range $[0, 2^8 - 1]$) is printed in two different sizes. Note that the resolution of an image can be altered using subsampling or supersampling with interpolation to match the physical size (not applied in Fig. 1b). But this requires extra processing, and the quality would not be as good as had the image been captured in that (altered) resolution.

Now consider a sensor located at a specific position in the three dimensional (3D) world space, capturing images (frames) of the scene, one after another, at a specified frame rate. As time goes by, the images form a sequence, which can be expressed with a brightness function $I_t(x, y)$ representing the intensity of the pixel at coordinate $(x, y)$ in the image $I$ captured at time $t$. A digital video is a fitting example of an image sequence, where images are normally captured at a high enough frame rate (e.g., 25 frames per second in PAL) that the persistence of vision (0.1 s for most human beings) can be exploited to create the illusion of motion.

2.2 Motion Estimation

In the field of signal processing, motion analysis is mainly concerned with the 2D motion in the image plane. The translational model is most frequently used in the field, assuming that the change between two successive frames is due to the motion of moving objects during the time interval between the frames. In many cases, as long as the frame rate is high enough, the assumption is valid. By motion analysis, we thus mean the estimation of this translational motion in the form of displacement or velocity vectors.

There are two kinds of techniques in 2D motion analysis: correlation and differential techniques. The former belong to the group of region matching, whereas differential techniques are used to compute pixel motion, widely known as optical flow. With region matching, the current frame is divided into non-overlapping regions of interest, and for each region the best match is searched for in the reference frame. Both optical flow and region matching techniques are now discussed in detail in the following sections.

Optical Flow

Optical flow refers to the 2D distribution of apparent velocities of the movement of intensity patterns in an image plane. In other words, an optical flow field consists of a dense velocity field with one velocity vector per pixel in the image plane. If the time interval between two successive frames is known, velocity vectors and displacement vectors can be computed from one another. In this sense, optical flow is a technique for displacement estimation.

As optical flow is caused by the movement of intensity patterns rather than by the objects' motion, 2D motion and optical flow are generally different. Imagine a uniform sphere rotating with constant speed in the scene, and assume the luminance and all other conditions do not change at all while frames are captured. As there is no change in the brightness patterns, the optical flow is zero, whereas the 2D motion field is obviously not. Thus optical flow cannot be estimated from image intensities alone unless an additional constraint, e.g., smoothness of the contour [46], is imposed. Such constraints are either difficult to implement in practice or not true over the entire image.

Apart from the above-mentioned difficulty, the estimation of motion using optical flow usually involves iterations that require a long processing time. This may generate a large amount of overhead, rendering a recognition task inefficient. Although there are some near real time optical flow estimation algorithms [2, 3], the quality of the estimated motion is not adequate to classify temporal textures accurately [20].

One obvious alternative for real time motion estimation is to estimate the approximated normal flow, which is orthogonal to the contour and is thus the gradient-parallel component of the optical flow. It takes only three partial derivatives of the spatiotemporal image intensity function I to estimate the normal flow. Although the full displacement is not recoverable, this partial flow provides sufficient information for the purpose of motion-based recognition.

Computation of normal flow from an image sequence can be explained by deriving a brightness invariance equation. If we assume that the image intensity at a pixel $(x, y)$ in the image plane remains unchanged over times $t$ and $t + \Delta t$, we may write [22, 46]

$$I_t(x, y) = I_{t+\Delta t}(x + \Delta x, y + \Delta y), \qquad (1)$$

where $\Delta t$, $\Delta x$, and $\Delta y$ are a small time interval, horizontal displacement, and vertical displacement, respectively. By expanding this equation and ignoring the higher order terms, we get

$$\Delta x \frac{\partial I}{\partial x} + \Delta y \frac{\partial I}{\partial y} + \Delta t \frac{\partial I}{\partial t} = 0, \qquad (2)$$

where $\frac{\partial I}{\partial x}$, $\frac{\partial I}{\partial y}$, and $\frac{\partial I}{\partial t}$ are the partial derivatives of the intensity function with respect to $x$, $y$, and $t$. Dividing the equation by $\Delta t$, we obtain

$$\frac{\partial I}{\partial x}\left(\frac{\Delta x}{\Delta t}\right) + \frac{\partial I}{\partial y}\left(\frac{\Delta y}{\Delta t}\right) + \frac{\partial I}{\partial t} = 0 \qquad (3)$$

$$\equiv \mathbf{v} \cdot \mathrm{grad}(I) + \frac{\partial I}{\partial t} = 0, \qquad (4)$$

where $\mathbf{v} = \left(\frac{\Delta x}{\Delta t}, \frac{\Delta y}{\Delta t}\right)$ is the optical flow velocity and $\mathrm{grad}(I) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)$ is its gradient. Without any additional constraint, it is impossible to calculate $\mathbf{v}$ from (4), as this linear equation has two unknowns: the $x$- and $y$-components of $\mathbf{v}$. This is formally known as the aperture problem. The gradient-parallel component of $\mathbf{v}$, i.e., the normal flow $\mathbf{v}_N$, can however be computed from (4) as

$$\mathbf{v}_N = \frac{-\frac{\partial I}{\partial t}}{\sqrt{\left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2}}\, \mathbf{u}, \qquad (5)$$

where $\mathbf{u}$ is the unit vector along the direction of the gradient $\mathrm{grad}(I)$. The normal flow field is fast to compute [46] and can be estimated directly, without the iterative schemes used by complete optical flow (complete flow) estimation methods [22]. Moreover, it contains both temporal and structural information on temporal textures: temporal information is related to moving edges, while spatial information is linked to the edge gradient vectors. Researchers are thus motivated to use normal flow to characterize temporal textures, as evidenced in the literature.
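As a concrete illustration, the following minimal Python sketch (our own illustrative rendering, not code from any cited paper) estimates the normal flow field of (5) from two grayscale frames, using simple finite differences for the three partial derivatives:

```python
import numpy as np

def normal_flow(frame1, frame2, eps=1e-6):
    """Sketch of eq. (5): estimate the normal flow from two grayscale
    frames with finite-difference derivatives (an assumed, simplified
    scheme rather than the derivative filters of [22, 46])."""
    I = frame1.astype(np.float64)
    Iy, Ix = np.gradient(I)                  # spatial gradients (rows = y)
    It = frame2.astype(np.float64) - I       # temporal derivative
    mag = np.sqrt(Ix**2 + Iy**2)
    vn = -It / (mag + eps)                   # signed normal-flow magnitude
    ux, uy = Ix / (mag + eps), Iy / (mag + eps)  # unit gradient direction u
    return vn * ux, vn * uy, vn              # vector field and magnitude
```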

Block Matching

The block matching motion estimation approach, where a motion vector is associated with a block of connected pixels rather than with an individual pixel, is prevalent in video coding standards such as H.26X and MPEG-1/2/4 [43, 46, 47] due to increased coding efficiency, as fewer motion vectors are coded. With this approach, a frame is partitioned into non-overlapped blocks (termed macroblocks in video coding; usually rectangular and of fixed size). Each block thus generated is assumed to move as one, i.e., all pixels in a block share the same displacement vector. For each block, its best match is found within a search window in the previous frame with maximum correlation, and the motion vector is computed from the relative displacement. Although block based motion vectors are computed with a view to improving coding efficiency, they still represent some degree of true motion, which has been successfully exploited in motion indexing of block-based videos [43], motion-based video indexing and retrieval [25], and neighbouring motion vector prediction [53]. Empirical study has also observed that the block motion's representation of 'true' motion is significant [49]. This, along with its computational efficiency, motivates some researchers [37–42] to use block motion vectors in temporal texture classification.

Fig. 2. Block motion estimation process. The motion vector of a block of size a × b pixels centred at It(x, y) is estimated by opening a search window in frame It−1 centred at It−1(x, y) and finding the block within the search window with the maximum correlation. The displacement vector from the search centre to the centre of this block gives the motion vector. In the search window, a total of (2d + 1) × (2d + 1) candidate pixels need to be examined by the full search motion estimation process

Figure 2 illustrates a block motion estimation process where an image frame It is segmented into non-overlapped rectangular blocks of a × b pixels each. In practice, square blocks of a = b = 16 are widely used. Now consider a current block centred at It(x, y). It is assumed that the block is translated as a whole; consequently, only one displacement vector needs to be estimated for this block. In order to estimate the displacement vector, a rectangular search window of (a + 2d) × (b + 2d) pixels is opened in frame It−1, centred at pixel It−1(x, y). Every distinct a × b pixel block within the search window is searched exhaustively by the full search [45] algorithm to find the best matching block having the maximum correlation with the current block in frame It. If multiple blocks have the maximum correlation, the one closest to the search centre is preferred, mainly for coding efficiency, as it results in a shorter motion vector. The inverse of correlation is usually measured using the Mean Squared Error (MSE) or Mean Absolute Error (MAE) of the block pair, where the error at each pixel position is calculated as the difference in intensity values at the co-located positions. Once the best matching block is found, the displacement of its centre from the search centre constitutes the motion vector (∆x, ∆y) of the current block, where ∆x and ∆y are drawn from the range [−d, d]. Unless an exact match is found earlier, the full search algorithm exhaustively checks all possible (2d + 1)² blocks within the search window.
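The full search procedure can be summarized in a short sketch. The helper below is illustrative only (anchoring blocks by their top-left corner and the default d value are our assumptions); it scores every candidate displacement with MAE and keeps the best match, preferring the shorter vector on ties:

```python
import numpy as np

def full_search(prev, curr, x0, y0, b=16, d=7):
    """Full-search block matching for the b x b block of `curr` with
    top-left corner (y0, x0): every displacement in [-d, d]^2 is
    scored by MAE against `prev` (a minimal sketch)."""
    block = curr[y0:y0 + b, x0:x0 + b].astype(np.float64)
    best, mv = np.inf, (0, 0)
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + b > prev.shape[0] or x + b > prev.shape[1]:
                continue  # candidate falls outside the reference frame
            mae = np.mean(np.abs(block - prev[y:y + b, x:x + b]))
            # Lower error wins; ties favour the shorter motion vector.
            if mae < best or (mae == best and dx*dx + dy*dy < mv[0]**2 + mv[1]**2):
                best, mv = mae, (dx, dy)
    return mv, best
```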

In order to reduce the search time, some alternative approaches involving logarithmic directional search, such as the Three Step Search (TSS) [46], the New TSS (NTSS) [23], and the Hexagon-shape Based Search (HEXBS) [58], are used; these normally check between 15 and 30 blocks. They avoid an exhaustive search by following the direction of the gradient of the error surface, which is assumed unimodal. As this underlying assumption is not always true, these fast algorithms are often trapped in local minima, with an impact on the quality of motion estimation. Interestingly, nowadays there are hardware devices like 'Videolink/4' [55] and software solutions like 'Video Insight' [54] that can render block based MPEG videos in real time while keeping optimal motion quality, thus making motion vectors readily available in real time, as explained in the following section.

2.3 Motion Frame Sequence

The term motion frame sequence is used quite frequently in this chapter, so we define here what we mean by motion frames. A motion frame, computed from two successive image frames (Fig. 3) using any motion estimation algorithm, is a 2D grid. Let $M_t$ denote the $t$-th motion frame. Each entry in the frame $M_t$ denotes a motion measure, which is either the motion vector or its magnitude or direction quantized to an integer value. As an example, consider the quantization process of motion magnitude using the block matching motion estimation algorithm with a maximum displacement of $\pm d$ pixels. Motion magnitude $k$ is quantized to motion measure $i$ if

$$\max(0, i - 0.5) \le k < \min(i + 0.5, d\sqrt{2}), \qquad (6)$$

where $0 \le k < d\sqrt{2}$, $d\sqrt{2}$ is the maximum possible vector length with $\pm d$ maximum displacement, and $i \in [0, Q-1]$, where $Q$ represents the number of possible motion measures.
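A minimal sketch of this quantizer, with the rounding rule of (6) written out explicitly (the function name is ours):

```python
import numpy as np

def quantize_magnitude(k, d):
    """Quantize a motion magnitude k (0 <= k < d*sqrt(2)) to the
    integer motion measure i of eq. (6): i = floor(k + 0.5) realizes
    max(0, i - 0.5) <= k < i + 0.5, and the clip enforces the
    d*sqrt(2) ceiling on the last measure (a minimal sketch)."""
    q_last = int(np.floor(d * np.sqrt(2)))  # largest measure index, Q - 1
    return min(int(np.floor(k + 0.5)), q_last)
```

For example, with d = 3 a magnitude of 4.2 maps to measure 4, the largest of the Q = 5 measures used in the worked example of Sect. 2.4.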

Fig. 3. The motion frame of the Flag sequence shown with the motion vectors superimposed on the current image frame, where motion was estimated using (a) complete flow; (b) normal flow; and (c) the block matching algorithm, respectively, using (d) two successive image frames


2.4 Motion Co-occurrence Matrix

A commonly used motion distribution statistic in the existing temporal texture characterization techniques, and also in our proposed technique, is the Motion Co-occurrence Matrix (MCM). Let $M_t(x, y)$ denote the motion measure at coordinate $(x, y)$ in the motion frame $M_t$. With pixel level motion estimation, $(x, y)$ refers to the coordinate of the corresponding pixel, whereas with block level motion estimation the pair refers to the 2D indices of the corresponding block. An MCM is a 2D histogram of motion measure pairs in the motion frames observed along a clique, defined by a 3D neighbour vector $\eta = (\eta_x, \eta_y, \eta_t)$ with $\eta_x, \eta_y, \eta_t \in \{\ldots, -1, 0, 1, \ldots\}$. Let $\Gamma_\eta$ denote the MCM along clique $\eta$. If $Q$ motion measures are used, then $\Gamma_\eta$ can be formally defined as

$$\Gamma_\eta(i, j) = \big|\{(x, y, t) \mid M_t(x, y) = i \wedge M_{t+\eta_t}(x + \eta_x, y + \eta_y) = j\}\big|, \qquad (7)$$

where $i, j \in [0, Q-1]$. A neighbourhood is identified by a set of cliques $\chi = \{(\eta_x, \eta_y, \eta_t)\}$. Cliques with $\eta_t = 0$ constitute the spatial neighbourhood, and cliques with $\eta_t = -1$ constitute the temporal neighbourhood, as illustrated in Fig. 4.
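Definition (7) translates directly into a counting routine. The sketch below is a straightforward, unoptimized rendering of (7) over a stack of motion frames:

```python
import numpy as np

def mcm(frames, eta, Q):
    """Motion co-occurrence matrix of eq. (7) for a clique
    eta = (eta_x, eta_y, eta_t), over motion frames of shape
    (T, H, W) holding integer measures in [0, Q-1] (a sketch)."""
    ex, ey, et = eta
    gamma = np.zeros((Q, Q), dtype=np.int64)
    T, H, W = frames.shape
    for t in range(T):
        if not 0 <= t + et < T:
            continue                       # temporal neighbour out of range
        for y in range(H):
            for x in range(W):
                xn, yn = x + ex, y + ey
                if 0 <= xn < W and 0 <= yn < H:
                    gamma[frames[t, y, x], frames[t + et, yn, xn]] += 1
    return gamma
```

On the four-frame, 3 × 3 example that follows, such a routine accumulates 27 pairs for Γ(0,0,−1), 24 for Γ(1,0,0), and 12 for Γ(−1,1,−1), matching the counts discussed below.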

Let us now consider a step-by-step process of computing the MCMs for an example image sequence with just five image frames. For the sake of simplicity, the resolution of these frames is assumed to be low, such that each of the resulting four motion frames has 3 × 3 motion measures, as shown in Fig. 5, estimated using block matching with a maximum displacement of ±3 pixels. The length of the motion vector is quantized to the motion measure $i$ covering the range $\max(0, i - 0.5) \le \text{vector length} < \min(i + 0.5, 3\sqrt{2})$, where $3\sqrt{2}$ is the maximum possible vector length with ±3 maximum displacement and $i = 0, 1, \ldots, 4$. The size of the MCM is then 5 × 5. Figure 6a–c presents the MCMs Γ(0,0,−1), Γ(1,0,0), and Γ(−1,1,−1), respectively. Note that while Γ(0,0,−1) and Γ(1,0,0) are computed from 27 and 24 possible pairs in the motion frame sequence along the respective clique, Γ(−1,1,−1) is computed from only 12 possible pairs, as some of the motion measures in a motion frame have no neighbour along the clique, as illustrated in Fig. 7.

Fig. 4. Neighbourhood of a motion measure location, marked in red, in motion frame Mt. Spatial and temporal neighbours are marked in green and blue, respectively

Fig. 5. An example motion frame sequence where each motion measure is the length of the corresponding motion vector rounded to the nearest integer

Fig. 6. MCMs computed from the motion frame sequence in Fig. 5: (a) Γ(0,0,−1); (b) Γ(1,0,0); (c) Γ(−1,1,−1)

2.5 Temporal Texture Database

There are two temporal texture datasets in the literature. The most commonly used Szummer dataset [51] has been available since 1996, and it was recently moved into R. Paget's database of temporal textures [30]. The Szummer dataset consists of a diverse set of temporal textures, including boiling water, waving flags, and wind-swept grass. More recently, the European FP6 Network of Excellence MUSCLE has launched a set of temporal textures known as DynTex [34]. The quality (resolution and image quality) of the DynTex sequences is better than that of the Szummer sequences.


Fig. 7. All the possible 12 neighbouring pairs along clique (−1, 1, −1) on the motion frame sequence in Fig. 5

3 Temporal Texture Characterization Techniques

The existing approaches to temporal texture characterization can be classified into one of the following groups: techniques based on motion distribution statistics, techniques computing geometric properties in the spatiotemporal domain, techniques based on spatiotemporal filtering and transforms, and model-based methods that use estimated model parameters as features. The following sections of this chapter elaborate all these categories of characterization techniques. A brief survey of temporal texture analysis techniques is available in [5].

3.1 Motion Based Techniques

In this section we focus on elaborating the existing motion based temporal texture characterization techniques. Any motion based temporal texture characterization process can, in general, be divided into three cascaded stages: motion estimation, feature extraction, and classification. All of the existing characterization techniques compute either normal flow or block motion at the motion estimation stage, so we concentrate on detailing the features computed at the feature extraction stage.

Spatial Feature-Based Technique

Direct use of the normal flow vector field for temporal texture recognition was first realized by Nelson and Polana in their study of Spatial Feature-based Texture Recognition (SFTR) [26–28, 33]. Several statistical features are examined, based on the distributions of magnitudes and directions of normal flows, as shown in Fig. 8 for the Fire sequence. Figure 8a depicts the computed normal flow field of the Fire sequence, with its magnitude (Fig. 8b) and direction (Fig. 8c).

The feature set of the SFTR technique is presented in Table 1; a computational sketch of two of these features follows the table. Non-uniformity in the direction of motion is computed from a directional histogram of eight bins by adding the differences between the histogram and the uniform distribution. The inverse coefficient of variation is computed as the ratio of the mean to the standard deviation of the motion magnitudes. Statistics of some flow features, namely estimates of the positive and negative divergence and the positive and negative curl of the motion field, are obtained from the normal flows.

Normal flow distribution features are also derived from difference statistics. These first order difference statistics are represented by four pixel level MCMs in the spatial domain: Γ(−1,0,0), Γ(−1,1,0), Γ(0,1,0) and Γ(1,1,0). For each clique used, the ratio of the number of neighbouring pixel pairs differing in direction by at most one to the number of pixel pairs differing by more than one is computed. Second order features,² namely the spatial homogeneity of the flow, are obtained from the logarithms of the resulting ratios.

Fig. 8. (a) Normal flow field of the Fire sequence: (b) magnitude plot; and (c) direction plot. Magnitude and direction plots are drawn by mapping the magnitude and direction values into 8-bit grayscale values

Table 1. Feature set of the SFTR technique

Feature ID   Feature measure
1            Non-uniformity of flow direction
2            Inverse coefficient of variation
3            Positive divergence
4            Negative divergence
5            Positive curl
6            Negative curl
7            Spatial homogeneity obtained from Γ(−1,0,0)
8            Spatial homogeneity obtained from Γ(−1,1,0)
9            Spatial homogeneity obtained from Γ(0,1,0)
10           Spatial homogeneity obtained from Γ(1,1,0)

² First order features are computed directly from the motion frames, and k-th order features are computed from (k−1)-th order features.
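As an illustration, the sketch below computes the first two Table 1 features from a normal flow field; the histogram binning details and the normalization are our assumptions, as [26–28, 33] leave room for implementation choices:

```python
import numpy as np

def sftr_scalar_features(vx, vy, bins=8, eps=1e-9):
    """Two SFTR features of Table 1 from a normal flow field (vx, vy):
    non-uniformity of the 8-bin direction histogram and the inverse
    coefficient of variation of the magnitudes (a hedged sketch)."""
    mag = np.sqrt(vx**2 + vy**2)
    ang = np.arctan2(vy, vx)                      # directions in (-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    # Sum of absolute deviations from the uniform distribution.
    non_uniformity = np.abs(p - 1.0 / bins).sum()
    inv_cv = mag.mean() / (mag.std() + eps)       # mean over std. deviation
    return non_uniformity, inv_cv
```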


This study highlighted the computational possibility of using low level spatial motion features for temporal texture recognition. However, the work lacks any mechanism to handle temporal evolution, since the studied interactions are purely spatial [32].

Spatiotemporal Clique Neighbourhood Techniques

Fablet and Bouthemy published a series of studies [1, 15–19] devoted to the recognition of temporal texture and other motion patterns. They first introduced the concept of the temporal co-occurrence matrix of normal flows. Motivated by the fact that SFTR has no mechanism to handle temporal evolution, in their early paper [1] they used standard co-occurrence features (Table 2), namely average, variance, dirac, Angular Second Moment (ASM), and contrast, obtained from the temporal MCM Γ(0,0,−1), to discriminate between temporal textures. Note that the computed features are second order features.

The temporal MCM in [1], however, fails to encode any spatial information, and the authors later developed the Spatiotemporal Clique Neighbourhood (STCN) technique [16], where the interaction between a pixel and a set of spatially adjacent temporal neighbours (Fig. 4) is encoded by computing co-occurrence matrices for each clique in either the entire temporal neighbourhood of nine cliques or a temporal neighbourhood of five cliques {(0, 0, −1), (−1, 0, −1), (0, 1, −1), (1, 0, −1), (0, −1, −1)}, to incorporate some degree of spatial information. A causal spatiotemporal free energy model is used to combine these motion co-occurrence matrices, and the underlying model is optimized by maximizing the free energy using the conjugate gradient method.

Incorporating spatial information through a set of temporal neighbours is in fact still biased towards the time domain and fails to capture any significant spatial motion distribution information. Moreover, the underlying model optimization is focussed on the free energy rather than on the feature weights and thus ultimately fails to maintain an appropriate feature weight distribution between the time and space domains, leaving room for improvement in classification accuracy.

Table 2. The feature set in [1] obtained from a temporal co-occurrence matrix of normal flows

Average:                 $\mathrm{avg} = \sum_{(i,j)} i\, P_{(0,0,-1)}(i, j)$
Variance:                $\sigma^2 = \sum_{(i,j)} (i - \mathrm{avg})^2 P_{(0,0,-1)}(i, j)$
Dirac:                   $\mathrm{dirac} = \mathrm{avg}^2 / \sigma^2$
Angular second moment:   $\mathrm{ASM} = \sum_{(i,j)} [P_{(0,0,-1)}(i, j)]^2$
Contrast:                $\mathrm{Cont} = \sum_{(i,j)} (i - j)^2 P_{(0,0,-1)}(i, j)$

Here $P_{(0,0,-1)}$ represents the normalized MCM $\Gamma_{(0,0,-1)}$.
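These five features reduce to a few weighted sums over the normalized MCM, as the following sketch shows (a direct rendering of the Table 2 formulas):

```python
import numpy as np

def cooccurrence_features(gamma):
    """Features of Table 2 from a raw motion co-occurrence matrix:
    average, variance, dirac, angular second moment, and contrast
    (a sketch; gamma is the count matrix Gamma_(0,0,-1))."""
    P = gamma / max(gamma.sum(), 1)           # normalized MCM
    i, j = np.indices(P.shape)
    avg = (i * P).sum()
    var = ((i - avg) ** 2 * P).sum()
    dirac = avg**2 / var if var > 0 else 0.0
    asm = (P**2).sum()
    contrast = ((i - j) ** 2 * P).sum()
    return avg, var, dirac, asm, contrast
```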


Spatiotemporal Synergistic Approach

With a view to combining the spatial and temporal aspects of temporal textures in a synergistic way, Peh and Cheong developed the Synergizing Spatial and Temporal Features (SSTF) technique [31, 32]. Aimed at providing a spatiotemporal analysis of the motion of objects, the magnitudes and directions of normal flows are mapped into grayscale intensity levels for subsequent analysis. Textures generated in this way are referred to as magnitude plots and directional plots for the magnitudes and directions of the normal flow, respectively. In order to trace the motion history, the magnitude and directional plots of successive motion frames are further superimposed independently. Spatiotemporal textures (Fig. 9) extended this way are referred to as the Extended Magnitude Plot (EMP) and Extended Directional Plot (EDP), for magnitudes and directions of the normal flows, respectively.
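A rough sketch of the EMP/EDP construction is given below; the chapter does not specify the superposition rule, so the global grayscale normalization and the per-pixel maximum over time used here are assumptions for illustration only:

```python
import numpy as np

def extended_plots(vx_seq, vy_seq, eps=1e-9):
    """Build extended magnitude and directional plots (EMP, EDP) in
    the spirit of SSTF: map per-frame normal-flow magnitude and
    direction to 8-bit grayscale, then superimpose the frames.
    The max-superposition is an assumed combination rule."""
    mag = np.sqrt(vx_seq**2 + vy_seq**2)          # shape (T, H, W)
    ang = np.arctan2(vy_seq, vx_seq)
    to_gray = lambda a: (255 * (a - a.min()) / (np.ptp(a) + eps)).astype(np.uint8)
    emp = to_gray(mag).max(axis=0)                # superimpose over time
    edp = to_gray(ang).max(axis=0)
    return emp, edp
```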

The feature set of the SSTF technique is presented in Table 3. A subset of the features is computed from the extended plots by Gray Level Co-occurrence Matrix (GLCM) and Fourier spectrum analysis. A GLCM is similar to a pixel level spatial-domain MCM, except that it uses grayscale intensity values instead of motion measures and involves only one frame. Conventional co-occurrence features, namely inertia, shade, correlation, and mean, are computed from the average of the co-occurrence matrices corresponding to cliques (−1,0,0), (−1,1,0), (0,1,0) and (1,1,0). The energies centred at 45° and 135° are computed from the Fourier spectrum. Note that the orders of the computed features are high.

Such a representation has the advantage of improving computational efficiency, as features need to be computed from one frame only. Merging a long sequence, however, inevitably loses a lot of characteristic spatial information and thus makes the temporal domain more dominant. Moreover, it is difficult to maintain an appropriate weight distribution between the time and space domains.

Fig. 9. Some examples of images with their extended magnitude and directional plots: (a) texture images; (b) extended magnitude plots; and (c) extended directional plots


Table 3. List of features obtained from the extended plots, EMP and EDP, in the SSTF technique

GLCM:
  Inertia: $\sum (i - j)^2 P_G(i, j)$
  Shade: $\sum (i + j - m_x - m_y)^3 P_G(i, j)$
  Correlation: $\frac{1}{\sigma_x \sigma_y} \sum i j\, P_G(i, j) - m_x m_y$
Fourier spectrum:
  Energy centred at 45°: $\sum_{22.5^\circ \le \tan^{-1}(j/i) < 67.5^\circ} |FT(i, j)|^2$
  Energy centred at 135°: $\sum_{112.5^\circ \le \tan^{-1}(j/i) < 157.5^\circ} |FT(i, j)|^2$
Difference statistics:
  Mean: $\sum k\, P_G(k)$, where $P_G(k) = \sum_{|i-j|=k} P_G(i, j)$

Here $P_G$ is the normalized GLCM, $m_x$ and $m_y$ are the GLCM means along the x and y axes respectively, $\sigma_x$ and $\sigma_y$ are the GLCM standard deviations along the x and y axes respectively, and $FT$ is the Fourier transform of the EDP or EMP.


Optimal Time–Space Ratio Technique

Direct use of block motion for temporal texture characterization was first addressed by Rahman and Murshed in the Optimal Time–Space Ratio (OTSR) technique [38]. They proposed associating a set of motion co-occurrence matrices with a temporal texture to encode the temporal and spatial dynamics separately. Disassociation of temporal and spatial features was deemed necessary to effectively control the weight distribution between these two domains, which is essential in improving the classification accuracy of any temporal texture characterisation technique using only block motion.

A temporal texture has one temporal dimension and two spatial dimensions, and in OTSR three motion co-occurrence matrices are computed: the temporal co-occurrence matrix Γ(0,0,−1) and the spatial co-occurrence matrices Γ(1,0,0) and Γ(0,1,0). The temporal co-occurrence matrix Γ(0,0,−1) is computed along the conventional frame sequence (Fig. 10a), corresponding to the direction of the time axis, thus encoding the temporal dynamics; the spatial co-occurrence matrices Γ(1,0,0) and Γ(0,1,0) are computed along grid sequences (Fig. 10b, c), corresponding to the directions of the x- and y-axes respectively, thus encoding the spatial dynamics. Note that these motion features use only first order statistics and are hence suitable to apply to block motion. Moreover, this explicit disassociation of time and space domain features successfully addresses the double jeopardy concern, making the feature extraction stage of the OTSR technique realisable in real time while aiming to improve the classification accuracy at block level or even match that at pixel level.

Fig. 10. Grid sequences along different cliques: (a) along clique (0, 0, −1); (b) along clique (1, 0, 0); (c) along clique (0, 1, 0)
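Given the mcm() helper sketched in Sect. 2.4, OTSR feature extraction reduces to three calls, as the following usage sketch shows (frames and Q as defined there):

```python
def otsr_features(frames, Q):
    # OTSR associates one temporal and two spatial MCMs with a
    # temporal texture; mcm() is the Sect. 2.4 sketch.
    return (mcm(frames, (0, 0, -1), Q),   # temporal dynamics, time axis
            mcm(frames, (1, 0, 0), Q),    # spatial dynamics, x axis
            mcm(frames, (0, 1, 0), Q))    # spatial dynamics, y axis
```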

Other Motion Based Techniques

Peteri and Chetverikov [35, 36] proposed a technique that combines normal flow features with periodicity features, in an attempt to explicitly characterize motion magnitude, directionality, and periodicity. The normal flow features used are similar to SFTR's (e.g., divergence, curl); however, a novel feature, the orientation homogeneity (Fig. 11) of the normal flow field, was also introduced. In addition, two spatiotemporal periodicity features were proposed, based on the maximal regularity, which is a measure of the spatial periodicity of an image texture. When applied to a temporal texture, the method evaluates the temporal variation of spatial periodicity. For each motion frame Mt of a temporal texture sequence, maximal regularity is computed in a sliding window. Then the largest value is selected, corresponding to the most periodic patch within the frame. This provides one largest periodicity value for each Mt. The mean and the variance of the largest periodicity value are used as features. This approach is rotation-invariant, and its periodicity-related part is affine-invariant. Although promising, the temporal regularity method still needs improvement. Like the SFTR technique, it fails to address the problem of integrating temporal information with sufficient importance.

Fig. 11. Orientation homogeneity for (a) the Escalator and (b) the Smoke sequences computed from normal flow vectors. The main orientation is indicated by the triangle and its homogeneity is proportional to the base of the triangle

The last motion based characterization technique we elaborate was proposed by Lu et al. [24]. This study is unique in that it uses complete flow vectors; in addition, acceleration vectors are also computed. The 3D structure tensor technique (with spatiotemporal gradient) is used to obtain the complete flow vector by minimising an energy function in a neighbourhood of a pixel. To reduce the effect of the aperture problem, the eigenvectors of the tensor are calculated and combined into a measure of the spatial 'cornerity' of the pixel. This measure is used as the weight in the histograms of velocity and acceleration: the higher the confidence of the velocity estimate, the larger the weight. To account for scale, a spatiotemporal Gaussian filter is applied to decompose a sequence into two spatial and two temporal resolution levels. The technique is rotation-invariant, and it provides local directionality information. However, no higher-level structural analysis (e.g., periodicity evaluation) is possible. Moreover, computing complete optical flow is highly time-consuming, and the technique is unsuitable for real time recognition.

3.2 Geometric Property Based Techniques

Geometric property based techniques focus on properties of the surfaces of motion trajectories in spatiotemporal space, derived from multiple frames of an image sequence. The motion trajectory (also called the trajectory surface) is the set of surfaces swept out by the moving contour in the spatiotemporal space of a temporal texture. The studies in this group analyse the spatiotemporal properties of the trajectory surface to characterize temporal textures. Techniques for computing such geometric properties in the spatiotemporal domain were proposed by Otsuka et al. [29], with a modification by Zhong and Sclaroff [57].


Fig. 12. (a) Tangent line on a moving contour and intersection point in 2D space; (b) tangent plane and intersection line in 3D space

In the framework of [29], trajectory surfaces are represented as a set of tangent planes of the surfaces. Tangent planes can be explained with Fig. 12. Consider two points on an object's contour in the 2D image plane (Fig. 12a). Two non-parallel tangent lines at the two points have only one point of intersection. From the spatiotemporal point of view, the contours of objects sweep out 2D surfaces, i.e., trajectory surfaces, and a tangent line sweeps out a tangent plane of the trajectory surface (Fig. 12b). The tangent plane thus characterizes the trajectory surface, and a set of spatial features regarding the shape and spatial arrangement of the contour is obtained from the tangent plane distribution of the constraint surface (Table 4).

For temporal features, the authors focused on the distribution of image velocity obtained from the intersection lines of the tangent planes. The point of intersection of the tangent lines becomes an intersection line of the tangent planes (Fig. 12b). These intersection lines have an orientation that equals the motion trajectory of the object. Image velocity is approximated by detecting the orientation of the intersection lines formed by tangent planes on the trajectory surface. Temporal features are obtained from the tangent plane distribution in the image velocity direction (Table 4).

This technique based on the tangent plane distribution is robust against noise and occlusion. It allows discontinuities in image brightness, whereas many gradient based methods require smoothness. The main drawback of the technique is that it requires large computational time and storage. Moreover, it is highly unlikely for temporal textures to have a dominant trajectory surface with indeterminate spatial and temporal extent. Even if such a surface exists, extracting it is not an easy task, and the accuracy of the technique is questionable [32].

Trajectory surfaces are obtained in [29] by image differencing followed by binary quantization and then identifying points in the image volume that belong to the trajectory surface. The quantization in fact introduces noise into this process, which motivated the authors of [57] to deal with this limitation and also to consider some second order surface features, whereas only first order surface features are used in [29]. Second order surface features are extracted from the curvatures on the trajectory isosurface generated by the temporal texture. An overview of the technique is presented in Fig. 13.


Table 4. List of features obtained from trajectory surfaces in [29]

Spatial features
  Directionality of contour arrangement: sharpness measure of the direction of tangent planes.
  Scattering of contour placement: interval of the tangent lines of contours on the image plane.

Temporal features
  Uniformity of velocity components: a measure of the diversity of motion vectors.
  Flash motion ratio: amount of very rapid motion and the abrupt appearance and disappearance of objects.
  Run length of trajectory: characterizes sparkle and patterns wherein objects repeatedly appear and disappear quite rapidly.

Fig. 13. Flowchart of the characterization technique in [57]: (a) training and (b) classification. Both stages apply 3D edge extraction and feature extraction to the input temporal texture; the training stage then performs GMM training, while the classification stage performs GMM classification


A 3D gradient filter is applied first to extract the edges at every image point in the sequence. Based on the gradient vector, likely trajectory points are obtained in the spatiotemporal space. The original spatiotemporal image sequence volume is separated into a number of small cubic voxels. The feature vectors are extracted only on the edge voxels composing the 3D edge surfaces. These features include the tangent plane direction, edge strength, and the principal curvatures. The mean of the feature vector is computed within each small cube. A Gaussian Mixture Model (GMM) is then used to model the distribution of the features in the feature space. This enables the system to capture the variations of the feature vectors at the different spatiotemporal locations of one single temporal texture.

The features in [57] are extracted from intensity volume data instead of binary data, thus avoiding the loss of information due to quantization noise. However, applying a gradient filter for edge detection at every pixel is a time consuming process, so the technique is unsuitable for real time applications.


3.3 Spatiotemporal Filtering and Transform Based Techniques

The study by Wildes and Bergen [56] presents the only method based on local spatiotemporal filtering. Qualitative classification of local motion structure into categories such as stationary, coherent, incoherent, flickering, and scintillating motion is performed by analysing the local spatiotemporal pattern, its orientation, and its energy. The correlation between the qualitative features and the character of motion is established by experimental results in [56], assuming that small and short temporal textures are considered. However, motion in different parts of a temporal texture can be different, and collectively these parts represent the dynamics of the texture. No method so far, including the study in [56], provides any guideline for combining the local qualitative values into a global description, or for characterizing the fundamental structural properties of the entire temporal texture.

There have been attempts [48] at video texture indexing using global spatiotemporal transforms. The emerging use of global spatiotemporal transforms indicates the necessity of characterizing motion at different spatiotemporal scales. Spatiotemporal wavelets can decompose motion into local and global components, according to the desired degree of detail. For example, a tree waving in the wind shows a coarse motion of the trunk, a finer motion of the branches, and a still finer motion of the leaves. The periodicities of these motions are also different, resulting in energy maxima at different scales. These effects are captured by spatiotemporal wavelets.

The Discrete Wavelet Transform (DWT) is a linear transformation tool that separates data into different frequency components and then studies each component with a resolution matched to its scale. The 3D wavelet transform (Fig. 14) applied in [48] can be viewed as an extension of spatial image texture analysis using 2D wavelets. The DWT is performed first for all image rows and then for all columns. With the 3D DWT, each temporal texture image sequence is decomposed using three iterations of a 3D wavelet filter bank. Each iteration of the wavelet filter bank results from one iteration on each spatial dimension and two iterations on the temporal dimension, and produces a reduction by a factor of four in the spatial and temporal dimensions. The wavelet decomposition is repeated by iterating on the low frequency sub-band of the wavelet filter bank until the desired degree of detail is produced.

Fig. 14. Classical 3D wavelet transform scheme

From the wavelet filter bank, two methods are used in [48] for extracting video texture descriptors. The first method computes the energy across all of the frames in order to capture the degree of temporal dynamicity; it is suitable for describing video content such as the continuous flapping of wings in a flock of birds. The second method treats each frame of the sub-bands as a different dimension in the video texture descriptor in order to capture temporally evolving texture; it is suitable for describing video content such as the rippling in a puddle that dies out over time after a stone is dropped.
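The first, energy-based descriptor can be sketched with an off-the-shelf separable 3D DWT; note that the filter bank of [48] iterates differently over the spatial and temporal axes, so the decomposition below (using the PyWavelets wavedecn routine) is only an approximation of its structure:

```python
import numpy as np
import pywt  # PyWavelets; any separable 3D DWT library would do

def wavelet_energy_signature(sequence, wavelet="haar", levels=3):
    """Energy of each 3D wavelet sub-band of an image sequence of
    shape (T, H, W), in the spirit of the first descriptor of [48]
    (an illustrative sketch, not its exact filter bank)."""
    coeffs = pywt.wavedecn(sequence.astype(np.float64), wavelet, level=levels)
    energies = {"approx": float(np.sum(coeffs[0] ** 2))}
    for lev, detail in enumerate(coeffs[1:], start=1):
        for key, band in detail.items():   # keys like 'aad', 'dda', ...
            energies[f"L{lev}-{key}"] = float(np.sum(band ** 2))
    return energies
```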

The use of spatiotemporal wavelets is highly motivated by the fact that temporal textures possess typical textural properties, such as randomness, periodicity, and directionality, that can be captured by wavelet transforms. Another argument in favour of wavelets is that the MPEG-7 multimedia standard proposes the use of Gabor wavelet features for image texture browsing and retrieval. However, a strong argument against global spatiotemporal transforms is the difficulty of providing rotation invariance, due to their direct reliance on pixel intensity values.

3.4 Model Based Techniques

Model-based temporal texture recognition techniques use a framework based on system identification theory, which estimates the parameters of a stable dynamic model. These techniques are suitable for both recognition and synthesis: recognition is performed using the estimated model parameters as features, while synthesis is performed by applying the model parameters to a seed image frame to predict future image frames. Different model based recognition techniques are discussed in the following four sections.

Spatiotemporal Auto-Regressive Model

Modelling image sequences of temporal textures using the linear Spatiotemporal Auto-Regressive (STAR) model was first proposed by Szummer and Picard [50–52]. It is a 3D extension of the popular Auto-Regressive (AR) models, which are among the best models for recognition and synthesis of image textures. In this technique, every pixel is modelled as a linear combination of neighbouring pixels lagged in time and space plus a noise term:

$$I_t(x, y) = \sum_{\forall i} A_i I_{t+\Delta t_i}(x + \Delta x_i, y + \Delta y_i) + e_t(x, y), \qquad (8)$$

where $e_t(x, y)$ is a Gaussian white noise process and the lags $\Delta x_i$, $\Delta y_i$, and $\Delta t_i$ specify the neighbourhood structure of the model. The parameters $A_i$ of the STAR model are learned by minimizing the mean squared prediction error. Only linear models are examined in this study, which models the Steam, Boiling water, and River sequences convincingly. However, the model fails to capture any rotational motion due to its direct reliance on pixel intensities, and it is also highly time consuming due to the large number of model parameters.
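Fitting (8) is an ordinary least-squares problem once the lagged neighbours are gathered into a design matrix, as the following sketch shows (boundary handling and the restriction to spatial lags within ±1 are our simplifications):

```python
import numpy as np

def fit_star(seq, lags):
    """Least-squares fit of the STAR model of eq. (8): regress each
    pixel on its lagged spatiotemporal neighbours. `seq` has shape
    (T, H, W); `lags` is a list of (dx, dy, dt) offsets with dt < 0
    and |dx|, |dy| <= 1 (boundary pixels are simply skipped)."""
    T, H, W = seq.shape
    m = max(-dt for _, _, dt in lags)            # deepest temporal lag
    rows, targets = [], []
    for t in range(m, T):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                rows.append([seq[t + dt, y + dy, x + dx] for dx, dy, dt in lags])
                targets.append(seq[t, y, x])
    X, z = np.asarray(rows, float), np.asarray(targets, float)
    a, *_ = np.linalg.lstsq(X, z, rcond=None)    # minimizes squared error
    return a                                     # one coefficient A_i per lag
```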

Auto-Regressive Moving Average Process

Criticizing the STAR model's failure to capture rotational and non-translational motions, Doretto et al. [6] proposed applying the model to a lower dimensional representation [11, 14, 44] of the image, as such representations are rotation invariant. Assuming that $\{I_t\}_{t=1,\ldots,f}$ is a sequence of images and $c_t = I_t + \epsilon_t$ is a noisy version of the image, with $\epsilon_t$ an Independent and Identically Distributed (IID) sequence drawn from distribution $p(\cdot)$, a dynamic texture is associated with an Auto-Regressive Moving Average process with Unknown input (ARMAUX)

$$\begin{cases} a_t = A_1 a_{t-1} + \cdots + A_k a_{t-k} + B e_t; \\ c_t = \varsigma(a_t) + \epsilon_t, \end{cases} \qquad (9)$$

where $a_t$ is a lower dimensional representation of image $I_t$ obtained using a filter $\varsigma$ such that $I_t = \varsigma(a_t)$, and $e_t$ is a realization from a stationary distribution $p_s(\cdot)$ with stationary invariant statistics, for some choice of matrices $A_1, \ldots, A_k, B$ and initial condition $a_0 = \xi$. Assuming $\varsigma(a_t) = P_c a_t$, with $P_c$ a set of principal components or a wavelet filter bank, the parameters of the model are inferred using Maximum Likelihood Estimation (MLE) as

$$\text{given } a_0, c_1, \ldots, c_f, \ \text{find } A, B, p_s(\cdot) = \arg\max_{A, B, p_s(\cdot)} \log p(c_1, \ldots, c_f)$$
$$\text{such that } \begin{cases} a_{t+1} = A a_t + B e_t \\ c_t = P_c a_t + \epsilon_t \end{cases} \quad \text{and} \quad e_t \overset{\text{IID}}{\sim} p_s(\cdot). \qquad (10)$$

Although the ARMAUX model achieves impressive results [6–14] in recognizing and synthesizing temporal textures, the applicability of the technique to real videos is doubtful for several reasons [20]: the technique is highly time consuming due to the large number of model parameters; moreover, it is difficult to define a similarity metric in the space of dynamic models.
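A common suboptimal closed-form identification in the spirit of [6] replaces the full MLE of (10) with a PCA of the frames followed by a least-squares fit of the state transition; the sketch below illustrates that route (not the exact estimator of [6]):

```python
import numpy as np

def fit_dynamic_texture(Y, n=20):
    """Suboptimal closed-form identification sketch: PCA of the frame
    matrix Y (pixels x frames, n <= min(Y.shape)) gives the output
    map Pc and the states a_1..a_f; the state transition A is the
    least-squares one-step predictor of the state sequence."""
    U, s, Vt = np.linalg.svd(Y - Y.mean(axis=1, keepdims=True),
                             full_matrices=False)
    Pc = U[:, :n]                         # spatial filters (principal comps.)
    states = np.diag(s[:n]) @ Vt[:n]      # states, one column per frame
    # Solve states[:, t+1] ~ A @ states[:, t] in the least-squares sense.
    M, *_ = np.linalg.lstsq(states[:, :-1].T, states[:, 1:].T, rcond=None)
    return Pc, M.T, states                # M.T is the transition matrix A
```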

Modelling Using Impulse Responses of State Variables

In order to eliminate the difficulty of defining the similarity metric, Fujita and Nayar [21] modified the approach in [44] by using impulse responses of state variables to identify temporal textures. From the viewpoint of system identification, the impulse responses capture the inherent dynamical properties, and at the same time they are very efficient to compute and compare. The recognition scheme is divided into two stages.

In the learning stage, the original image sequences are first divided into local block sequences. These blocks are labelled according to the types of textures they contain. Next, the ARMAUX model is applied to each block sequence to obtain the model parameters $A$, $B$, and $p_s(\cdot)$. Then, from the $A$ matrix of each block sequence, $n$-dimensional impulse responses are computed as

$$\kappa_{k+1} = A \kappa_k, \qquad (11)$$

where $\kappa_k \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$, and $\kappa_0 = [1, 1, \ldots, 1]^T$. The impulse responses of all the blocks that belong to the same texture (same label) are used to compute a linear space, and they are then mapped to this space to obtain trajectories.
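Given a learned transition matrix A, computing the impulse responses of (11) is a simple iteration, as sketched below:

```python
import numpy as np

def impulse_responses(A, steps=30):
    """Impulse responses of eq. (11): propagate kappa_0 = [1,...,1]^T
    through the state-transition matrix A and stack the resulting
    state vectors (a minimal sketch; `steps` is our choice)."""
    kappa = np.ones(A.shape[0])
    out = [kappa]
    for _ in range(steps):
        kappa = A @ kappa
        out.append(kappa)
    return np.stack(out)                  # shape (steps + 1, n)
```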

In the recognition stage, the model parameter matrix A of a given novel block sequence is used to compute n-dimensional impulse responses. These impulse responses are mapped to trajectories in each of the linear spaces (corresponding to the different textures) that were computed in the learning stage. Finally, to recognize the dynamic texture in the novel block sequence, a nearest point search is conducted. Despite its superiority over the basic ARMAUX model, this approach still suffers from a heavy computational load due to the large number of model parameters.

Other Model Based Techniques and Summary

Chan and Vasconcelos [4] introduced a framework for the classification of temporal textures modelled with ARMAUX models using probabilistic kernels. The framework combines the modelling power of ARMAUX models for temporal textures with the generalization guarantees, for classification, of the support vector machine classifier. This combination is achieved by deriving a new probabilistic kernel based on the KL divergence between Gauss–Markov processes. The kernels cover a large variety of video classification problems, including cases where classes can differ in both appearance and motion and cases where appearance is similar for all classes and only motion is discriminant. However, due to its use of ARMAUX models, this approach is also highly computationally expensive.

A careful scrutiny of the above-mentioned model based temporal texture characterization techniques reveals that, although impressive results are achieved using model based approaches, their application to recognition in a real world environment is limited by the heavy computational load.

4 Conclusion

In this chapter we have presented a set of studies on temporal texture analysis. More precisely, we focused on characterization techniques for temporal textures and provided a detailed categorization of the different characterization processes. In conclusion, temporal texture analysis is a novel, exciting, and developing research area. We hope that research focussed on temporal texture characterization will make a significant contribution to expanding the applicability of temporal texture analysis to many unexplored areas.

References

1. Bouthemy P., Fablet R.: Motion characterization from temporal cooccurrences of local motion-based measures for video indexing. International Conference on Pattern Recognition (ICPR). Volume 2, Brisbane, Australia (1998) 905–908.

2. Brox T., Bruhn A., Papenberg N., Weickert J.: High accuracy optical flow estimation based on a theory for warping. European Conference on Computer Vision (ECCV). Volume 4, Prague, Czech Republic (2004) 25–36.

3. Bruhn A., Weickert J., Feddern C., Kohlberger T., Schnorr C.: Real-time optic flow computation with variational methods. CAIP. Groningen, The Netherlands (2003) 222–229.

4. Chan A. B., Vasconcelos N.: Probabilistic kernels for the classification of auto-regressive visual processes. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). San Diego (2005).

5. Chetverikov D., Peteri R.: A brief survey of dynamic texture description and recognition. International Conference on Computer Recognition Systems (CORES). Rydzyna, Poland (2005).

6. Doretto G., Chiuso A., Soatto S., Wu Y. N.: Dynamic textures. International Journal of Computer Vision (IJCV). Volume 51 (2003) 91–109.

7. Doretto G., Jones E., Soatto S.: Spatially homogeneous dynamic textures. European Conference on Computer Vision (ECCV). Volume 2, Prague, Czech Republic (2004) 591–602.

8. Doretto G., Soatto S.: Towards plenoptic dynamic textures. International Workshop on Texture Analysis and Synthesis. Nice, France (2003) 25–30.

9. Doretto G., Soatto S.: Editable dynamic textures. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2, Madison, Wisconsin (2003) 137–142.

10. Doretto G., Soatto S.: Editable dynamic textures. Conference Abstracts and Applications of SIGGRAPH. San Antonio, Texas (2002) 177.

11. Doretto G.: Dynamic texture modeling. M.S. Thesis. Computer Science Department, University of California, Los Angeles, California (2002).

12. Doretto G., Soatto S.: Towards plenoptic dynamic textures. UCLA Computer Science Department Technical Report (#020043). Los Angeles, California (2002).

13. Doretto G., Soatto S.: Editable dynamic textures. UCLA Computer Science Department Technical Report (#020001). Los Angeles, California (2002).

14. Doretto G., Pundir P., Wu Y. N., Soatto S.: Dynamic textures. UCLA Computer Science Department Technical Report (#200032). Los Angeles, California (2000).

15. Fablet R., Bouthemy P.: Motion based feature extraction and ascendant hierarchical classification for video indexing and retrieval. International Conference on Visual Information Systems (1999) 221–228.

Page 322: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

314 A. Rahman and M. Murshed

16. Fablet R., Bouthemy P., Perez P.: Nonparametric motion characterization usingcasual probabilistic models for video indexing and retrieval. IEEE Transactionson Image Processing. Volume 11 (2002) 393–407.

17. Fablet R., Bouthemy P.: Motion recognition using nonparametric image motionmodels estimated from temporal and multiscale co-occurrence statistics. IEEEtransaction on pattern analysis and machine intelligence. Volume 25 (2003)1619–1624.

18. Fablet R., Bouthemy P.: Non parametric motion recognition using temporalmultiscale Gibbs models. IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR). Volume 1 (2001) 501–508.

19. Fablet R., Bouthemy P.: Motion recognition using spatio-temporal randomwalks in sequence of 2D motion-related measurements. IEEE International Con-ference on Image Processing (ICIP). Thessalonique, Greece (2001) 652–655.

20. Fazekas S., Chetverikov D.: Normal versus complete flow in dynamic texturerecognition: a comparative study. International workshop on texture analysisand synthesis at International Conference on Computer Vision (ICCV). Beijing,China (2005).

21. Fujita K., Nayar S. K.: Recognition of dynamic textures using impulse responsesof state variables. International Workshop on Texture Analysis and Synthesis(2003) 31–36.

22. Horn B., Schunck B.: Determining optical flow. Artificial Intelligence.Volume 17 (1981) 185–203.

23. Li R., Zeng B., Liou M. L.: A new three-step search algorithm for block motionestimation. IEEE Transaction on Circuits and Systems for Video Technology.Volume 4 (1994) 438–442.

24. Lu Z., Xie W., Pei J., Huang J. J.: Dynamic Texture Recognition by Spatio-Temporal Multiresolution Histograms. IEEE Workshop on Motion and VideoComputing (WACV/MOTION). Volume 2 (2005) 241–246.

25. Ma Y. F., Zhang H. J.: Motion texture: a new motion based video representa-tion. International Conference on Pattern Recognition (ICPR). Volume 2 (2002)548–551.

26. Nelson R. C., Polana R.: Recognition of motion using temporal texture. IEEEcomputer society Conference on Computer Vision and Pattern Recognition(1992) 129–134.

27. Nelson R. C., Polana R.: Qualitative recognition of motion using temporaltexture. CVGIP image understanding. Volume 56 (1992) 78–89.

28. Nelson R. C., Polana R.: Temporal Texture Analysis. DARPA Image Under-standing Workshop (1992) 555–559.

29. Otsuka K., Horikoshi T., Suzuki S., Fujii M.: Feature extraction of temporaltexture based on spatiotemporal motion trajectory. International Conferenceon Pattern Recognition (ICPR). Volume 2 (1998) 1047–1051.

30. Paget R.: Texture synthesis and analysis. http://www.vision.ee.ethz.ch/∼rpaget/links.htm (Last accessed in January, 2006).

31. Peh C.H., Cheong L.F.: Exploring video content in extended spatiotempo-ral textures. European workshop on Content-Based Multimedia Indexing.Toulouse, France (1999) 147–153.

32. Peh C. H., Cheong L. F.: Synergizing spatial and temporal texture. IEEE Trans-actions on Image Processing (2002) 1179–1191.

33. Polana R., Nelson R. C.: Temporal texture and activity ecognition. M. Shah andR. Jain, editors. Motion-Based Recognition. Kluwer, Dordrecht (1997) 87–115.

Page 323: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Temporal Texture Characterization: A Review 315

34. Peteri R., Huskies M.: DynTex: A comprehensive database of Dynamic Tex-tures. http://www.cwi.nl/projects/dyntex/ (Last accessed in June 2007).

35. Peteri R., Chetverikov D.: Dynamic texture recognition using normal flow andtexture regularity. LNCS. Iberian Conference on Pattern Recognition and ImageAnalysis (2005).

36. Peteri R., Chetverikov D.: Qualitative characterization of dynamic textures forvideo retrieval. International Conference on Computer Vision and Graphics(ICCVG). Warsaw, Poland (2004).

37. Rahman A., Murshed M.: A temporal texture characterization technique usingblock-based approximated motion measure. IEEE Transaction on Circuits andSystems for Video Technology. Volume 17 (2007) 1370–1382.

38. Rahman A., Murshed M.: Real-time temporal texture characterization usingblock-based motion co-occurrence statistics. IEEE International Conference onImage Processing (ICIP). Singapore (2004) 1593–1596.

39. Rahman A., Murshed M.: A robust optical flow estimation algorithm for tempo-ral textures. IEEE International Conference on Information Technology: Codingand Computing (ITCC). Las Vegas, USA (2005) 72–76.

40. Rahman A., Murshed M.: A motion-based approach for temporal texture syn-thesis. IEEE Region 10 Conference on Convergent Technologies (TENCON).Melbourne, Australia (2005).

41. Rahman A., Murshed M.: Feature weighting methods for abstract features ap-plicable to motion based video indexing. IEEE International Conference on In-formation Technology: Coding and Computing (ITCC). Las Vegas, USA (2004).

42. Rahman A., Murshed M.: Multi center retrieval (MCR) technique applicableto motion based video retrieval. International Conference of Computer andInformation Technology (ICCIT). Dhaka, Bangladesh (2004) 347–350.

43. Richardson I. E. G.: H.264 and MPEG-4 Video Compression. Wiley, Chichester,(2003).

44. Saisan P., Doretto G., Wu Y. N., Soatto S.: Dynamic texture recognition. IEEEComputer Society Conference on Computer Vision and Pattern Recognition(CVPR). Volume 2, Kauai, Hawaii (2001) 58–63.

45. Schodl A., Szeliski R., Salesin D., Essa I.: Video textures. ACM SIGGRAPH(2000).

46. Shi Y. Q., Sun H.: Image and Video Compression for Multimedia Engineering.CRC, Boca Raton, (2000).

47. Sikora T.: MPEG digital video-coding standards. IEEE Signal Processing Mag-azine. Volume 14 (1997) 82–100.

48. Smith J. R., Lin C. Y., Naphade M.: Video texture indexing using spatiotem-poral wavelets. IEEE International Conference on Image Processing (ICIP).Volume 2 (2002) 437–440.

49. Sorwar G., Murshed M., Dooley L. S.: Filtering of block motion vectors for usein motion-based video indexing and retrieval. IEICE Transactions (2005).

50. Szummer M., Picard R. W.: Temporal texture modeling. IEEE InternationalConference on Image Processing (ICIP). Lausanne, Switzerland (1996) 823–826.

51. Szummer M.: Temporal texture modeling. M. Engg. Thesis. MIT (1996).52. Szummer M.: Temporal texture modeling. Technical report (#346). MIT

(1995).53. Turaga D. S., Tsuhan C.: Estimation and mode decision for spatially corre-

lated motion sequences. IEEE Transaction on Circuits and Systems for VideoTechnology. Volume 11 (2001).

Page 324: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

316 A. Rahman and M. Murshed

54. Videoinsight, http://www.video-insight.com/dvr005.htm (Last accessed in2005).

55. Videolink/4, http://www.compumodules.com/security/mpeg-4-encoder.shtml(Last accessed in July 2005).

56. Wildes R. P., Bergen J. R.: Qualitative spatiotemporal analysis using an ori-ented energy representation. European Conference on Computer Vision (2000)768–784.

57. Zhong J., Sclaroff S.: Temporal texture recongnition model using 3D features.Technical report. MIT Media Lab Perceptual Computing (2002).

58. Zhu C., Chau L. P., Lin X.: Hexagon-based search pattern for fast block motionestimation. IEEE Transaction on Circuits and Systems for Video Technology.Volume 12 (2002) 349–355.


Part IV

Computational Intelligence in Multimedia Networks and Task Scheduling


Real Time Tasks Scheduling Using Hybrid Genetic Algorithm

Mitsuo Gen1 and Myungryun Yoo2

1 Graduate School of Information, Production and Systems, Waseda University, Japan, [email protected]

2 Department of Computer Science and Media Engineering, Musashi Institute of Technology, Japan, [email protected]

Summary. The objective of scheduling soft real-time tasks is to minimize the total tardiness, and scheduling these tasks on a multiprocessor system is an NP-hard problem. In this chapter, scheduling algorithms for soft real-time tasks using the genetic algorithm (GA) are introduced. GA has been known to offer significant advantages over conventional heuristics by simultaneously using several search principles and heuristics.

The objective of this study is to propose reasonable solutions to this NP-hard scheduling problem with much less difficulty than traditional mathematical methods.

Continuous task scheduling, real-time task scheduling on a homogeneous system, and real-time task scheduling on a heterogeneous system are covered in this chapter.

1 Introduction

Real-time tasks can be classified into many kinds. Some real-time tasks are invoked repetitively. For example, one may wish to monitor the speed, altitude, and attitude of an aircraft every 100 ms. This sensor information will be used by periodic tasks that control the control surfaces of the aircraft, in order to maintain stability and other desired characteristics. In contrast, there are many other tasks that are aperiodic, occurring only occasionally. Aperiodic tasks with a bounded interarrival time are called sporadic tasks. Real-time tasks can also be classified according to the consequences of their not being executed on time. Critical (or hard real-time) tasks are those whose timely execution is critical: if deadlines are missed, catastrophes occur. Noncritical (or soft real-time) tasks are, as the name implies, not critical to the application [1].

For task scheduling, the purpose of general task scheduling is fairness, which means that the computer's resources must be shared out equitably among users. However, the purpose of hard real-time task scheduling is to execute its critical control tasks by the appropriate deadlines, and the objective of scheduling soft real-time tasks is to minimize the total tardiness [2].



There are traditional scheduling algorithms for hard real-time tasks on a uniprocessor, such as the rate monotonic (RM) and earliest deadline first (EDF) [3] scheduling algorithms. They guarantee optimality in somewhat restricted environments. Several algorithms derived from RM and EDF are used for soft real-time tasks. However, these algorithms have drawbacks in resource utilization and in their pattern of degradation under overloaded situations. With the growth of soft real-time applications, the need for scheduling algorithms for soft real-time tasks is increasing. The rate regulating proportional share (rrPS) [4] and modified proportional share (mPS) [5] scheduling algorithms are designed for soft real-time tasks. However, these algorithms also cannot show graceful degradation of performance under an overloaded situation and are restricted to a uniprocessor system.

Furthermore, scheduling on a multiprocessor system is an NP-hard problem. According to Yalaoui and Chu [6], the problem of scheduling tasks on identical parallel processors to minimize the total tardiness is at least NP-hard, since Du and Leung showed that the problem is NP-hard for the single processor case [7]. Lenstra et al. also showed that the problem with two processors is NP-hard [8]. Nevertheless, the exact complexity of this problem remains open for more than two processors. Consequently, various modern heuristics-based algorithms have been proposed for practical reasons.

In this chapter, scheduling algorithms for soft real-time tasks using the genetic algorithm (GA) are introduced. GA has been known to offer significant advantages over conventional heuristics by simultaneously using several search principles and heuristics. GA has already been used for scheduling problems in manufacturing systems, such as job shop scheduling, flow shop scheduling, and machine scheduling.

The objective of this study is to propose reasonable solutions to this NP-hard scheduling problem with much less difficulty than traditional mathematical methods.

This chapter is organized into seven sections. Section 2 describes real-time task scheduling, and Sect. 3 presents the basic definitions and implementation procedure of the genetic algorithm (GA). Scheduling algorithms for continuous tasks are introduced in Sect. 4. Section 5 introduces real-time task scheduling algorithms on a homogeneous multiprocessor system, and Sect. 6 introduces real-time task scheduling algorithms on a heterogeneous multiprocessor system. Finally, Sect. 7 provides the conclusion of this chapter.

2 Real-Time Task Scheduling Problem

Real-time tasks are characterized by computational activities with timing constraints and are classified into two categories: hard real-time tasks and soft real-time tasks. In this section, the scheduling problem for real-time tasks is explained.


2.1 Hard Real-Time Task Scheduling

For a hard real-time task, violation of the timing constraints is not acceptable [9]. Failing to execute a task before its deadline may lead to catastrophic consequences in certain environments, e.g., patient monitoring systems, nuclear plant control, etc.

For hard real-time tasks, the performance of a scheduling algorithm is measured by its ability to generate a feasible schedule for a set of real-time tasks. If all the tasks start after their release times and complete before their deadlines, the schedule is feasible. Typically, there are rate monotonic (RM) and earliest deadline first (EDF) derived scheduling algorithms for hard real-time tasks [10, 11]. They guarantee optimality in somewhat restricted environments.

• Rate monotonic (RM) scheduling algorithm. RM scheduling was proposed by Liu and Layland [3]. This scheduling algorithm is based on a uniprocessor static-priority preemptive scheme and is optimal among all fixed-priority scheduling algorithms. It assigns static priorities to the tasks such that tasks with shorter periods get higher priorities. If the total utilization of the tasks is no greater than n(2^{1/n} − 1), where n is the number of tasks to be scheduled, then the RM scheduling algorithm will schedule all the tasks to meet their respective deadlines.

• Earliest deadline first (EDF) scheduling algorithm. EDF scheduling was also proposed by Liu and Layland [3]. This scheduling algorithm is based on a uniprocessor dynamic-priority preemptive scheme and is optimal among all dynamic-priority scheduling algorithms. It schedules the task with the earliest deadline first. If the total utilization of the task set is no greater than 1, the task set can be feasibly scheduled on a single processor by the EDF scheduling algorithm. (Both bounds are made concrete in the sketch following this list.)
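To make these two utilization bounds concrete, the following minimal Python sketch (the helper names are ours, not part of the cited algorithms) tests a periodic task set against the RM sufficient condition and the EDF condition:

import math

# Task sets are lists of (computation_time, period) pairs.
def utilization(tasks):
    """Total processor utilization U = sum of c_i / p_i."""
    return sum(c / p for c, p in tasks)

def rm_schedulable(tasks):
    """Sufficient RM test: U <= n * (2^(1/n) - 1) for n tasks."""
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1.0 / n) - 1)

def edf_schedulable(tasks):
    """EDF test on one processor: U <= 1."""
    return utilization(tasks) <= 1.0

tasks = [(1, 4), (2, 6), (1, 8)]       # (c_i, p_i)
print(utilization(tasks))              # about 0.708
print(rm_schedulable(tasks))           # True: 0.708 <= 3*(2^(1/3)-1) ~ 0.780
print(edf_schedulable(tasks))          # True

Note that the RM bound is only sufficient: a task set failing it may still be schedulable, whereas the EDF bound of 1 is exact on a single processor.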

There are several RM and EDF derived algorithms for soft real-time tasks. But these algorithms have some drawbacks in coping with continuous tasks among soft real-time tasks, related to resource utilization and the pattern of degradation under overloaded situations. Firstly, for continuous tasks it is not necessary for every instance of a repetitive task to meet its deadline; for soft real-time tasks, a slight violation of the time limits is not critical. Secondly, the RM and EDF scheduling algorithms require strict admission control to prevent unpredictable behavior when an overloaded situation occurs, and strict admission control may cause low utilization of resources [4].

2.2 Soft Real-Time Task Scheduling

In a soft real-time system (e.g., telephone switching systems, image processing, etc.), the usefulness of the results produced by a task decreases over time after the deadline expires, without causing any damage to the controlled environment [1].


Recently, the rate regulating proportional share (rrPS) and modified proportional share (mPS) scheduling algorithms have been designed for continuous tasks among soft real-time tasks.

• Rate regulating proportional share (rrPS) scheduling algorithm. The rrPS scheduling algorithm was proposed by Kim et al. [4]. It is based on the stride scheduler and was proposed to schedule continuous tasks.

The rate regulator, the key concept of the scheduling algorithm, prevents certain tasks from taking more resources than their share for a given period.

This algorithm considers the time dependency of continuous media, and it keeps resource allocation fair under normal scheduling conditions. Even though the rrPS scheduling algorithm has several advantages, it has some difficulties in adapting to continuous media, as follows.

First, this algorithm does not show graceful degradation of performance under overloaded conditions.

Second, this algorithm also has the possibility of avoidable context switching overhead.

• Modified proportional share (mPS) scheduling algorithm. The mPS scheduling algorithm was proposed by Yoo et al. [2]. This scheduling algorithm considers the ratio of resource allocation in both the normal condition and the overloaded condition.

This scheduling algorithm shows better performance than the rrPS scheduling algorithm, with graceful degradation of performance under overloaded conditions and less context switching. However, the computational burden and solution accuracy of mPS could be improved by a new algorithm based on the genetic algorithm (GA).

3 Hybrid Genetic Algorithm

The genetic algorithm (GA) has already been used for scheduling problems in manufacturing systems, such as job shop scheduling, flow shop scheduling, and machine scheduling. GA has been theoretically and empirically proved to provide a robust search in complex search spaces. Having been established as a valid approach to complex problems requiring effective search, GA is now finding more widespread application in business, scientific, and engineering circles. The reasons behind the growing number of applications are clear: the algorithm is computationally simple yet powerful in its search for improvement of the solution. In this section, the basic concept of GA and its extension to multiobjective optimization problems are explained.

3.1 Basic of Genetic Algorithm

The general form of GA was described by Goldberg [12]. GAs are stochastic search algorithms based on the mechanism of natural selection and natural genetics. GAs, differing from conventional search techniques, start with an initial set of random solutions called a population. Each individual in the population is called a chromosome, encoding a solution to the problem at hand. The chromosomes evolve through successive iterations, called generations. During each generation, the chromosomes are evaluated using some measures of fitness [13]. To create the next generation, new chromosomes, called offspring, are formed by either merging two chromosomes from the current generation using a crossover operator or modifying a chromosome using a mutation operator. A new generation is formed by the selection of good individuals according to their fitness values. After several generations, the algorithm converges to the best individual, which hopefully represents the optimal or a near-optimal solution for the problem. Figure 1 shows the general structure of GA [14].

Fig. 1. General structure of GA

The general implementation structure of GA is described as follows. In this procedure, P(t) and C(t) are the parents and offspring in the current generation t.

procedure 3.1: genetic algorithm
input: problem data and GA parameters
output: a best solution
begin
  t ← 0;
  initialize P(t) by encoding routine;
  evaluate P(t) by decoding routine;
  while (not termination condition) do
    crossover P(t) to yield C(t);
    mutation P(t) to yield C(t);
    evaluate C(t) by decoding routine;
    select P(t+1) from P(t) and C(t);
    t ← t+1;
  end
  output a best solution;
end


During iteration t, the GA maintains a population P(t) of solutions. Each individual represents a solution to the problem at hand and is evaluated by computing a measure of its fitness. Some individuals undergo stochastic transformations by means of genetic operators, such as crossover and mutation, to form new individuals; the new individuals, called offspring C(t), are then evaluated. A new population is formed by selecting good individuals from the parent population and the offspring population.
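As an illustration of this loop, here is a minimal, self-contained Python sketch of procedure 3.1 with roulette wheel selection. The bit-string encoding and the toy fitness function are our own placeholders, not the chapter's scheduling-specific routines:

import random

def roulette_select(population, fitness):
    """Fitness-proportional (roulette wheel) selection of one individual."""
    total = sum(fitness)
    pick, acc = random.uniform(0, total), 0.0
    for ind, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return ind
    return population[-1]

def genetic_algorithm(fitness_fn, length, pop_size=30, p_c=0.7, p_m=0.3,
                      generations=100):
    # initial population of random bit strings
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness_fn(v) for v in pop]
        offspring = []
        while len(offspring) < pop_size:
            p1 = roulette_select(pop, fits)
            p2 = roulette_select(pop, fits)
            c1, c2 = p1[:], p2[:]
            if random.random() < p_c:              # one-cut crossover
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                 # one-bit altering mutation
                if random.random() < p_m:
                    i = random.randrange(length)
                    child[i] = 1 - child[i]
            offspring += [c1, c2]
        pool = pop + offspring                     # select next generation
        pool.sort(key=fitness_fn, reverse=True)
        pop = pool[:pop_size]
    return max(pop, key=fitness_fn)

best = genetic_algorithm(sum, length=10)           # toy one-max problem
print(best)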

3.2 Multiobjective Optimization Problems

During the last two decades, genetic algorithms have received considerable attention regarding their potential as a novel approach to multiobjective optimization problems, known as evolutionary multiobjective optimization or genetic multiobjective optimization.

A multiobjective optimization problem with q objectives and m constraints is formulated as follows:

max {z1 = f1(x), z2 = f2(x), ..., zq = fq(x)}   (1)
s.t. gi(x) ≤ 0, i = 1, 2, ..., m.   (2)

The Concept of Pareto Solution

In most existing methods, the Pareto solutions are identified at each generation and used only to calculate fitness values or ranks for each chromosome. The Pareto solution is based on the nondominance concept. A given point z0 ∈ Z is nondominated if and only if there does not exist another point z ∈ Z such that, for the maximization case,

zq ≥ z0_q, for all q,   (3)
zq > z0_q, for at least one q;   (4)

if such a point exists, z0 is a dominated point in the criterion space Z.

No mechanism is provided to guarantee that the Pareto solutions generated during the evolutionary process enter the next generation. A special pool for preserving the Pareto solutions is added onto the basic structure of genetic algorithms. At each generation, the set of Pareto solutions E(t) is updated by deleting all dominated solutions and adding all newly generated Pareto solutions [15].

The overall structure of the approach is given as follows:

procedure 3.2: Pareto genetic algorithm
input: problem data and GA parameters
output: a compromised solution
begin
  t ← 0;
  initialize P(t) by encoding routine;
  objective P(t) by decoding routine;
  create Pareto E(t);
  fitness eval(P) by decoding routine;
  while (not termination condition) do
    crossover P(t) to yield C(t);
    mutation P(t) to yield C(t);
    objective C(t) by decoding routine;
    update Pareto E(P,C);
    fitness eval(P,C) by decoding routine;
    selection P(t+1) from P(t) and C(t);
    t ← t+1;
  end
  output a compromised solution
end

Adaptive Weight Approach

Gen and Cheng proposed an adaptive weights approach which utilizes useful information from the current population to readjust weights in order to obtain a search pressure toward the positive ideal point [16].

For the examined solutions at each generation, two extreme points are defined in the criteria space: the maximum extreme point z+ and the minimum extreme point z−, as follows:

z+ = {z^max_1, z^max_2, ..., z^max_q},   (5)
z− = {z^min_1, z^min_2, ..., z^min_q},   (6)

where z^max_k and z^min_k are the maximal and minimal values for objective k in the current population. Let P denote the set of the current population. For a given individual x, the maximal and minimal values for each objective are defined as follows:

z^max_k = max{fk(x) | x ∈ P}, k = 1, 2, ..., q,   (7)
z^min_k = min{fk(x) | x ∈ P}, k = 1, 2, ..., q.   (8)

The hyper parallelogram defined by the two extreme points is the minimal hyper parallelogram containing all current solutions. The two extreme points are renewed at each generation, and the maximum extreme point gradually approximates the positive ideal point. The adaptive weight for objective k is calculated by the following equation:

wk = 1 / (z^max_k − z^min_k), k = 1, 2, ..., q.   (9)


For a given individual x, the weighted-sum objective function is given by the following equation:

z(x) = Σ_{k=1}^{q} wk (zk − z^min_k)   (10)
     = Σ_{k=1}^{q} (zk − z^min_k) / (z^max_k − z^min_k)   (11)
     = Σ_{k=1}^{q} (fk(x) − z^min_k) / (z^max_k − z^min_k).   (12)

As the extreme points are renewed at each generation, the weights are renewed accordingly. Equation (10) is a hyperplane defined by the following extreme points in the current solutions:

(z^max_1, z^min_2, ..., z^min_k, ..., z^min_q),   (13)
(z^min_1, z^min_2, ..., z^max_k, ..., z^min_q),   (14)
(z^min_1, z^min_2, ..., z^min_k, ..., z^max_q).   (15)

3.3 Hybrid Multiobjective Genetic Algorithm

The convergence speed of the GA toward the optimum can be improved by adopting the acceptance probability of simulated annealing (SA). SA means the simulation of the annealing process of metal: if the temperature is lowered carefully from a high temperature during the annealing process, the melted metal will produce a crystal at 0 K. Kirkpatrick developed an algorithm that finds the optimal solution by substituting the random movement of the solution for the fluctuation of a particle in the system during the annealing process, and by making the objective function value correspond to the energy of the system, which decreases (with temporary increases governed by Boltzmann's probability) with the descent of temperature [17, 18]. Even when the fitness function values of newly produced strings are lower than those of the current strings, the newly produced ones are readily accepted in the early stages of the search process. However, in later stages, a string with a lower fitness function value is seldom accepted. The procedure of the GA improved by the probability of SA is written as follows:

procedure 3.3: Improving of GA chromosome by the probability of SA
input: parent chromosome V, proto-offspring chromosome V', temperature T, cooling rate of SA ρ
output: offspring chromosome V"
begin
  r ← random[0,1];
  ∆E ← eval(V') − eval(V);
  if (∆E > 0 ‖ r < exp(∆E/T)) then
    V" ← V';
  else
    V" ← V;
  T ← T × ρ;
  output offspring chromosome V"
end

In this procedure, V and V' denote the parent chromosome and the proto-offspring chromosome, and V" denotes the offspring chromosome produced by this procedure. T is the temperature and ρ is the cooling rate of SA.
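A minimal Python sketch of this acceptance rule, with identifiers of our own choosing, might look as follows:

import math
import random

def sa_accept(parent, proto, eval_fn, temperature, cooling_rate=0.95):
    """SA-style acceptance as in procedure 3.3: the proto-offspring
    replaces its parent if it is fitter, or with Boltzmann probability
    exp(dE / T) if it is worse. Returns (offspring, new_temperature)."""
    d_e = eval_fn(proto) - eval_fn(parent)
    if d_e > 0 or random.random() < math.exp(d_e / temperature):
        offspring = proto          # accept the (possibly worse) offspring
    else:
        offspring = parent         # keep the parent
    return offspring, temperature * cooling_rate

# Example with a toy fitness: early (hot) stages often accept worse strings.
child, T = sa_accept([1, 0, 1], [0, 0, 1], eval_fn=sum, temperature=10.0)
print(child, T)

As T decreases by the factor ρ at each call, the probability of accepting a worse offspring shrinks, which is exactly the graceful transition from exploration to exploitation described above.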

The procedure of the hybrid multiobjective GA combined with SA is written as follows:

procedure 3.4: Hybrid multiobjective GA combined with SA
begin
  t ← 0;
  initialize P(t);
  objective P(t);
  create Pareto E(t);
  fitness eval(P);
  while (not termination condition) do
    crossover P(t) to yield C(t);
    mutation P(t) to yield C(t);
    objective C(t);
    update Pareto E(P,C);
    fitness eval(P,C);
    selection P(t+1) from P(t) and C(t);
    t ← t+1;
  end
end

4 Continuous Task Scheduling

The availability of inexpensive high-performance processors has made it attractive to use multiprocessor systems for real-time applications. The programming of such multiprocessor systems presents a rather formidable problem. In particular, real-time tasks must be serviced within certain preassigned deadlines dictated by the physical environment in which the multiprocessor system operates [19].

In this section, a new scheduling algorithm for soft real-time tasks on multiprocessor systems using GA [20] is introduced. In particular, this algorithm is focused on the scheduling of continuous tasks that are periodic and nonpreemptive. The objective of this scheduling algorithm is to minimize the total tardiness.

Some drawbacks (i.e., low resource utilization and avoidable context switching overhead) of the RM [3] and EDF [3] derived algorithms for soft real-time tasks can be fixed in the introduced algorithm, which keeps not only the advantages of the RM and EDF approaches but also the strengths of GA, such as high speed, parallel searching, and high adaptability.

4.1 Continuous Task Scheduling Problem and Mathematical Model

The continuous task scheduling problem is defined as determining the execution schedule of continuous media tasks that minimizes the total tardiness under the following conditions:

• All tasks are periodic.
• All tasks are nonpreemptive.
• Only processing requirements are significant; memory, I/O, and other resource requirements are negligible.
• All tasks are independent; there are no precedence constraints.
• The deadline of a task is equal to its period.
• Systems are multiprocessor soft real-time systems.

Figure 2 graphically represents an example of a schedule for soft real-time tasks on a multiprocessor system, where i is the task index, ci is the computation time of the ith task, pi is the period of the ith task, and τij is the jth execution of the ith task.

Fig. 2. Example of continuous soft real-time task scheduling on a multiprocessor system


In Fig. 2, the serviced unit time of τ31 is 2, which is smaller than the computation time of τ31. This means that tardiness has occurred in τ31, and the tardiness is 1. The other tasks, however, keep their deadlines.

The continuous soft real-time task scheduling problem on multiprocessor systems can be formulated as follows:

min F(s) = Σ_{i=1}^{N} Σ_{j=1}^{ni} max{0, (sij + ci − dij)},   (16)

s.t. rij ≤ sij < dij, ∀i, j.   (17)

In the above equations, the notation is defined as follows:

• Indices

m : processor index, m = 1, 2, ..., M
i : task index, i = 1, 2, ..., N
j : execution index, j = 1, 2, ..., ni

• Parameters

M : total number of processors
N : total number of tasks
τij : jth execution of the ith task
ci : computation time of the ith task
pi : period of the ith task
T : scheduled time
ni : total number of executions of the ith task,

ni = [T / pi], i = 1, 2, ..., N,   (18)

rij : jth release time of the ith task,

rij = 0, j = 1, ∀i   (19)
rij = di,j−1, j = 2, 3, ..., ni, ∀i   (20)

dij : jth deadline of the ith task,

dij = rij + pi, i = 1, 2, ..., N, j = 1, 2, ..., ni   (21)

• Decision variable

sij : jth start time of the ith task

Equation (16) is the objective function and minimizes the total tardiness, as shown in Fig. 3. Equation (17) is the constraint of this problem and means that every task must start its computation between its release time and its deadline.
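As a concrete reading of (16), the following Python sketch (the data layout is ours) sums the tardiness of every task instance:

def total_tardiness(s, c, d):
    """Objective (16): instance (i, j) starts at s[i][j], needs c[i]
    time units, and must finish by deadline d[i][j]."""
    return sum(max(0, s[i][j] + c[i] - d[i][j])
               for i in range(len(c))
               for j in range(len(s[i])))

# Example: two tasks; task 0 runs twice, task 1 once. The second
# instance of task 0 finishes one unit late (4 + 2 - 5 = 1).
s = [[0, 4], [1]]      # start times s_ij
c = [2, 3]             # computation times c_i
d = [[3, 5], [6]]      # deadlines d_ij
print(total_tardiness(s, c, d))   # -> 1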


Fig. 3. Occurrence of tardiness

4.2 GA Approach

The encoding and decoding algorithms and the genetic operations, which take tasks' periods into account, are introduced below.

Encoding and Decoding

A chromosome Vk = {vl}, k = 1, 2, ..., popSize, represents the relation of tasks and processors, where popSize is the total number of chromosomes in each generation. The locus of the lth gene represents the order of the tasks and their executions, and the value of gene vl represents the number of the assigned processor. The length of a chromosome L can be calculated as follows:

L = Σ_{i=1}^{N} ni.   (22)

Figure 4 represents the structure of a chromosome for the proposed genetic algorithm. The tasks τ11, τ12, and τN1 are assigned to processors 1, 3, and 1, respectively.

The encoding and decoding procedures are as follows:

procedure 4.1: Period-based encoding
step 1: Calculate L and set l = 1. L is the length of a chromosome.
step 2: Generate a random number r from the range [0..M] for the lth gene.
step 3: Increase l by 1 and repeat steps 2–3 until l = L.
step 4: Output the chromosome and stop.

procedure 4.2: Period-based decoding
step 1: Create Sm by grouping tasks with the same processor number, m = 1, 2, ..., M. Sm is the scheduling task set on the mth processor.
step 2: Sort the tasks in Sm by increasing order of the release time rij.
step 3: Create the schedule and calculate the tardiness.
step 4: Output the schedule set and total tardiness and stop.
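A compact Python sketch of procedures 4.1 and 4.2, under an assumed flat chromosome layout (one gene per task instance, which is our own concretization, not the chapter's exact data structure):

import random

def encode(n_instances, M):
    """Period-based encoding: a random processor in [1..M] per instance."""
    return [random.randint(1, M) for _ in range(n_instances)]

def decode(chromosome, instances, c, M):
    """instances: list of (i, r_ij, d_ij) aligned with the chromosome.
    Returns (schedule, total_tardiness)."""
    tardiness, schedule = 0, []
    for m in range(1, M + 1):
        # step 1: group instances assigned to processor m
        S_m = [inst for gene, inst in zip(chromosome, instances) if gene == m]
        S_m.sort(key=lambda inst: inst[1])      # step 2: sort by release time
        t = 0                                   # processor-m clock
        for i, r, d in S_m:                     # step 3: nonpreemptive run
            start = max(t, r)
            t = start + c[i]
            tardiness += max(0, t - d)
            schedule.append((i, m, start))
    return schedule, tardiness

c = [2, 3]
instances = [(0, 0, 3), (0, 3, 6), (1, 0, 6)]   # (task, release, deadline)
v = encode(len(instances), M=2)
print(decode(v, instances, c, M=2))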

Fitness Function and Selection

Fig. 4. Structure of a chromosome

Fig. 5. Example of the mPUX

The fitness function is essentially the objective function for the problem. It provides the means of evaluating the search node and also controls the selection process [21]. The fitness function used for this GA is based on the F(s) of the schedule. Because roulette wheel selection is used, the minimization problem is converted to a maximization problem; the evaluation function is then

eval(Vk) = 1/F(s), ∀k.   (23)

Selection is the main way GA mimics evolution in natural systems. The common strategy called roulette wheel selection [14, 22] has been used.

Genetic Operators

The period unit crossover is proposed in this algorithm. This operator creates two new chromosomes (the offspring) by mating two chromosomes (the parents), which are combined as shown in Fig. 5. The periods of each task are selected by a random number j, and each offspring chromosome is built by exchanging the selected periods between the parents, where V′1 and V′2 denote offspring 1 and 2, respectively. The procedure is as follows:

procedure 4.3: Multiperiod unit Crossover (mPUX)
step 1: Generate a random number j from the range [1..ni], i = 1, 2, ..., N.
step 2: Produce offspring chromosomes by exchanging the processor numbers of task τij between the parents.
step 3: Output the offspring chromosomes and stop.
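Under the same flat chromosome layout assumed above, a hedged Python sketch of mPUX (offsets[i] marks where task i's genes start; this bookkeeping is our own):

import random

def mpux(parent1, parent2, n, offsets):
    """For each task i, pick one period j at random and swap the
    processor number of instance (i, j) between the two parents."""
    child1, child2 = parent1[:], parent2[:]
    for i in range(len(n)):
        j = random.randint(1, n[i])             # chosen period of task i
        g = offsets[i] + (j - 1)                # gene index of instance (i, j)
        child1[g], child2[g] = parent2[g], parent1[g]
    return child1, child2

n = [2, 1]                 # task 0 has 2 instances, task 1 has 1
offsets = [0, 2]
p1, p2 = [1, 1, 1], [2, 2, 2]
print(mpux(p1, p2, n, offsets))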

For another GA operator, mutation, the classical one-bit altering mutation [23] is used.


4.3 Numerical Results

For the validation of the period based genetic algorithm (pd-GA), several numerical tests are performed. The pd-GA is compared with Oh–Wu's algorithm [24] by Oh and Wu and with Monnier's algorithm by Monnier et al. [22]. Both of these algorithms use GA; however, they are designed for discrete tasks and use two-dimensional chromosomes.

For the numerical tests, tasks are generated randomly based on an exponential distribution and a normal distribution as follows. Random tasks have been used by several researchers in the past [22].

c^E_i = random value based on an exponential distribution with mean 5
c^N_i = random value based on a normal distribution with mean 5
r^E = random value based on an exponential distribution with mean c^E_i
r^N = random value based on a normal distribution with mean c^N_i
p^E_i = c^E_i + r^E
p^N_i = c^N_i + r^N,

where c^E_i and c^N_i are the computation times of the ith task based on the exponential and normal distributions, respectively, and p^E_i and p^N_i are the periods of the ith task based on the exponential and normal distributions, respectively.

The parameters were set to 0.7 for crossover (pC), 0.3 for mutation (pM), and 30 for population size (popSize). Crossover probabilities from 0.5 to 0.8 and mutation probabilities from 0.001 to 0.4 were tested, with increments of 0.05 and 0.001, respectively. For population size, from 20 to 200 individuals were tested. Each combination of parameters was tested 20 times, and the best combination was selected by the average performance of the 20 runs. Figures 6 and 7 graphically show the best results for the best parameter combination.

Numerical tests are performed with 100 tasks. Figures 6 and 7 show the comparisons of results by the three different scheduling algorithms. In these figures, the total tardiness of the pd-GA is smaller than that of the other algorithms.

Fig. 6. Comparison of results (exponential)


Fig. 7. Comparison of results (normal)

Table 1. Numerical data (total tardiness) of Figs. 6 and 7

                      Total number of processors
                      Exponential        Normal
Algorithm             8       15         8       17
Oh–Wu's algorithm     86      7          103     2
Monnier's algorithm   85      12         117     8
pd-GA                 81      0          97      0

Table 2. Comparison with other algorithms in terms of better, worse, and equal performance (exponential)

                      pd-GA
Algorithm             <    =    >    Total
Oh–Wu's algorithm     2    9    9    20
Monnier's algorithm   1    8    11   20

Table 3. Comparison with other algorithms in terms of better, worse, and equal performance (normal)

                      pd-GA
Algorithm             <    =    >    Total
Oh–Wu's algorithm     2    8    10   20
Monnier's algorithm   0    8    12   20

Table 1 shows the numerical data of Figs. 6 and 7. Tables 2 and 3 compare the results in terms of better, worse, and equal performance. In Table 2, pd-GA performed better than Oh–Wu's algorithm in nine cases and better than Monnier's algorithm in 11 cases. In Table 3, pd-GA performed better than Oh–Wu's algorithm in 10 cases and better than Monnier's algorithm in 12 cases.


5 Real-Time Task Scheduling in Homogeneous Multiprocessor

The optimal assignment of tasks to a multiprocessor is, in almost all practical cases, an NP-hard problem. Monnier et al. presented a GA implementation to solve a real-time nonpreemptive task scheduling problem [22]. The cost of a schedule is the sum of the tardiness of tasks without any successor. Its only objective is to find a zero-tardiness schedule. This approach has a weakness in that the deadline constraints of tasks with successors are not considered. These algorithms have only one objective, such as minimizing cost, end time, or total tardiness.

Oh and Wu presented a GA for scheduling nonpreemptive soft real-time tasks on a multiprocessor [24]. They deal with two objectives: minimizing the total tardiness and the total number of processors used. However, this algorithm does not address the conflict between objectives, the so-called Pareto optimum, and leaves some questions about its simulation.

In this section, a new scheduling algorithm for nonpreemptive soft real-time tasks on a multiprocessor without communication time, using a multiobjective genetic algorithm (moGA), is introduced. The objectives of this scheduling algorithm are to minimize the total tardiness and the total number of processors used. For these objectives, this algorithm is combined with the adaptive weight approach (AWA), which utilizes useful information from the current population to readjust weights for obtaining a search pressure toward the positive ideal point [23].

5.1 Soft Real-Time Task Scheduling Problem (sr-TSP) and Mathematical Model

The problem of scheduling the tasks of a precedence and timing constrained task graph on a set of homogeneous processors is considered in a way that simultaneously minimizes the number of processors used and the total tardiness, under the following conditions:

• All tasks are nonpreemptive.
• Every processor processes only one task at a time.
• Every task is processed on one processor at a time.
• Only processing requirements are significant; memory, I/O, and other resource requirements are negligible.

The problem is formulated under the following assumptions: the computation time and deadline of each task are known, and the time unit is an artificial time unit. The soft real-time task scheduling problem (sr-TSP) is formulated as follows:

min f1 = M,   (24)

min f2 = Σ_{i=1}^{N} max{0, t^S_i + ci − di},   (25)

s.t. t^E_i ≤ t^S_i ≤ di, ∀i,   (26)

t^E_i ≥ t^E_j + cj, τj ∈ pre(τi), ∀i,   (27)

1 ≤ M ≤ N.   (28)

In the above equations, the notation is defined as follows:

• Indices

i, j : task index, i, j = 1, 2, ..., N
m : processor index, m = 1, 2, ..., M

• Parameters

G = (T, E) : task graph
T = {τ1, τ2, ..., τN} : a set of N tasks
E = {eij}, i, j = 1, 2, ..., N, i ≠ j : a set of directed edges among the tasks representing precedence
τi : ith task, i = 1, 2, ..., N
pm : mth processor, m = 1, 2, ..., M
ci : computation time of task τi
di : deadline of task τi
pre*(τi) : set of all predecessors of task τi
suc*(τi) : set of all successors of task τi
pre(τi) : set of immediate predecessors of task τi
suc(τi) : set of immediate successors of task τi
t^E_i : earliest start time of the ith task,

t^E_i = 0 if there is no τj with eji ∈ E; otherwise
t^E_i = max_{τj ∈ pre*(τi)} {t^E_j + cj}, ∀i   (29)

t^L_i : latest start time of the ith task,

t^L_i = di − ci if there is no τj with eij ∈ E; otherwise
t^L_i = min{ min_{τj ∈ suc*(τi)} {t^L_j − cj}, di − ci }, ∀i   (30)

• Decision variables

t^S_i : real start time of the ith task
M : total number of processors used


Equations (24) and (25) are the objective functions of this scheduling problem: (24) minimizes the total number of processors used, and (25) minimizes the total tardiness of the tasks. The constraint conditions are shown in (26) to (28). Equation (26) means that a task can start after its earliest start time but before its deadline. Equation (27) defines the earliest start time of a task based on the precedence constraints. Equation (28) bounds the number of processors used.
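For illustration, a small Python sketch (the data layout is our own) of the earliest-start-time recursion in (29):

def earliest_start_times(n, edges, c):
    """(29): tE_i = 0 for source tasks, else the maximum over all
    predecessors j of tE_j + c_j. edges: list of (j, i) meaning task j
    precedes task i; task ids are assumed topologically ordered."""
    preds = {i: [] for i in range(n)}
    for j, i in edges:
        preds[i].append(j)
    tE = [0] * n
    for i in range(n):
        if preds[i]:
            tE[i] = max(tE[j] + c[j] for j in preds[i])
    return tE

c = [2, 3, 1, 2]
edges = [(0, 2), (1, 2), (2, 3)]           # 0 -> 2, 1 -> 2, 2 -> 3
print(earliest_start_times(4, edges, c))   # [0, 0, 3, 4]

The latest start times of (30) can be computed symmetrically by a backward pass over the same graph.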

5.2 GA Approach

Several new techniques are proposed in the encoding and decoding algorithms of the genetic string, and the genetic operations are introduced for discussion.

Encoding and Decoding

A chromosome Vk, k = 1, 2, ..., popSize, represents one of all the possible mappings of the tasks onto the processors, where popSize is the total number of chromosomes in a generation. A chromosome Vk is partitioned into two parts u(·) and v(·): u(·) contains the scheduling order and v(·) the allocation information. The length of each part is the total number of tasks. The scheduling order part should be a topological order with respect to the given task graph that satisfies the precedence relations. The allocation information part denotes the processor to which each task is allocated.

The encoding procedure is composed of two strategies: strategy I for u(·) and strategy II for v(·). The procedures are written as follows:

procedure 5.1: Encoding Strategy I for sr-TSP
input: task graph data set
output: u(·)
begin
  l ← 1, w ← φ;
  while (T ≠ φ)
    w ← w ∪ arg{τi | pre*(τi) = φ, ∀i};
    T ← T − {τi}, i ∈ w;
    while (w ≠ φ)
      j ← random(w);
      u(l) ← j;
      l ← l+1;
      w ← w − {j};
      pre*(τi) ← pre*(τi) − {τj}, ∀i;
    end
  end
  output u(·);
end


Fig. 8. Example of encoding strategy I procedure

Figure 8 represents an example of the encoding strategy I procedure.
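A hedged Python sketch of this strategy, generating u(·) as a random topological order of the task graph (function name and data layout are ours):

import random

def encode_order(n, edges):
    """Random topological order; edges: list of (j, i) with task j
    preceding task i, so precedence is always respected."""
    preds = {i: set() for i in range(n)}
    for j, i in edges:
        preds[i].add(j)
    remaining, u = set(range(n)), []
    while remaining:
        # tasks whose not-yet-scheduled predecessor set is empty
        ready = [i for i in remaining if not (preds[i] & remaining)]
        random.shuffle(ready)            # random choice among ready tasks
        for j in ready:
            u.append(j)
            remaining.discard(j)
    return u

edges = [(0, 2), (1, 2), (2, 3)]
print(encode_order(4, edges))   # e.g. [1, 0, 2, 3]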

procedure 5.2: Encoding Strategy II for sr-TSP
input: task graph data set, u(·), α, β, M, where

  M = M(k−1), if 1 < k ≤ popSize
  M = |subgraph|, if k = 1

output: v(·), Mk
begin
  l ← 1, tm ← 0, idle ← 0;
  while (l ≠ N)
    m ← random[1, M];
    i ← u(l);
    if (tm < t^E_i) then
      t^S_i ← t^E_i;
      idle ← idle + (t^S_i − tm);
    else
      t^S_i ← tm;
    if ((di is not defined && t^S_i > t^L_i) ‖ t^S_i > di) then
      if (idle/ci < α) then
        M ← M + 1;
        m ← M;
        idle ← idle + t^E_i;
        tm ← t^E_i + ci;
      else
        idle ← max{0, (idle − ci)};
    else
      tm ← t^S_i + ci;
    v(l) ← m;
    l ← l + 1;
    idle ← idle + Σ(max{tm} − tm);
  end
  while (idle / (M × max{tm}) > β)
    M ← M − 1;
    idle ← idle − idle/(M × max{tm});
  end
  output v(·), Mk;
end

In the encoding strategy II procedure, α and β are boundary constants for deciding to increase and decrease the number of processors, respectively.

Figure 9 represents an example of the encoding strategy II procedure. The decoding procedure is as follows:

procedure 5.3: Decoding for sr-TSP
input: task graph data set, chromosome u(·), v(·)
output: schedule set S, total number of processors used f1, total tardiness of tasks f2
begin
  l ← 1, tm ← 0 ∀m, idlem ← φ ∀m, f1 ← 0, f2 ← 0, S ← φ;
  while (l ≠ N) do
    i ← u(l);
    m ← v(l);
    if (tm = 0) then f1 ← f1 + 1;
    IS*, IF* ← find{IS, IF | (IS, IF) ∈ idlem, IS = di};
    if (IS* exists && tm > t^L_i) then insert(i);
    else start(i);
    add idle();
    f2 ← f2 + max{0, (t^S_i + ci − di)};
    S ← S ∪ {(i, m : t^S_i − t^F_i)};
    l ← l + 1;
  end
  output S, f1, f2;
end

where insert(i) means inserting τi into an idle time slot if τi is computable within it, start(i) means assigning τi after the maximum finish time of all tasks assigned to pm, and add idle() means adding an idle time slot to the idle time list if one occurs. IS is the start time of an idle duration, IF is the end time of an idle duration, idlem is the list of idle times, and tm is the maximum finish time of all tasks assigned to pm.

Fig. 9. Example of encoding strategy II procedure

Fig. 10. Example of decoding procedure

Figure 10 represents an example of the decoding procedure with the chromosome in Figs. 8 and 9.

Evaluation Function and Selection

Multiobjective optimization problems have been receiving growing interest from researchers with various backgrounds since the early 1960s. Recently, GAs have received considerable attention as a novel approach to multiobjective optimization problems, resulting in a fresh body of research and applications known as genetic multiobjective optimization [25].

The adaptive weight approach (AWA) [23], which utilizes useful information from the current population to readjust weights and obtain a search pressure toward the positive ideal point, is combined into this scheduling algorithm.

The evaluation function is designed as follows:

eval(Vk) = 1/F(Vk)   (31)
         = 1 / ( Σ_{q=1}^{2} fq(Vk) / (f^max_q − f^min_q) ).   (32)

For selection, the common strategy called roulette wheel selection [14, 22] has been used.

GA Operators

The one-cut crossover is used. This operator creates two new chromosomes (the offspring) by mating two chromosomes (the parents). The one-cut crossover procedure is written as follows:

procedure 5.4: One-cut Crossover
input: parent chromosomes u1(·), v1(·), u2(·), v2(·)
output: proto-offspring chromosomes u1'(·), v1'(·), u2'(·), v2'(·)
begin
  r ← random[1, N];
  u1'(·) ← u1(·);
  v1'(·) ← v1[1:r] // v2[r+1:N];
  u2'(·) ← u2(·);
  v2'(·) ← v2[1:r] // v1[r+1:N];
  output offspring chromosomes u1'(·), v1'(·), u2'(·), v2'(·);
end

where u'(·) and v'(·) are proto-offspring chromosomes. Figure 11 represents an example of the one-cut crossover procedure.

For another GA operator, mutation, the classical one-bit altering mutation [21] is used.

5.3 Validation

To validate the proposed moGA, several numerical tests are performed. The introduced moGA is compared with Monnier-GA by Monnier et al. [22] and Oh–Wu's algorithm by Oh and Wu [24]. Numerical tests are performed with randomly generated task graphs.


Fig. 11. Example of one-cut crossover

Table 4. Computation results of the three algorithms

Terms                               Monnier-GA   Oh–Wu's algorithm   moGA
# of processors M                   38           37                  32
makespan                            149          157                 163
computing times (msec)              497          511                 518
average utilization of processors   0.447582     0.453392            0.567352

The P-Method [26] is used for generating the task graphs. The P-Method of generating a random task graph is based on the probabilistic construction of an adjacency matrix of a task graph. Element aij of the matrix is defined as 1 if there is a precedence relation from τi to τj; otherwise, aij is zero. An adjacency matrix is constructed with all its lower triangular and diagonal elements set to zero. Each of the remaining upper triangular elements of the matrix is examined individually as part of a Bernoulli process with parameter e, which represents the probability of a success. For each element, when the Bernoulli trial is a success, the element is assigned a value of one; for a failure, the element is given a value of zero. The parameter e can be considered to be the sparsity of the task graph. With this method, a probability parameter of e = 1 creates a totally sequential task graph, and e = 0 creates an inherently parallel one. Values of e between these two extremes generally produce task graphs with intermediate structures.
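A direct Python sketch of this generator (the function name is ours):

import random

def p_method(n, e):
    """P-Method: return an n x n adjacency matrix where a[i][j] = 1
    iff task i precedes task j; each upper-triangular entry is an
    independent Bernoulli trial with success probability e."""
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):      # lower triangle and diagonal stay 0
            if random.random() < e:
                a[i][j] = 1
    return a

# e = 1 yields a totally sequential graph, e = 0 an inherently parallel one.
for row in p_method(5, 0.3):
    print(row)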

Tasks' computation times and deadlines are generated randomly based on an exponential distribution, and the GA parameters are the same as those in Sect. 4.

Numerical tests are performed with 100 tasks. Table 4 shows the comparison of results from the three different scheduling algorithms. No tardiness occurs in any of them. The computing time of the proposed moGA is slightly longer than those of the other two; however, the number of processors used is smaller, and the variance of the processor utilization rate of moGA is more desirable than those of the others.


Fig. 12. Pareto solution

Figure 12 represents the Pareto solutions of moGA and those of Oh–Wu's algorithm. In this figure, the Pareto solution curve of moGA is closer to the ideal point than that of Oh–Wu's algorithm.

6 Real-Time Task Scheduling in Heterogeneous Multiprocessor System

In a heterogeneous multiprocessor system, task scheduling is more difficult than in a homogeneous multiprocessor system. Recently, several approaches based on the genetic algorithm (GA) have been proposed. Theys et al. presented a static scheduling algorithm using GA on a heterogeneous system [27], and Page et al. presented a dynamic scheduling algorithm using GA on a heterogeneous system [28]. Dhodhi et al. presented a new encoding method of GA for task scheduling on a heterogeneous system [29]. However, these algorithms are designed for general tasks without time constraints.

In this section, a new scheduling algorithm for nonpreemptive tasks with a precedence relationship in a soft real-time heterogeneous multiprocessor system [30] is introduced.

6.1 Soft Real-Time Task Scheduling Problem (sr-TSP) and Mathematical Model

The problem of scheduling the tasks of a precedence and timing constrained task graph on a set of heterogeneous processors is considered in a way that minimizes the total tardiness F(x, t^S). The conditions are the same as those in Sect. 5.

The soft real-time task scheduling problem on a heterogeneous multiprocessor system to minimize the total tardiness is formulated as follows:

min F(x, t^S) = Σ_{i=1}^{N} max{0, Σ_{m=1}^{M} (t^S_i + cim − di) · xim},   (33)

s.t. t^E_i ≤ t^S_i ≤ di, ∀i,   (34)

t^E_i ≥ t^E_j + Σ_{m=1}^{M} cjm · xjm, τj ∈ pre(τi), ∀i,   (35)

Σ_{m=1}^{M} xim = 1, ∀i,   (36)

xim ∈ {0, 1}, ∀i, m.   (37)

In the above equations, the notation is defined as follows:

• Indices

i, j : task index, i, j = 1, 2, ..., N
m : processor index, m = 1, 2, ..., M

• Parameters

G = (T, E) : task graph
T = {τ1, τ2, ..., τN} : a set of N tasks
E = {eij}, i, j = 1, 2, ..., N, i ≠ j : a set of directed edges among the tasks representing the precedence relationship
τi : ith task, i = 1, 2, ..., N
eij : precedence relationship between task τi and task τj
pm : mth processor, m = 1, 2, ..., M
cim : computation time of task τi on processor pm
di : deadline of task τi
pre*(τi) : set of all predecessors of task τi
suc*(τi) : set of all successors of task τi
pre(τi) : set of immediate predecessors of task τi
suc(τi) : set of immediate successors of task τi
t^E_i : earliest start time of task τi,

t^E_i = 0 if there is no τj with eji ∈ E; otherwise
t^E_i = max_{τj ∈ pre*(τi)} {t^E_j + Σ_{m=1}^{M} cjm · xjm}, ∀i   (38)

t^F_i : finish time of task τi,

t^F_i = min{t^S_i + Σ_{m=1}^{M} cim · xim, di}, ∀i   (39)

• Decision variables

t^S_i : real start time of the ith task τi


Fig. 13. Time chart of sr-TSP

xim = 1 if processor pm is selected for task τi, and 0 otherwise.   (40)

Equation (33) is the objective function of this scheduling problem; it minimizes the total tardiness of the tasks. The constraint conditions are shown in (34) to (37). Equation (34) means that a task can start after its earliest start time but before its deadline. Equation (35) defines the earliest start time of a task based on the precedence constraints. Equation (36) means that every task is processed on exactly one processor at a time. Figure 13 represents the time chart of sr-TSP.

6.2 GA Approach

The solution algorithm is based on the genetic algorithm (GA). Several new techniques are proposed in the encoding and decoding algorithms of the genetic string, and the genetic operations are introduced for discussion.

Encoding and Decoding

A chromosome Vk, k = 1, 2, ..., popSize, represents one of all the possible mappings of the tasks onto the processors, where popSize is the total number of chromosomes in a generation. A chromosome Vk is partitioned into two parts u(·) and v(·): u(·) contains the scheduling order and v(·) the allocation information. The length of each part is the total number of tasks. The scheduling order part should be a topological order with respect to the given task graph that satisfies the precedence relationship. The allocation information part denotes the processor to which each task is allocated.

The encoding procedure for sr-TSP is written as follows:

procedure 6.1: Encoding for sr-TSP
input: task graph data set, total number of processors M
output: u(·), v(·)
begin
  l ← 1, W ← φ;
  while (T ≠ φ)
    W ← W ∪ arg{τi | pre*(τi) = φ, ∀i};
    T ← T − {τi}, i ∈ W;
    while (W ≠ φ)
      j ← random(W);
      u(l) ← j;
      W ← W − {j};
      pre*(τi) ← pre*(τi) − {τj}, ∀i;
      m ← random[1:M];
      v(l) ← m;
      l ← l + 1;
    end
  end
  output u(·), v(·);
end

where W is a temporary working data set for tasks without predecessors. In the encoding procedure, feasible solutions are generated by respecting the precedence relationships of the tasks, and the allocated processor is selected randomly.

The decoding procedure is written as follows:

procedure 6.2: Decoding for sr-TSP
input: task graph data set, chromosome u(·), v(·)
output: schedule set S, total tardiness of tasks F
begin
  l ← 1, F ← 0, S ← ∅;
  while (l ≤ N)
    i ← u(l);
    m ← v(l);
    if (exist suitable idle time)
      then insert(i);
    start(i);
    update_idle();
    F ← F + max{0, (tSi + cim − di)};
    S ← S ∪ {(i, m : tSi − tFi)};
    l ← l + 1;
  end
  output S, F;
end

Here insert(i) inserts τi into an idle-time slot if τi is computable within that idle time; start(i) computes the real start time tSi and the finish time tFi of the ith task; and update_idle() updates the list of idle times whenever a new idle-time interval occurs. The objective value F(x, tS) and the schedule set S are generated through this procedure.
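The following simplified Python sketch mirrors procedure 6.2 under the assumption (ours, for brevity) that no idle-time insertion is performed, i.e. each processor simply appends its tasks in chromosome order; F accumulates the tardiness max{0, tSi + cim − di}:

def decode(u, v, pre, c, d):
    free = {}     # next free instant of each processor
    finish = {}   # finish time of each completed task
    S, F = [], 0.0
    for i, m in zip(u, v):
        ready = max((finish[j] for j in pre[i]), default=0.0)  # precedence constraint
        tS = max(ready, free.get(m, 0.0))                      # processor availability
        tF = tS + c[i][m]
        free[m] = finish[i] = tF
        F += max(0.0, tF - d[i])   # tardiness contribution of task i
        S.append((i, m, tS, tF))
    return S, F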


Evaluation Function and Selection

The fitness function is essentially the objective function for the problem. It provides a means of evaluating the search nodes, and it also controls the selection process [23, 25].

The fitness function is based on the F(x, tS) of the schedule. The evaluation function used is then

eval(Vk) = 1/F(x, tS),  ∀k   (41)

Selection is the main way a GA mimics evolution in natural systems: the fitter an individual is, the higher its probability of being selected. For selection, the commonly used strategy called roulette wheel selection [14, 22] has been adopted.
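A standard roulette-wheel selection routine can be sketched in Python as follows (an illustration of ours; the chapter only names the strategy), drawing each chromosome with probability proportional to eval(Vk):

import random

def roulette_select(population, fitness):
    total = sum(fitness)
    r = random.uniform(0.0, total)
    acc = 0.0
    for individual, f in zip(population, fitness):
        acc += f
        if acc >= r:
            return individual
    return population[-1]   # guard against floating-point round-off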

GA Operators

For crossover, the one-cut crossover of Sect. 5 is used. For the other GA operator, mutation, the classical one-bit altering mutation [21] is used.

Improving Convergence by the Probability of SA

This scheduling algorithm incorporates the method, introduced in Sect. 2, for improving convergence by means of the acceptance probability of SA.
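The acceptance criterion borrowed from SA can be sketched as the usual Metropolis rule (a minimal illustration; the chapter's exact cooling schedule is not reproduced here), where a worse trial solution is accepted with a probability that decreases with the temperature:

import math, random

def sa_accept(F_old, F_new, temperature):
    # Minimization of F: always accept an improvement, otherwise accept
    # with probability exp((F_old - F_new) / temperature) < 1.
    if F_new <= F_old:
        return True
    return random.random() < math.exp((F_old - F_new) / temperature)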

6.3 Validation

To validate the proposed hybrid genetic algorithm combined with simulated annealing (hGA+SA), several numerical tests were performed. The hGA+SA is compared with Monnier's GA and with the proposed simple GA, which is not combined with SA. Monnier's GA addresses homogeneous multiprocessor systems, whereas the hGA+SA is designed for heterogeneous multiprocessor systems; as there are no other algorithms targeting heterogeneous multiprocessor systems, the hGA+SA is compared with Monnier's GA on a heterogeneous multiprocessor system. Monnier's GA was proposed by Monnier, Beauvais and Deplanche [22]. This algorithm, based on a simple GA, uses a linear fitness normalization technique for evaluating chromosomes. Linear fitness normalization is effective for increasing competition between similar chromosomes; however, the method is limited to special problems with similar chromosomes. Moreover, this algorithm does not use the insertion method: even when idle time exists, tasks cannot be executed in it.

Numerical tests are performed with randomly generated task graphs; the P-Method [26] is used for task-graph generation. The tasks' computation times and deadlines are generated randomly based on an exponential distribution. The parameters of the GA are the same as those of Sect. 4.


Fig. 14. Comparison of the three algorithms (Monnier's GA, simple GA, hGA+SA) for F(x, tS): total tardiness versus total number of processors

Table 5. Comparison of the three algorithms

Terms                               Monnier's GA   Simple GA   hGA+SA
# of processors M                   13             13          12
Makespan                            123            120         132
Computing time (msec)               243            245         338
Average utilization of processors   0.4334         0.4375      0.5702

Numerical tests are performed with 100 tasks. Figure 14 shows the comparison of the three algorithms for F(x, tS); the F(x, tS) of hGA+SA is smaller than that of either of the other algorithms.

In Table 5, terms such as makespan, computing time and processor utilization are compared at the total number of processors for which no tardiness occurs. The total number of processors without tardiness is smaller for hGA+SA than for the other algorithms, and the average processor utilization of hGA+SA is more desirable than those of the others.

7 Conclusions

In this chapter, several scheduling algorithms for soft real-time tasks using genetic algorithms (GA) were introduced.

Several algorithms derived from rate monotonic (RM) and earliest deadline first (EDF) scheduling for hard real-time tasks, as well as scheduling algorithms such as rate regulating proportional share (rrPS) and modified proportional share (mPS), have been used for soft real-time tasks. However, these algorithms have some drawbacks in resource utilization and in their pattern of degradation under overloaded situations. Furthermore, scheduling on multiprocessor systems is an NP-hard problem.

The algorithms introduced in this chapter use GA, which is known to offer significant advantages over conventional heuristics by using several search principles and heuristics simultaneously.

In the hybrid GA (hGA) combined with simulated annealing (SA), the convergence of the GA is improved by introducing the acceptance probability of SA as the criterion for accepting a new trial solution. This hybridization does not hurt the inherent advantages of GA, yet finds more accurate solutions in the later stages of the search process.

A multiobjective GA for soft real-time task scheduling was also introduced. Not only the minimization of total tardiness, but also the minimization of the total number of processors used and of the makespan are taken into consideration. However, since these objectives are in conflicting (trade-off) relations, the Pareto optimum concept is introduced into the solution process.

In conclusion, from the introduced scheduling algorithms and their experimental results, we can see that scheduling using GA is a very promising approach for obtaining relatively satisfactory solutions to the soft real-time task scheduling problem, which belongs to the class of difficult NP-hard problems. All of the techniques developed for these problems in this research are useful and applicable to other scheduling problems. This line of research will be extended to logistics and process planning problems.

References

1. Krishna, C. M. and G. S. Kang (1997) Real-Time Systems, McGraw-Hill, New York.

2. Yoo, M. R., B. C. Ahn, D. H. Lee and H. C. Kim (2001) A New Real-Time Scheduling Algorithm for Continuous Media Tasks, Proc. of Computers and Signal Processing, pp. 26–28.

3. Liu, C. L. and J. W. Layland (1973) Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment, Journal of the ACM, vol. 20, no. 1, pp. 46–59.

4. Kim, M. H., H. G. Lee and J. W. Lee (1997) A Proportional-Share Scheduler for Multimedia Applications, Proc. of Multimedia Computing and Systems, pp. 484–491.

5. Yoo, M. R. (2002) A Scheduling Algorithm for Multimedia Process, Ph.D. dissertation, University of YeoungNam (in Korean).

6. Yalaoui, F. and C. Chu (2002) Parallel Machine Scheduling to Minimize Total Tardiness, International Journal of Production Economics, vol. 76, no. 3, pp. 265–279.

7. Du, J. and J. Leung (1990) Minimizing Total Tardiness on One Machine is NP-hard, Mathematics of Operations Research, vol. 15, pp. 483–495.

8. Lenstra, J. K., R. Kan and P. Brucker (1977) Complexity of Machine Scheduling Problems, Annals of Discrete Mathematics, pp. 343–362.

9. Zhu, K., Y. Zhuang and Y. Viniotis (2001) Achieving End-to-End Delay Bounds by EDF Scheduling without Traffic Shaping, Proc. of 20th Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 1493–1501.

10. Diaz, J. L., D. F. Garcia and J. M. Lopez (2004) Minimum and Maximum Utilization Bounds for Multiprocessor Rate Monotonic Scheduling, IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 7, pp. 642–653.

11. Bernat, G., A. Burns and A. Llamosi (2001) Weakly Hard Real-Time Systems, IEEE Transactions on Computers, vol. 50, no. 4, pp. 308–321.

12. Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley.

13. Fogel, D. and A. Ghozeil (1996) Using Fitness Distributions to Design More Efficient Evolutionary Computations, in Fogel, D., editor, Proc. of the Third IEEE Conference on Evolutionary Computation, IEEE Press, Nagoya, Japan, pp. 11–19.

14. Gen, M. and R. Cheng (1997) Genetic Algorithms & Engineering Design, John Wiley & Sons.

15. Xu, H. and G. Vukovich (1998) Fuzzy Evolutionary Algorithms and Automatic Robot Trajectory Generation, in Fogel, D., editor, Proc. of the First IEEE Conference on Evolutionary Computation, IEEE Press, Piscataway, NJ, pp. 595–600.

16. Ishii, H., H. Shiode and T. Murata (1998) A Multiobjective Genetic Local Search Algorithm and Its Application to Flowshop Scheduling, IEEE Trans. on Systems, Man and Cybernetics, vol. 28, no. 3, pp. 392–403.

17. Kim, H. C., Y. Hayashi and K. Nara (1997) An Algorithm for Thermal Unit Maintenance Scheduling through Combined Use of GA, SA and TS, IEEE Transactions on Power Systems, vol. 12, no. 1, pp. 329–335.

18. Kirkpatrick, S., C. D. Gelatt and M. P. Vecchi (1983) Optimization by Simulated Annealing, Science, vol. 220, no. 4598, pp. 671–680.

19. Dertouzos, M. L. and A. K. Mok (1989) Multiprocessor On-line Scheduling of Hard-Real-Time Tasks, IEEE Transactions on Software Engineering, vol. 15, no. 12, pp. 392–399.

20. Yoo, M. and M. Gen (2005) Multimedia Tasks Scheduling Using Genetic Algorithm, Asia Pacific Management Review, vol. 10, no. 6, pp. 373–380.

21. Jackson, L. E. and G. N. Rouskas (2003) Optimal Quantization of Periodic Task Requests on Multiple Identical Processors, IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 8, pp. 795–806.

22. Monnier, Y., J. P. Beauvais and A. M. Deplanche (1998) A Genetic Algorithm for Scheduling Tasks in a Real-Time Distributed System, Proc. of 24th Euromicro Conference, pp. 708–714.

23. Gen, M. and R. Cheng (2000) Genetic Algorithms & Engineering Optimization, John Wiley & Sons.

24. Oh, J. and C. Wu (2004) Genetic-Algorithm-Based Real-Time Task Scheduling with Multiple Goals, Journal of Systems and Software, vol. 71, no. 3, pp. 245–258.

25. Deb, K. (2001) Multi-Objective Optimization Using Evolutionary Algorithms, John Wiley & Sons.

26. Al-Sharaeh, S. and B. E. Wells (1996) A Comparison of Heuristics for List Schedules Using the Box-Method and P-Method for Random Digraph Generation, Proc. of the 28th Southeastern Symposium on System Theory, pp. 467–471.

27. Theys, M. D., T. D. Braun, H. J. Siegel, A. A. Maciejewski and Y. K. Kwok (2001) Mapping Tasks onto Distributed Heterogeneous Computing Systems Using a Genetic Algorithm Approach, in Zomaya, A. Y., F. Ercal and S. Olariu, editors, Solutions to Parallel and Distributed Computing Problems, chapter 6, pp. 135–178, Wiley, New York.

28. Page, A. J. and T. J. Naughton (2005) Dynamic Task Scheduling Using Genetic Algorithms for Heterogeneous Distributed Computing, Proc. of 19th IEEE International Parallel and Distributed Processing Symposium, 189.1.

29. Dhodhi, M. K., I. Ahmad, A. Yatama and I. Ahmad (2002) An Integrated Technique for Task Matching and Scheduling onto Distributed Heterogeneous Computing Systems, Journal of Parallel and Distributed Computing, vol. 62, pp. 1338–1361.

30. Yoo, M. and M. Gen (2005) Multiobjective Genetic Algorithm for Real-Time Task Scheduling in a Heterogeneous Multiprocessors System, 6th International Symposium on Advanced Intelligent Systems, Yeosu, Korea, pp. 838–843.


Computational Intelligence in Visual Sensor Networks: Improving Video Processing Systems

Miguel A. Patricio, F. Castanedo, A. Berlanga, O. Perez, J. García, and Jose M. Molina

Applied Artificial Intelligence Group, Universidad Carlos III de Madrid, Avda. Universidad Carlos III, 22. 28270 – Colmenarejo, Madrid, Spain,
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Summary. In this chapter we will describe several approaches to develop video analysis and segmentation systems based on visual sensor networks using computational intelligence. We review the types of problems and algorithms used, and how computational intelligence paradigms can help to build competitive solutions. Computational intelligence is used here from an "engineering" point of view: the designer is provided with tools which can help in designing or refining solutions to cope with real-world problems. This implies having "a priori" knowledge of the domain (always imprecise and incomplete) to be reflected in the design, but without accurate mathematical models to apply. The methods used operate at a higher level of abstraction to include the domain knowledge, usually complemented with sets of pre-compiled examples and evaluation metrics to carry out an "inductive" generalization process.

1 Introduction

Processing multimedia information is becoming more and more important in video surveillance and sensor networks [1]. The particular conditions under which these systems operate require quite specialized solutions. The tracking algorithms used to segment multimedia and video data must handle complex situations, such as object interactions and occlusions, sudden manoeuvres, etc., and they are usually the most flexible and parametrical part of vision systems. Practically all systems exploit external information to model the scene, object behavior, context, etc. The configuration is done aiming at a trade-off between computational resources and system performance.

Multimedia surveillance systems are a new generation of architectural systems where many different media streams concur to provide an automatic analysis of the controlled environment and a real-time interpretation of the


scene [2]. Among the whole set of multimedia sources (images, audio, sensor signals, textual data, etc.), video is the most powerful media stream for gathering surveillance information.

Current video surveillance systems [3] are conceived to deal with a large number of cameras. The challenge of extracting useful data from a visual surveillance system can become an immense task if it stretches to a sizable number of cameras. Consequently, content-based retrieval of video data turns out to be a challenging and important problem. In this chapter, we present how computational intelligence paradigms are applied to infer semantic information automatically from raw video data. More precisely, we show the application of computational intelligence techniques, within the framework of visual sensor networks, to the improvement of the video procedures, from detection to the tracking process. In the next sections, we will show real developments of computational intelligence in video surveillance.

2 Related Works

2.1 Visual Sensor Networks

Visual sensor networks [3] are related to spatially distributed multi-sensor environments, which raise interesting challenges for surveillance. These challenges concern data fusion techniques to deal with the sharing of information gathered from different types of sensors [4], communication aspects [5], security of communications [5] and sensor management. These new systems are called "third-generation surveillance systems", and provide highly automated information, as well as alarm and emergency management. PRISMATICA [6] is an example of such systems. It consists of a network of intelligent devices that process sensor inputs. These devices send and receive messages to/from a central server module. The server module coordinates device activity, archives/retrieves data and provides the interface with a human operator. The design of a surveillance system with no server, avoiding this centralization, is reported in [4]. As part of the VSAM project, [4] presents a multi-camera surveillance system based on the same idea as [7]: the creation of a network of "smart" sensors that are independent and autonomous vision modules. The surveillance systems described above take advantage of progress in low-cost high-performance processors and multimedia communications. However, they do not account for the possibility of fusing information from neighboring cameras.

Third-generation surveillance systems [8] is the term usually used in the literature to refer to systems conceived to deal with a large number of cameras, a geographical spread of resources and many monitoring points, and to mirror the hierarchical and distributed nature of the human process of surveillance. From an image-processing point of view, they are based on the distribution of


processing capacities over the network and the use of embedded signal-processing devices to provide the scalability and potential robustness of distributed systems.

A multiagent visual sensor network is a distributed network of several intelligent software agents with visual capabilities [3]. An intelligent software agent is a computational process that has several characteristics [9]: (1) "reactivity" (allowing agents to perceive and respond to a changing environment), (2) "social ability" (by which agents interact with other agents) and (3) "proactiveness" (through which agents behave in a goal-directed fashion). Wooldridge and Jennings also give a stronger notion of agent, which additionally uses mental components such as beliefs, desires and intentions (BDI).

The main goals expected from a generic third-generation vision surveillance application, based on end-user requirements, are to provide good scene understanding, oriented to attracting the attention of the human operator in real time.

2.2 Intelligent Visual Tracking Systems

Intelligent visual tracking systems (IVTS) track all the targets moving within their local field of view. The IVTS implementation is arranged in a pipeline structure of several modules, as shown in Fig. 1; it directly interfaces with the image stream coming from a camera and extracts the track information of the mobile objects in the current frame. The interface between adjacent modules is symbolic data, and it is set up so that, for each module, different algorithms are interchangeable.

The main modules of the IVTS implementation are: (1) a detector process of moving objects; (2) an association process; (3) a prediction process; (4) a blob1 deleter; (5) a track updater.

The detector process (1) of moving objects must produce a list of the blobs found in a frame; this list must contain information about the position and size of each blob. Within the tracking process, and starting from the list of blobs obtained by the previous module, the association process (2) solves the problem of blob-to-track multi-assignment, in which several (or no) blobs may be assigned to the same track while, simultaneously, several tracks may overlap and share common blobs. The association problem to solve is thus the decision of the most proper grouping of blobs and their assignment to each track for each processed frame. The prediction process (3) uses the association made

Fig. 1. Intelligent visual tracking system implementation

1 Detected pixels which form compact regions.


by the tracking process and predicts where each track will move during the next frame; this prediction will be used by the tracking process in order to make the association. The blob deleter (4) eliminates those blobs that have not been associated with any track, since they are considered to be noise. The last main module, the track updater (5), updates the tracks obtained in the last frame with the information obtained from the previous modules for this frame.
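The five-module pipeline can be summarized structurally with the following Python sketch (illustrative names and interfaces of ours, not the chapter's implementation):

def process_frame(frame, tracks, detector, associator, predictor):
    blobs = detector.detect(frame)                 # (1) moving-object detection
    predictions = predictor.predict(tracks)        # (3) predicted position per track
    assignment = associator.associate(blobs, tracks, predictions)  # (2) blob-to-track
    blobs = [b for b in blobs if b in assignment]  # (4) unassociated blobs are noise
    for t in tracks:                               # (5) update each track with its blobs
        t.update([b for b in blobs if assignment[b] is t], predictions[t])
    return tracks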

A key aspect of an IVTS implementation is robust movement segmentation. Precisely this has been the objective of many research works [10]. Although plenty of techniques have been applied to video segmentation, it is still a difficult and unresolved problem in the general case and under complex situations. The basic aspects to address are: extraction of moving objects from the background, and precise separation of individual objects when their images appear close to each other [11].

2.3 Data Association Process

Tracking multiple visual targets, involving occlusions and a varying number of targets, is a challenging problem in IVTS. A primary task of the multi-target tracking (MTT) system is data association, namely, partitioning the measurements into disjoint sets, each generated by a single source (target or clutter). Target splitting and merging distinguish video data processing from other sensor data sources, forcing the data association (or correspondence) task to demand powerful and specific techniques.

Although plenty of techniques have been researched for video segmentation, it is still a difficult and unresolved problem in the general case with real situations. Detected pixels are first connected to form compact regions referred to as blobs. The tracker should re-connect these blobs to segment all targets from the background and track their motion, applying association and filtering processes [12]. Usual problems are clutter (false objects such as smoke, waving trees, etc.), occlusions, shadows, splits of objects into regions, and mergings of different objects due to overlaps.

Figure 2 illustrates an example where two targets (aircraft moving on parallel airport taxiways) are the source of several blobs separated from the background. The blobs from each aircraft should be grouped to track the individual trajectories, even during the partial occlusion, and "false" blobs corresponding to smoke should be wiped off.

The problem to solve, known as data association [13], is the decision of the most proper grouping of blobs and their assignment to tracks for each processed frame. The performance of the final system critically depends on the trade-offs considered in data association. Next, we briefly formulate this problem and describe the existing approaches, and then we describe our proposals to exploit the contextual information of visual trackers using different CI paradigms, such as fuzzy rules and the generalization of heuristic functions through evolutionary computation.


Fig. 2. Blob-to-track association problem

Although visual tracking has been extensively studied, most works assume that the motion correspondence problem is solved during image segmentation or is trivial, so that a simple strategy such as nearest neighbor (NN) is applied. The problem of object splitting and merging has recently received wider attention in the machine-vision community, from different points of view. Conventional data association systems, such as NN, MHT [14] or S-D [15], treat the problem as the minimization of a global cost function in a combinatorial space. As an alternative, all-neighbors approaches, such as Joint Probabilistic Data Association or PMHT [16], have also been applied to this problem: all blobs gated with each track are used to update it, requiring, besides, lower memory and computation.


Some proposals apply lower-level image information to address the problem. For instance, the W4 system [17] is based on low-level correlation operators to resolve occlusions and merging in people-group tracking.

3 Multiagent Visual Sensor Network: Overview

In [3], the authors developed a novel multiagent framework for deliberative camera-agents forming visual sensor networks. In this framework, each camera is represented and managed by an individual software agent, called a surveillance-sensor agent (SSA). SSAs are located at the same level (the sensor layer), which allows coordinated execution among SSAs. Each SSA knows only part of the information (partial knowledge due to its limited field of view) and has to make decisions with this limitation. Furthermore, each SSA tracks all the targets moving within its local field of view. The distributedness of this type of system supports the SSAs' proactivity, and the cooperation required among these agents to accomplish surveillance justifies the sociability of surveillance-sensor agents. The details of the multiagent visual sensor network architecture are described formally and more extensively in [1, 3, 18].

3.1 Cooperative Surveillance Multiagent Architecture

In order to provide a good understanding of the environment, each process involved in the surveillance system (in our case, agents) has to reason about the actions to take at each moment. This level of reasoning is not possible with low-level image-processing algorithms. Therefore, a multiagent system is necessary in order to provide the reasoning capabilities.

Using a multiagent architecture for video surveillance provides several advantages. First of all, the loosely coupled nature of the multiagent architecture allows more flexibility in the communication processes. Also, the ability to assign responsibilities to each agent is ideal for solving complex tasks in a surveillance system. These complex tasks involve the use of mechanisms such as coordination, dynamic configuration and cooperation, which are widely studied in the multiagent community.

Intelligence in artificial vision systems, such as our proposed framework [1, 3, 18], operates at different logical levels. At the first level, the process of scene interpretation from each sensor is carried out by a surveillance-sensor agent. At a second level, the information parsed by each individual surveillance-sensor agent is collected and fused. The fusion process is carried out by a fusion agent in the multiagent surveillance system. Finally, the surveillance process is distributed over several surveillance-sensor agents, according to their individual ability to contribute their local information to a desired global solution.


Fig. 3. CS-MAS architecture

A distributed solution has several advantages over a centralized solution from the points of view of scalability and fault tolerance. In our approach, distribution is obtained from a multiagent system where each camera is represented and managed by an individual autonomous software agent (surveillance-sensor agent). Each surveillance-sensor agent knows only part of the information (partial knowledge) and has to make decisions with this limitation. The distributed nature of this type of system supports the proactivity of surveillance-sensor agents; additionally, the cooperation required among them to accomplish the surveillance task justifies their sociability. The intelligence produced by the symbolic internal model of surveillance-sensor agents is based on a deliberation about the state of the outside world (including its past evolution) and the actions that may take place in the future.

Figure 3 shows the description of the multiagent architecture; as can be seen, there are six different types of agents:

1. Surveillance-sensor agent. It tracks all the targets moving within its local field of view and sends data to its related fusion agent. It coordinates with other surveillance-sensor agents in order to improve surveillance quality.

2. Fusion agent. This agent integrates the information sent by the associated surveillance-sensor agents. It analyzes the situation in order to manage the resources and to coordinate the associated surveillance-sensor agents during the fusion stage.

3. Record agent. This type of agent belongs to a specific camera and provides only recording features.

4. Planning agent. This agent provides a scene overview. It makes inferences on the targets and the situation.

5. Context agent. This type of agent provides context-aware information about the scene.

6. Interface agent. The input/output agent interface of the multiagent system. It provides a graphical user interface to the end user.

We use the Belief-Desire-Intention (BDI) model to implement the deliberation and reasoning from the images captured by the camera. The BDI model is one of the best known and most studied models of practical reasoning [19]. It is based on a philosophical model of human practical reasoning, originally developed by Bratman [20]. It reduces the explanation of complex human behavior to a motivational stance [21]. This means that the causes of actions are always related to human desires, ignoring other facets of human motivation to act. Finally, it also uses, in a consistent way, psychological concepts that closely correspond to the terms humans often use to explain their behavior. The foundation of most implemented BDI systems is the abstract interpreter proposed by Rao and Georgeff [19]. Although many ad hoc implementations of this interpreter have been applied to several domains, recently the release of JADEX [22] has been gaining increasing acceptance. JADEX facilitates FIPA-ACL communication between agents, and it is widely used to implement intelligent software agents.

The sociability of agents presumes some kind of communication between them. The most accepted agent communication schemes are those based on Speech-Act Theory (for instance, KQML and FIPA-ACL) [23]; we use FIPA-ACL as the communication language between the agents.

The internal technical aspects of the fusion agent can be consulted in [24], where a coordination approach is presented. Tracking results from [24] for three different cameras from the open computer vision data set PETS2006 are shown in Fig. 4.

In the next sections, we present the application of computational intelligence paradigms to the improvement of the surveillance-sensor agent procedures.

Fig. 4. Camera 1, camera 2, and camera 3 local tracking


4 Optimizing the Whole Tracking System by Means of Evolutionary Techniques

4.1 General Optimization Problem

As can be seen in Fig. 1, the surveillance-sensor agent is made up of several interconnected blocks, which can be grouped into five general subsystems: background estimation, detector, segmentation module, association block and tracking system (Fig. 5). Moreover, each of these blocks is regulated by a set of parameters.

The good performance of all the blocks is important for obtaining good final results. Indeed, errors made at the lower levels are very difficult to correct at the higher levels [25]. That is, if an object is not detected at the first stages of the system, it cannot be tracked and classified at the higher levels.

Hence, each of the blocks is regulated by a set of parameters that must be properly adjusted for the good performance of the whole system. For example, the detector threshold fixes the level at which the detector considers a pixel as movement, background variation or just background.

Thus, when adjusting these control parameters, we must have a criterion to measure the good or bad performance of the system. The core of this process is the evaluation of surveillance results, defining a metric to measure the quality of a proposed configuration [26].

Fig. 5. Information levels in the processing chain: results of the detector, segmentation and tracking modules


Moreover, the visual system must provide the best solution for the most general set of examples. Therefore, the system needs to work properly under different lighting or weather conditions and to perform well in various scenes in the case of a single movable camera.

As a result, the set of examples used to design and train the system must produce a general solution of the tracking system. A small set of examples can lead to over-fitting the parameters exactly to these specific scenarios, with the consequent loss of generality. On the contrary, randomly selected examples might produce disorientation in the search. Thus, the set of data that optimizes the search for the suitable parameters is defined as the ideal trainer [27].

Thus, the final goal is the search for the most general set of parameters to configure a whole video surveillance system and achieve optimal performance under representative and general situations.

4.2 Proposed Optimization Using ES

Our approach to achieving this goal follows several steps.

First of all, a set of evaluation metrics per track has been proposed to assess the input parameters. The core of the evaluation function uses metrics based on ground truth to evaluate the output quality of any configuration.

Next, we take a representative set of scenarios to train the system.

Then, the final aspect is the adjustment of these parameters. By using the evaluation metric mentioned above, we can apply different techniques to assess the control parameters in order to regulate the performance of the tracking system and subsequently optimize them. Classical optimization techniques, such as those based on gradient descent, are poorly suited to these types of problems due to the high number of local minima presented by the fitness landscape. More appropriate techniques are those based on evolutionary algorithms (EA), such as genetic algorithms (GA) or evolution strategies (ES) [28–30]. In particular, we select evolution strategies (ES) for this problem because they present high robustness and immunity to local extremes and discontinuities in the fitness function [31–33]. Therefore, the tool used to adjust the parameters is evolution strategies.

Finally, we need to propose a generalization algorithm that allows us to find the most suitable set of parameters for the surveillance system across different scenarios. The generalization method consists of combining the evaluation function of each track in several ways and steps in order to build a gradually more general fitness function.

The parameters that control our surveillance system, and that must be optimized in this particular application, are the following (an illustrative tuning sketch follows the list):

• THRESHOLD: defines whether a pixel should be considered as a moving target or as a variation in the background.
• MIN AREA: defines a minimum blob area in order to reduce false detections due to noise.
• MARGIN GATE: an outer gate defining the permissible area when searching for blobs separated from the estimated rectangular box enclosing the target.
• MINIMUM DENSITY: the minimum density required when blobs are connected in order to form a bigger blob that represents a target.
• CONFLICT: this parameter decides whether tracks are extrapolated or not when there is overlap between tracks.
• VARIANCE ACEL: the smoothing degree of the Kalman filter used in the tracker.
• MINIMUM TRACK AREA: defines a minimum track area in order to reduce wrong tracks, which probably contain fragments of the real targets.
• MARGIN INITIALIZATION: defines the protected areas around confirmed tracks, to avoid the creation of potential tracks there.
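A minimal (mu + lambda) evolution-strategy loop for tuning this parameter vector might look as follows in Python (a sketch under our own assumptions about bounds and population sizes; evaluate stands for the metric of Sect. 4.3, lower values being better):

import random

def es_optimize(evaluate, bounds, mu=5, lam=20, sigma=0.1, generations=50):
    # bounds: one (lo, hi) pair per control parameter of the list above
    parents = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            p = random.choice(parents)
            # Gaussian mutation scaled to each parameter's range, clipped to bounds
            offspring.append([min(hi, max(lo, x + random.gauss(0, sigma * (hi - lo))))
                              for x, (lo, hi) in zip(p, bounds)])
        # (mu + lambda) survivor selection on the evaluation metric
        parents = sorted(parents + offspring, key=evaluate)[:mu]
    return parents[0]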

4.3 Adjustment of the Surveillance System: Evaluation and Generalization

The performance evaluation calculates numerical values by means of a set of proposed metrics, based on the ground truth. This ground truth is the result of a study of pre-recorded video sequences and a subsequent process in which a human operator selects the coordinates of each target [34]. The coordinates of the targets are selected frame by frame; they are marked and bounded with rectangles, taking the upper-left and lower-right corners as the locations of the target objectives.

The evaluation system computes four parameters per target, which are classified into "accuracy metrics" and "continuity metrics" (Fig. 6):

• Accuracy Metrics:
  1. Overlap-area (OAP). Overlap area (in percentage) between the real target and the detected track (an illustrative computation is sketched after this list).
  2. X-error (Ex) and Y-error (Ey). Difference, in x and y coordinates, between their centers.
• Continuity Metrics:
  1. Number of tracks per target (NT). It is checked whether more than one detected track is matched with the same ideal track. If this happens, the program keeps the detected track with the larger overlapped-area value, removes the other one, and marks the frame with a flag indicating the number of detected tracks associated with this ideal one.
  2. Commutation (C). A commutation occurs when the identifier of a track matched to an ideal track changes. It typically takes place when the track is lost and recovered later.
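The overlap-area metric, for instance, can be computed from the two rectangles as in the following sketch (illustrative code of ours; boxes are given by their upper-left and lower-right corners, as in the ground truth):

def overlap_area_pct(gt, tr):
    # gt, tr: (x1, y1, x2, y2) boxes for the ground-truth target and the track
    (gx1, gy1, gx2, gy2), (tx1, ty1, tx2, ty2) = gt, tr
    w = max(0, min(gx2, tx2) - max(gx1, tx1))   # width of the intersection
    h = max(0, min(gy2, ty2) - max(gy1, ty1))   # height of the intersection
    gt_area = (gx2 - gx1) * (gy2 - gy1)
    return 100.0 * w * h / gt_area if gt_area else 0.0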


Fig. 6. Evaluation system

Fig. 7. Example of mismatched track. There are three tracks and only two targets

The evaluation function is based on the previous metrics, by means of a weighted sum of different terms which are computed for each target i in a scenario j:

ei,j = W1·M/T + (W2·∑(1 − OAP) + W3·∑EX + W4·∑EY)/CT + W5·OC + W6·UC + W7·∑C/T   (1)

with the terms defined as follows:

• Mismatch (M): a counter which stores how many times the ground truth and the tracked object data do not match up (refer to Fig. 7).
• The next three terms are the total sum of the non-overlapped areas (∑(1 − OAP)) and of the central errors along the x (∑EX) and y (∑EY) axes.
• The next two elements are two counters:
  – Overmatch-counter (OC): how many times the ground truth track is matched with more than one track object data.
  – Undermatch-counter (UC): how many times the ground truth track is not matched with any track at all.
• The number of commutations in the track under study (∑C).
• The continuity elements are normalized by the time length of the track, T, while the accuracy terms are normalized by the time length of the track being continuous, CT (i.e. when they can be computed).
• W1, ..., W7 are the relative weights of the terms. The highest values have been given to the continuity terms, since this aspect is the key to guaranteeing global viability.

In order to carry out a general evaluation (generalization algorithm) over different targets and cases, aggregation operators must be applied over the partial evaluations. The initial or basic function is the evaluation function per target (or track), where xi,j is the vector of metrics and θ the vector of parameters to optimize:

ei,j = f(xi,j, θ),   (2)

Thus, the extension of the evaluation function must allow assessing simultaneously:

• One or various targets per scenario. Scenario j: {e1,j, e2,j, ..., eNj,j}
• Various scenarios with several targets per scenario. M scenarios: {e1,1, e2,1, ..., eN1,1, ..., e1,j, e2,j, ..., eNj,j, ..., e1,M, e2,M, ..., eNM,M}

Two aggregation operators have been analysed:

• Sum:

  Ej = ∑_i ei,j,   (3)

  E = ∑_i ∑_j ei,j.   (4)

• Maximum (or minimax):

  Ej = max_i(ei,j),   (5)

  E = max_i(max_j(ei,j)).   (6)
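In code, the two operators reduce to the following sketch (illustrative), where e is a list of scenarios, each one a list of the per-target evaluations ei,j:

def aggregate_sum(e):
    return sum(sum(scenario) for scenario in e)     # equation (4)

def aggregate_minimax(e):
    return max(max(scenario) for scenario in e)     # equation (6): worst case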

In [35, 36], the authors showed that a significant improvement of the global vision system is achieved in terms of accuracy and robustness. With this optimization-based methodology, the inter-relation of parameters at different levels allows a coherent behavior under different situations. A generalization analysis has shown the capability to overcome over-adaptation when particular cases are considered, and a continuous improvement when additional samples are aggregated into the training process, comparing two different operators: sum and worst-case aggregation.

To validate our methodology, we compare our tracking system, tuned after the generalization process, against some existing methods. All of the following tracking systems are available in the open software of [37, 38].

• CC (Connected Component Tracking) [39].
• MS (Mean Shift Tracking) or Kernel Based Tracking [40, 41].
• MSPF (Particle Filter based on MS weight) [42].
• CCMSPF (Connected Component Tracking with MSPF resolver for collisions) [39].
• MSFG (Mean Shift Tracking using an FG mask) [43].
• CGA (Association by Compact Genetic Algorithm) [44].

The training videos used for the experiments consist of a set of three types of scenarios in an airport domain [36]. The scenarios represent a good training set, as they are long and varied enough to cover the most common situations of surface movements of aircraft and cars on the roads of an airport. The first video includes five targets: four cars and luggage vehicles (T1, T2, T3 and T4) and a big airplane (T5). The second and third sequences have three aircraft each (T1, T2 and T3). The second scenario is an easy situation, with three aircraft that can be tracked very easily. Moreover, we use a simple tracking system based on rules [36]. The experiments are carried out over this simple tracking system, following the methodology of adjustment and generalization explained before, with the two aggregation functions: minimax (Experiment I – Rules I) and sum (Experiment II – Rules II).

As can be checked in Table 1, and as we pointed out before, our method generalizes well, obtaining similar performance in all the cases. The trackers CCMSPF, CC, MS and MSFG show a brilliant behaviour in the second scenario, the easiest one to analyse, since there are only three big aircraft and no cars or buses. Nevertheless, all of these trackers perform poorly on the more difficult scenarios, in which there are big aircraft and small moving vehicles. Thus, we can check that our optimized tracking system has a performance between 11,000 and 14,564 for these difficult cases, whereas the rest of the systems present much higher values.

As a result, we can conclude that the optimization gives us a trade-off yielding similar performance in all the scenarios on which we have trained. We obtain a set of parameters that provides good performance for the different scenarios in an airport environment. In addition, we can highlight that good results are obtained with a very simple tracker after tuning it by means of the optimization methodology that we propose. On the other hand, more sophisticated trackers give good performance in easy scenarios (Video 2), whereas they cannot do as well in difficult situations where aircraft and small moving vehicles share the taxi-road (Video 1).


Table 1. Comparison of our tracking system (rules) after optimization against other tracking systems

Evaluation scenario   Rules I (minimax)   Rules II (sum)   CCMSPF      CC          MS           MSFG         MSPF         CGA
Video 1-T1            2,347.60            2,243.12         10,095.70   10,098.70   80,127.80    80,118.50    80,186.30    10,063.00
Video 1-T2            2,820.85            2,855.57         67.84       65.41       68.68        70.00        140,120.00   81.70
Video 1-T3            1,280.23            7,683.49         10,302.90   11,227.10   10,145.70    10,166.30    10,144.60    11,425.70
Video 1-T4            3,416.05            1,676.22         10,081.10   10,000.00   10,000.00    10,089.40    10,000.00    10,000.00
Video 1-T5            1,146.61            105.63           58.29       73.63       78.75        73.94        11,316.60    49.38
Sum 1                 11,011.34           14,564.03        30,605.83   31,464.84   100,420.93   100,518.14   251,767.50   31,619.78
Video 2-T1            494.70              7,506.24         63.63       66.40       72.90        68.06        13,130.10    480.60
Video 2-T2            2,095.89            10,970.60        66.70       65.26       84.47        74.15        13,556.90    5,770.31
Video 2-T3            787.59              4,523.21         65.12       66.60       76.76        65.85        11,728.20    4,568.79
Sum 2                 3,378.18            23,000.05        195.45      198.26      234.13       208.06       38,415.20    10,819.70
Video 3-T1            5,766.68            3,465.03         8,362.22    1,479.64    16,231.30    9,172.37     7,341.82     9,959.01
Video 3-T2            5,136.36            6,181.07         6,526.68    6,811.23    5,195.48     5,431.79     7,430.58     10,284.40
Video 3-T3            3,168.68            4,363.25         7,145.38    6,816.50    291.32       6,463.40     2,728.06     3,798.42
Sum 3                 14,071.72           14,009.35        22,034.28   15,107.37   21,718.10    21,067.56    17,500.46    24,041.83

The rules tracking results are used as benchmark for comparison


5 Computational Intelligence Paradigms for Video Data Association

5.1 Video Data Association Problem Definition

Video data association is a blob-to-track multi-assignment problem: several (or no) blobs may be assigned to the same track, and simultaneously several tracks may overlap and share common blobs. This can be formalized through the binary assignment matrix A[k], defined by Aij[k] = 1 if blob bi[k] is assigned to object oj, and Aij[k] = 0 otherwise.

The blobs extracted in the kth frame are b[k] = {b1[k], ..., bNk[k]}, and the objects tracked up to now are o[k − 1] = {o1[k − 1], ..., oMk[k − 1]}. The size of the matrix A[k], Nk × Mk, changes with time, since the number of blobs extracted depends on variable effects during image processing, and the number of objects also changes.

In many applications, a basic metric used for data association is the observation-to-track distance, dij, computed through the Mahalanobis formula [12]:

dij = [xj − fi]^t (Pj^−1) [xj − fi];  i = 1, ..., Nk, j = 1, ..., Mk.   (7)

The feature vector fi is extracted from the set of blobs corresponding to the jth track (the bi[k] such that Aij[k] = 1). The estimated state vectors xj, with state information and associated covariance Pj, are recursively updated with the assigned observations by means of a Kalman filter. In these approaches, the "optimal" decision would be the combination A[k] such that the sum of distances between assigned blobs and tracks is minimized:

min_A ∑_{i=1,j=1}^{Nk,Mk} dij.   (8)

The number of possible solutions for the Boolean matrix A is 2^(Nk·Mk), so it is generally impractical to find the optimal decision through exhaustive enumeration of all association hypotheses. Furthermore, it could even be useless, since this metric can be an oversimplification of the real problem.
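The per-pair distance of (7) is straightforward to compute; the following NumPy sketch (illustrative names of ours) evaluates it for one blob-group feature vector f_i against the state estimate x_j and covariance P_j of a track:

import numpy as np

def mahalanobis(f_i, x_j, P_j):
    r = np.asarray(x_j, dtype=float) - np.asarray(f_i, dtype=float)  # residual
    return float(r @ np.linalg.inv(P_j) @ r)   # [x - f]^t P^-1 [x - f]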

5.2 Fuzzy Association

The method proposed here, detailed in [45, 46], uses a fuzzy system to analyze interacting blobs and tracks. It computes "confidence levels" that are used to weight each gated blob's contribution to the update of the target track, including location and shape. Domain knowledge, represented as rules to compute the weights, is extracted from predefined situations (examples) to carry out an "inductive" generalization process covering all intermediate cases. This procedure is based on a simplified association technique (analogous to a JPDA approach), but complemented with a knowledge-based system to ponder the weights of blobs under uncertain conditions and so solve situations of high complexity.

Fig. 8. Fuzzy concepts used for video association

An explicit representation of target shape and dimensions is used in the association logic to select the set of updating blobs for each track. The weights of gated blobs are based on numeric heuristics (descriptors), computed with a simple geometrical analysis. They have been detailed in [45, 47] and are summarized next (see Fig. 8):

• Overlap. A "soft gating", computed as the fraction of the blob area contained within the track's predicted region.
• Density. It evaluates the ratio between the areas of detected regions and non-detected zones (holes) in the box enclosing the reconnected set of blobs. A low value indicates that they have probably been originated by different targets.
• Conflict. This component evaluates the likelihood of the blob being in conflict with other tracks. This problem appears when target trajectories are so close that the track gates overlap and share the blob.
• Coverage. The confidence in the predicted track is characterized with this heuristic, which assesses the confidence given to the fact that the track represents the motion of a real target. It is defined as the percentage of the predicted area covered by blobs corresponding to detected targets.

The previous heuristics are the inputs to relations indicating the confidence levels, both for blobs and predicted tracks, in the update process. A rulebase approximates these relations. A detailed description of the heuristics, translated into linguistic variables, sets and rules, appears in [45, 46, 48].


The estimated target shape is restricted to vary smoothly, according to the computed weights. The estimated position depends on both the blob and the track confidence levels. The estimated shape (dimensions of the box) is the most constrained feature, remaining "frozen" while the blob confidence levels are not high enough, whereas the estimated position is a trade-off between the confidence levels estimated for blobs and tracks. For instance, in the horizontal coordinate, the two gated blobs with the minimum and maximum extremes of the x coordinate, (xbmin, xbmax), are taken into account. The target shape, lH[k], is updated considering the minimum blob confidence value, αminH, and the value estimated for the last frame, lH[k − 1]:

lH[k] = αminH (xbmax − xbmin) + (1 − αminH) lH[k − 1].   (9)

So, the estimated target length (and width) is modified only if all blobs have enough confidence. Otherwise, if at least one blob has low confidence (for instance, during a multi-track conflict), the length and width are kept constant until full confidence is recovered. The estimated target bounds (location of the box) are updated close to the blob with the highest confidence, αmaxH, considering also the value of the track confidence. For instance, if the left-hand-side blob defining the value xbmin had the highest confidence, the target bounds would be updated taking the bound defined by this blob and the value predicted since the last update, xmin[k − 1]:

xmin[k] = αmaxH xbmin + (1 − αmaxH)(xmin[k − 1] + vx[k − 1] T),   (10)

xmax[k] = xmin[k] + lH[k].   (11)

Figure 9 shows an example of track shape update with two targets overlapping while they cross. Due to the conflicting blob, the rule locking the dimensions is applied. The bounds are computed to conform to the conflict-free blobs (those with high confidence levels for association), so the effect of the occlusion is minimized.
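Equations (9)–(11) translate directly into the following Python sketch (illustrative signature of ours; alpha_min and alpha_max stand for the smallest and largest blob confidence levels produced by the rulebase):

def update_horizontal(xb_min, xb_max, alpha_min, alpha_max,
                      lH_prev, xmin_prev, vx_prev, T):
    lH = alpha_min * (xb_max - xb_min) + (1 - alpha_min) * lH_prev            # (9)
    xmin = alpha_max * xb_min + (1 - alpha_max) * (xmin_prev + vx_prev * T)   # (10)
    return lH, xmin, xmin + lH                                                # (11)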

Beyond the representation of expert criteria in the rule base, learning techniques were exploited to automatically learn and tune the proposed system. A machine-learning procedure (a neuro-fuzzy technique) was applied in order to extract rules directly from examples, analyzing the capability to provide right decisions under different conditions. This automatic procedure was applied as an alternative way to tune the membership functions of the labels of the linguistic variables used to represent the knowledge.

In our work [47, 49], the fuzzy system for association used Mamdani implication. The Nauck/Kruse neuro-fuzzy approach was applied using directly this type of implementation of the implication operator.

Three different fuzzy systems were tested and compared with a conventional data association system. The rules of the first one were obtained using expert knowledge; the second integrated rules learned from pre-classified examples. The rigid scheme with "hardwired" decisions was taken as a benchmark and compared with the fuzzy systems, considering the three variants of rule sets mentioned above. This analysis was performed on representative scenarios processed to obtain and store the reference ground truth [49].


Fig. 9. Shape update during conflict

5.3 Video Association Using Estimation of Distribution Algorithms (EDAs)

Evolutionary algorithms (EAs) have been demonstrated to be effective general-purpose search techniques. One of the main problems with EAs is the adjustment of their parameters, especially those related to the crossover and mutation operators. Recently, a new family of algorithms has appeared that bases its behavior on the statistical modeling of genetic algorithms.

Estimation of Distribution Algorithms (EDAs) [50] replace the use of an evolving population by a vector that directly codifies the joint probability distribution of the vectors corresponding to the best solutions. The crossover and mutation operators are replaced by rules that update this probability distribution. A great advantage of EDAs over evolutionary algorithms is that they allow the interactions between the variables of the problem to be expressed by means of the associated joint probability distribution. In addition, they improve the convergence time and the memory space needed for their operation.

EDAs present suitable features for dealing with problems requiring a very efficient search: small populations and few iterations, compared with the more classical approaches to evolutionary algorithms (EAs). The fundamental difference between EDAs and classical EAs is that the former carry out a search


over the probability distribution describing the optimal solutions, while EAs perform the search directly and deliver the solutions to the problem as the solutions themselves. They share the need to codify solutions by means of binary strings (in EA terminology, the "individuals") and the definition of a merit measure that orients the direction of the search, the so-called "fitness function". In the case of EDAs, operators to manipulate individuals in the search, such as mutation, selection and crossover, are not needed, since the search is performed directly on the distribution which describes all possible individuals.

The high-level algorithms of an EDA and an EA are compared in the following pseudocodes.

EDA:

1. Generate a population randomly
2. Select a set of fitter individuals
3. Estimate a probabilistic model over the fitter individuals
4. Obtain a new set of individuals by sampling the probabilistic model
5. Incorporate the new set into the population
6. If the termination criterion is not satisfied, go to 2

EA:

1. Generate a population randomly2. Select a set of fitter individuals3. Obtain a new set of individual by means of applying crossover and muta-

tion operator4. Incorporate the new set into population5. If the termination criteria is not satisfied, go to 2

The key point in the use of EDAs is the estimation of the joint probability distribution. The simplest situation is that in which the joint probability distribution factorizes as a product of univariate and independent distributions, that is to say, there is no dependency between the variables. In this situation the estimation of the probability distribution is made using the marginal frequencies. EDAs are classified according to the dependencies between variables that they can capture. The simplest models assume independence between the variables; UMDA [51], PBIL [52] and CGA [53] are characteristic algorithms of this type. The MIMIC [54] algorithm incorporates bivariate dependencies, and an example of a model for multiple dependencies is FDA [55]. This algorithm uses a Bayesian network as its probabilistic model; this characteristic confers a great capacity for representing dependencies, but the computational cost is very high.
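To make the scheme concrete, the following minimal Python sketch implements a UMDA-style EDA with independent Bernoulli marginals; the function names, parameter values and the toy fitness function are illustrative assumptions, not taken from the works cited above.

    import numpy as np

    def umda(fitness, n_bits, pop_size=100, n_select=50, n_iter=200, seed=0):
        # UMDA: the joint distribution is modeled as a product of independent
        # Bernoulli marginals, re-estimated from the selected (fitter) individuals.
        rng = np.random.default_rng(seed)
        p = np.full(n_bits, 0.5)                  # initial marginal probabilities
        best, best_fit = None, -np.inf
        for _ in range(n_iter):
            pop = (rng.random((pop_size, n_bits)) < p).astype(int)  # sample the model
            fit = np.array([fitness(ind) for ind in pop])
            elite = pop[np.argsort(fit)[-n_select:]]                # select fitter set
            p = np.clip(elite.mean(axis=0), 0.02, 0.98)             # update distribution
            if fit.max() > best_fit:
                best_fit, best = fit.max(), pop[fit.argmax()].copy()
        return best, best_fit

    # toy usage: maximize the number of ones in a 20-bit string
    solution, score = umda(lambda x: int(x.sum()), n_bits=20)

Note how the crossover and mutation operators are absent: the only learning step is the re-estimation of the marginal probability vector from the selected individuals.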

Application of EDAs to the Video Association Problem

The association problem has been defined as a search over possible blob assignments. It can be formulated as the minimization of a heuristic function



that evaluates blob assignments, carried out by an efficient algorithm (an Estimation of Distribution Algorithm). The heuristic function takes a Bayesian approach to model the errors in observations. The formulation of data association as a minimization problem solved by a genetic technique is not a handicap with respect to the required real-time operation. A worst-case number of operations can be fixed, bounding the time consumed by the algorithm, if we restrict the maximum number of evaluations. Then, given a certain population size, the algorithm will run a number of generations limited by this bound on the number of evaluations. The most important aspect is that the EDA should converge to acceptable solutions under these conditions of limited population size and number of generations.

Heuristic to Evaluate Assignments

This section describes the search heuristic that determines the quality of the solutions and guides the search toward the optimal one. An extended distance is used as the evaluation function for groups of detected blobs assigned to tracks according to matrix A (A represents each hypothesis to be evaluated). The heuristic is aimed at providing a measure of the probability density of observations assigned to tracks. This likelihood function considers several types of terms and their probabilistic characterization: the separation between tracks and centroids of groups of blobs, the "similarity" between track-smoothed target attributes and those extracted from blob groups, and the events related to erasing existing tracks and creating new ones. As mentioned in the introduction, the final objective is to achieve a good trade-off between the capability to re-connect image regions, keeping a single track per target, while avoiding the mis-assignment of blobs coming from different objects or from extraneous sources.

The extended distance allows the evaluation of a certain hypothesis for grouping blobs into sets and assigning them to tracks. The term considering the centroid residual, typically used in other approaches, is enriched with terms for attributes, to take into account the available structural characteristics of targets which can be extracted from the data. There are also terms considering that hypotheses may label some blobs as false alarms or may leave confirmed tracks with no updating blobs:

log(P(b[k] | A[k], x_{1,...,M}[k − 1])) = log ∏_{jth track} D_{Group−Track}(j) = ∑_{jth track} log D_{Group−Track}(j) = ∑_{jth track} d_{Group−Track}(j)   (12)

If we denote the group of blobs assigned to the jth track as

Group(i)−Track(j) = {b_i[k] | A_{ij}[k] = 1},   (13)

then



d_{ij} = d_{Group(i)−Track(j)} = d_{Centroid(i,j)} + d_{Attributes(i,j)} + d_{PD(i,j)} + d_{PFA(i)},   (14)

where the sub-indices i, j refer to the ith group of blobs and the jth track:

• dCentroid(i,j): the normalized residual between the jth track prediction and the centroid of the group of blobs assigned under the ith hypothesis.

• dAttributes(i,j): the normalized residual between track attributes and those extracted from the group. Its value is given assuming Gaussian distribution and attribute independence.

• dPD(i,j): assesses the cost of not updating a confirmed track, for those hypotheses in which no blob is assigned to the jth track. It considers the probability of updating each track.

• dPFA(i): assesses the cost of labeling a blob as a false alarm, also assuming a certain probability of false alarm, PFA (a sketch assembling these terms follows this list).
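The following Python sketch illustrates how such an extended distance could be assembled; all names, the Mahalanobis-style centroid residual and the logarithmic cost terms are assumptions made for illustration, not the authors' implementation.

    import numpy as np

    def group_track_distance(group_centroid, track_pred, track_cov,
                             group_attrs, track_attrs, attr_var,
                             track_updated, p_d=0.9, n_false_alarms=0, p_fa=0.01):
        # dCentroid: normalized residual between track prediction and group centroid
        r = group_centroid - track_pred
        d_centroid = float(r @ np.linalg.inv(track_cov) @ r)
        # dAttributes: normalized attribute residuals, assuming Gaussian,
        # independent attributes
        d_attributes = float(np.sum((group_attrs - track_attrs) ** 2 / attr_var))
        # dPD: cost of leaving a confirmed track without an updating blob
        d_pd = 0.0 if track_updated else -2.0 * np.log(1.0 - p_d)
        # dPFA: cost of labeling blobs as false alarms
        d_pfa = -2.0 * n_false_alarms * np.log(p_fa)
        return d_centroid + d_attributes + d_pd + d_pfa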

Encoding and Efficient Search with EDA Algorithms

The association consists of finding the appropriate values for the assignment matrix A, where element A(i, j) is 1 if blob i is assigned to object j and 0 otherwise. In order to use evolutionary algorithm techniques, the matrix A is encoded as a string of bits, the size of matrix A being N×M, with N the number of extracted blobs and M the number of objects in the scene. A first possibility for problem encoding was tried with a string of integer numbers representing the possible M objects to be assigned to each blob, including the "null" track 0, as shown in Fig. 10.

This encoding requires strings of N⌈log2(1 + M)⌉ bits and has the problem of constraining the search to solutions in which each blob can belong to at most one object. This could be a problem in situations where images from different objects overlap, and may leave some tracks unassigned and lost. Therefore, a direct encoding of the A matrix was used for general solutions, where the positions in the string represent the assignments of blobs to tracks. With this codification, where individuals need N(1+M) bits, a blob can be assigned to several objects; see Fig. 11.
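A minimal Python sketch of this direct encoding, together with the nearest-object initialization discussed next; the array layout (column 0 acting as the "null" track) and the gate parameter are illustrative assumptions.

    import numpy as np

    def initial_individual(dists, gate):
        # Direct encoding of A as a flat string of N*(1+M) bits; column 0 plays
        # the role of the "null" track (false alarm). Each blob starts assigned
        # to its closest track, or to the null track if no track is within the gate.
        n_blobs, n_tracks = dists.shape
        A = np.zeros((n_blobs, 1 + n_tracks), dtype=int)
        nearest = dists.argmin(axis=1)
        for i in range(n_blobs):
            if dists[i, nearest[i]] <= gate:
                A[i, 1 + nearest[i]] = 1      # assign blob i to its closest track
            else:
                A[i, 0] = 1                   # label blob i as a false alarm
        return A.flatten()                    # bit string handed to the EDA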

Finally, in order to allow an effective search, the initial individuals are not randomly generated but are fixed to solutions in which each blob is assigned to the closest object. The search is thus performed over combinations starting from this solution, in order to optimize the heuristic after changing any part of this initial configuration.

Fig. 10. Simple encoding for blob assignment



Fig. 11. Direct encoding for whole A matrix

Fig. 12. Application of EDAs to maritime scenes

Besides, in the case of EDA algorithms, the probability-vector entries are constrained to be zero for very distant blob-track pairs, while blobs which fall within the spatial gates of more than one track have a non-zero change probability.

In [56], the authors present the application of EDAs to track boats in maritime scenes. This is a challenging problem due to the complex segmentation of these images: the continuous movement of the sea creates a great number of noisy blobs (see Fig. 12).

6 Conclusions

Several approaches based on computational intelligence have been applied to develop video and multimedia processing systems in visual sensor networks. Knowledge about the domain is exploited in the form of fuzzy rules for data association, and heuristic evaluation functions are used to optimize the design and guide the search for appropriate decisions. The results, referring to different works and mainly obtained with evaluation metrics based on ground truth, showed that these strategies yield competitive solutions in the context of their application domains. Furthermore, the proposed multi-agent architecture for cooperative operation will allow gains in scalability when deploying the system to cover wide areas.

References

1. F. Castanedo, M. A. Patricio, J. Garcia, and J. M. Molina. Extending surveillance systems capabilities using BDI cooperative sensor agents. In VSSN '06: Proceedings of the Fourth ACM International Workshop on Video Surveillance and Sensor Networks, pages 131–138, New York, NY, USA, 2006. ACM Press.

2. R. Cucchiara. Multimedia surveillance systems. In VSSN '05: Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks, pages 3–10, New York, NY, USA, 2005. ACM Press.

3. M. A. Patricio, J. Carbo, O. Perez, J. García, and J. M. Molina. Multi-agent framework in visual sensor networks. EURASIP Journal on Advances in Signal Processing, 2007: Article ID 98639, 21 pages, 2007. doi:10.1155/2007/98639.

4. R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. In Proceedings of the IEEE, volume 89, IEEE, October 2001.

5. C. S. Regazzoni, V. Ramesh, and G. L. Foresti. Special issue on video communications, processing, and understanding for third generation surveillance systems. In Proceedings of the IEEE, volume 89, October 2001.

6. B. P. L. Lo, J. Sun, and S. A. Velastin. Fusing visual and audio information in a distributed intelligent surveillance system for public transport systems. Acta Automatica Sinica, 29(3):393–407, 2003.

7. X. Yuan, Z. Sun, Y. Varol, and G. Bebis. A distributed visual surveillance system. In IEEE Conference on Advanced Video and Signal Based Surveillance, pages 199–205, Florida, 2003.

8. M. Valera and S. A. Velastin. Intelligent distributed surveillance systems: a review, 152:192–204, April 2005.

9. M. Wooldridge and N. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 1995.

10. O. Perez, M. A. Patricio, J. García, and J. M. Molina. Improving the segmentation stage of a pedestrian tracking video-based system by means of evolution strategies. In Eighth European Workshop on Evolutionary Computation in Image Analysis and Signal Processing, EvoIASP 2006, Budapest, Hungary, April 2006.

11. E. Y. Kim and S. H. Park. Automatic video segmentation using genetic algorithms. Pattern Recognition Letters, 27(11):1252–1265, 2006.

12. Samuel S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, Inc., 1999.

13. D. L. Hall and J. Llinas. Handbook of MultiSensor Data Fusion. CRC Press, Boca Raton, 2001.

14. Ingemar J. Cox and Sunita L. Hingorani. An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):138–150, 1996.

15. K. Pattipati, S. Deb, and Y. Bar-Shalom. A new relaxation algorithm and passive sensor data association. IEEE Transactions on Automatic Control, 37:198–213, 1992.

16. Y. Ruan and P. Willett. Multiple model PMHT and its application to the benchmark radar tracking problem. IEEE Transactions on Aerospace and Electronic Systems, 40(4):1337–1350, October 2004.

17. I. Haritaoglu, D. Harwood, and L. S. David. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, 2000.

18. O. Perez, M. A. Patricio, J. Garcia, and J. M. Molina. Fusion of surveillance information for visual sensor networks. In Proceedings of the Ninth International Conference on Information Fusion, Florence (Italy), July 2006.

19. A. Rao and M. Georgeff. BDI agents: from theory to practice. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS'95), pages 312–319, Cambridge, MA, USA, 1995. The MIT Press.

20. M. E. Bratman. Intentions, Plans and Practical Reasoning. Harvard University Press, Cambridge, MA, 1987.

21. D. Dennett. The Intentional Stance. Bradford Books, 1987.

22. A. Pokahr, L. Braubach, and W. Lamersdorf. Jadex: Implementing a BDI infrastructure for JADE agents. Search of Innovation (Special Issue on JADE), 3(3):76–85, September 2003.

23. Y. Labrou, T. Finin, and Y. Peng. Agent communication languages: The current landscape. IEEE Intelligent Systems, 14(2):45–52, 1999.

24. F. Castanedo, M. A. Patricio, J. Garcia, and J. M. Molina. Bottom-up/top-down coordination in a multiagent visual sensor network. In 2007 IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2007). IEEE Computer Society, 2007.

25. P. J. Withagen. Object detection and segmentation for visual surveillance. ASCI dissertation series number 120, Advanced School for Computing and Imaging (ASCI), Delft University of Technology, 2005.

26. P. Lobato Correia and F. Pereira. Objective evaluation of video segmentation quality. IEEE Transactions on Image Processing, 12(2):186–200, 2003.

27. B. W. Wah. Generalization and generalizability measures. IEEE Transactions on Knowledge and Data Engineering, volume 11, pages 175–186, 1999.

28. I. Rechenberg. Evolutionsstrategie. Friedrich Fromman Verlag, Stuttgart, Germany, 1973.

29. I. Rechenberg. Evolutionsstrategie '94. Friedrich Fromman Verlag, Stuttgart, Germany, 1994.

30. Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies: A comprehensive introduction. Springer, Netherlands, 2004.

31. T. Back. Evolutionary Algorithms in Theory and Practice. Oxford University Press, New York, 1996.

32. D. B. Fogel, T. Back and Z. Michalewicz. Evolutionary Computation: Advanced Algorithms and Operators. Institute of Physics, London, 2000.

33. D. B. Fogel, T. Back and Z. Michalewicz. Evolutionary Computation: Basic Algorithms and Operators. Institute of Physics, London, 2000.

34. D. Doermann and D. Mihalcik. Tools and techniques for video performance evaluation. In Proceedings of the International Conference on Pattern Recognition (ICPR'00), pages 4167–4170, Barcelona, Spain, September 2000.

35. J. Garcia, J. A. Besada, A. Berlanga, J. M. Molina, G. de Miguel, and J. R. Casar. Application of evolution strategies to the design of tracking filters with a large number of specifications. 8:766–779, 2003.

36. O. Perez, J. García, A. Berlanga, and J. M. Molina. Evolving parameters of surveillance video systems for non-overfitted learning. In Proceedings of the Seventh European Workshop on Evolutionary Computation in Image Analysis and Signal Processing (EvoIASP05), pages 386–395, 2005.

37. OpenCV. intel.com/technology/computing/opencv/index.htm, 2007.

38. T. P. Chen, H. Haussecker, A. Bovyrin, R. Belenov, K. Rodyushkin, A. Kuranov, and V. Eruhimov. Computer vision workload analysis: Case study of video surveillance systems. 9(2):109–118, May 2005.

39. D. da Silva Pires, R. M. Cesar-Jr, M. B. Vieira, and L. Velho. Tracking and matching connected components from 3D video. In Proceedings of the XVIII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI05), 2005.

40. D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, 1999.

41. D. Comaniciu and V. Ramesh. Real-time tracking of non-rigid objects using mean shift, July 8, 2003. US Patent 6,590,999.

42. B. Zhang, W. Tian, and Z. Jin. Joint tracking algorithm using particle filter and mean shift with target model updating. Chinese Optics Letters, 4:569–572, 2006.

43. L. Li, W. Huang, I. Y. H. Gu, and Q. Tian. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13(11):1459–1472, 2004.

44. F. Cupertino, E. Mininno, and D. Naso. Elitist compact genetic algorithms for induction motor self-tuning control. In 2006 IEEE Congress on Evolutionary Computation (CEC 2006), pages 3057–3063, 2006.

45. J. García, J. M. Molina, J. A. Besada, and J. I. Portillo. A multitarget tracking video system based on fuzzy and neuro-fuzzy techniques. EURASIP Journal on Applied Signal Processing, 14:2341–2358, 2005.

46. J. García, J. A. Besada, J. M. Molina, J. Portillo, and J. R. Casar. Robust object tracking with fuzzy shape estimation. In FUSION '02: Proceedings of the International Conference on Information Fusion, Washington, DC, USA, 2002. IEEE ISIF.

47. J. M. Molina, J. García, O. Perez, J. Carbo, A. Berlanga, and J. Portillo. Applying fuzzy logic in video surveillance systems. Mathware and Soft Computing, 12(3):185–198, 2005.

48. J. García, J. A. Besada, J. M. Molina, J. I. Portillo, and G. de Miguel. Fuzzy data association for image-based tracking in dense scenarios. In David B. Fogel, Mohamed A. El-Sharkawi, Xin Yao, Garry Greenwood, Hitoshi Iba, Paul Marrow, and Mark Shackleton, editors, Proceedings of the 2002 Congress on Evolutionary Computation CEC2002. IEEE Press, 2002.

49. J. García, O. Perez, A. Berlanga, and J. M. Molina. An evaluation metric for adjusting parameters of surveillance video systems. Chapter in Computer Vision and Robotics. Nova Science Publishers, 2004.

50. P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer, Norwell, MA, USA, 2001.

51. H. Mühlenbein. The equation for response to selection and its use for prediction. Evolutionary Computation, 5(3):303–346, 1997.

52. S. Baluja. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA, 1994.

53. G. R. Harik, F. G. Lobo, and D. E. Goldberg. The compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 3(4):287, November 1999.

54. Jeremy S. de Bonet, Charles L. Isbell, Jr., and Paul Viola. MIMIC: Finding optima by estimating probability densities. In Michael C. Mozer, Michael I. Jordan, and Thomas Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 424. The MIT Press, Cambridge, MA, 1997.

55. H. Mühlenbein and T. Mahnig. The factorized distribution algorithm for additively decomposed functions. In 1999 Congress on Evolutionary Computation, pages 752–759, Piscataway, NJ, 1999. IEEE Service Center.

56. M. A. Patricio, J. García, A. Berlanga, and J. M. Molina. Video tracking association problem using estimation of distribution algorithms in complex scenes. In Artificial Intelligence and Knowledge Engineering Applications: A Bioinspired Approach: First International Work-Conference on the Interplay Between Natural and Artificial Computation, Lecture Notes in Computer Science. Springer, Berlin Heidelberg New York, 2007.


Scalability and Evaluation of Contextual Immune Model for Web Mining

Sławomir T. Wierzchoń1,2, Krzysztof Ciesielski1, and Mieczysław A. Kłopotek1,3

1 Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warszawa, Poland
  stw,kciesiel,[email protected]
2 Faculty of Mathematics, Physics and Informatics, Gdansk University, Wita Stwosza 57, 80-952 Gdansk-Oliwa
3 Institute of Computer Science, University of Podlasie, Konarskiego 2, 08-110 Siedlce

Summary. In this chapter we focus on some problems concerning the application of an immune-based algorithm to extraction and visualization of cluster structure. In particular, a hierarchical, topic-sensitive approach is proposed; it appears to be a robust solution to the problem of scalability of the document map generation process (both in terms of time and space complexity). This approach relies upon extraction of a hierarchy of concepts, i.e. almost homogeneous groups of documents described by unique sets of terms. To represent the content of each context a modified version of the aiNet [9] algorithm is employed; it was chosen because of its natural ability to represent internal patterns existing in a training set. A careful evaluation of the effectiveness of the novel text clustering procedure is presented in the section reporting experiments.

1 Introduction

When analyzing the number of terms per query in one billion accesses to the Altavista site [12], extraordinary results were observed by Alan Gilchrist: (a) in 20.6% of queries no term was entered, (b) in almost 25% of queries only one term was used in a search, and (c) the average was not much higher than two terms! This justifies our interest in looking for more "user-friendly" interfaces to web browsers.

A first stage in improving the effectiveness of Information Retrieval (IR) systems was to apply the idea of clustering, inspired by earlier studies of Salton [21] and reinvigorated by Rijsbergen's Cluster Hypothesis [24]. According to this hypothesis, relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. Thus, it is possible to reduce the number of documents that need to be compared to a given



query, as it suffices to match the query against cluster representatives first. However, such an approach offers only a technical improvement in searching relevant documents. A more radical improvement can be gained by using so-called document maps [2], where a graphical representation additionally allows information about the relationships of individual documents or groups of documents to be conveyed. Document maps are primarily oriented towards visualization of similarity relations within a collection of documents, although other usages of such maps are possible – consult Chap. 5 in [2] for details.

The most prominent representative of this direction is the WEBSOM project1. Here the self-organizing map (SOM) algorithm [19] is used to organize miscellaneous text documents onto a two-dimensional grid so that related documents appear close to each other. Each grid unit contains a set of closely related documents. The color intensity reflects dissimilarity among neighboring units: the lighter the shade, the more similar the neighboring units are. Unfortunately, this approach is time and space consuming, and raises questions of scaling and updating of document maps (although some improvements are reported in [20]). To overcome some of these problems the DocMINER system was proposed in [2]. It combines a number of methods from explorative data analysis to effectively support information access for knowledge management tasks. In particular, a given collection of documents, represented as vectors in a highly dimensional vector space, is moved – by a multidimensional scaling algorithm – into a so-called semantic document space in which document similarities are reinforced. Then the topological structure of the semantic document space is mapped to a two-dimensional grid using the SOM algorithm.

Still, the profound problem of map-like representation of document collections is the issue of scalability, which is strongly related to high dimensionality. While multidimensional scaling and other specialized techniques, like PCA, versions of SVD, etc., reduce the dimensionality of the space formally, they may result in increased complexity of document representation (a document which had a low number of non-zero coordinates in the high-dimensional space may have more non-zero coordinates in the reduced space). So some other way of dimensionality reduction, via feature selection rather than feature construction, should be pursued.

Note that a map of a document collection is a new kind of clustering, where not only are the documents split into groups, but there also exists a structural relationship between clusters, reflected by the topology of the map. We can say we are dealing with cluster networking. This affects the closely related issue of evaluating the quality of the obtained clusters. Usually the quality evaluation function is a driving factor behind the clustering algorithm and hence partially determines its complexity and success. While conventional external and internal cluster evaluation criteria (like class purity, class uniformity, inter-class dissimilarity) are abundant, they are primarily devised

1 Details and a full bibliography concerning WEBSOM can be found at the web page http://websom.hut.fi/websom/.



to evaluate sets of independent (not linked) clusters; there exist no satisfactory evaluation criteria for cluster network quality. Besides SOM, there are other clustering methods, like growing neural gas (GNG) [11] or artificial immune systems (AIS) [9, 25], that face similar problems.

In our research project BEATCA [18], oriented towards exploration and navigation in large collections of documents, a fully-fledged search engine capable of representing on-line replies to queries in graphical form on a document map has been designed and constructed [16]. A number of machine-learning techniques, like a fast algorithm for Bayesian network construction [18], SVD analysis, GNG [11], the SOM algorithm, etc., have been employed to realize the project. BEATCA extends the main goals of WEBSOM by a multilingual approach and new forms of geometrical representation (besides rectangular maps, projections onto sphere and torus surfaces are possible).

The process of document map creation is rather complicated and consists of the following main stages: (1) document crawling, (2) indexing, (3) topic identification, (4) document grouping, (5) group-to-map transformation, (6) map region identification, (7) group and region labeling, and finally, (8) visualization. At each of these stages various decisions can be made, implying different views of the document map.

Within such a framework, in this chapter we propose a new solution to the problem of scalability and of evaluation of the quality of the cluster network. In particular, the contribution of this chapter concerns: (1) the invention of a new artificial immune algorithm for handling large-scale document collections, to replace the traditional SOM in document map formation, (2) the invention of a new representation of the document space, in which instead of single-point statistics of terms their distributions (histograms) are exploited, (3) the invention of a measure of quality of networked clustering of document collections, which is based on the above-mentioned histograms, and which evaluates both the quality of the clustering of documents into groups and the usefulness of the inter-group links. These new features are of particular value within our framework of contextual document space representation, described in earlier publications, allowing for a more radical intrinsic dimensionality reduction and permitting efficient and predominantly local processing of documents.

In Sect. 2 we present our hierarchical, topic-sensitive approach, which appears to be a robust solution to the problem of scalability of the map generation process (both in terms of time complexity and space requirements). It relies upon extraction of a hierarchy of concepts, i.e. almost homogeneous groups2 of documents. Any homogeneous group is called here a "context", in which further document processing steps – like computation of term-frequency related measures, keyword extraction, and dimensionality reduction – are carried out, so that each context is described by a unique set of terms. To represent the content of each context a modified version of the aiNet algorithm [10] was

2 By a homogeneous group we understand hereafter a set of documents belonging to a single cluster after a clustering process.



employed – see Sect. 3. This algorithm was chosen because of its ability to represent internal patterns existing in a training set. More precisely, aiNet produces a compressed representation of the data vectors through a process resembling data edition. Next, this reduced representation is clustered; the original aiNet algorithm uses hierarchical clustering [10], while we propose an original and much more efficient procedure.

Further, the method of representing documents and groups of documents in the vector space was enriched: instead of a traditional single-point measure we apply histograms of term occurrence distributions in some conceptual space, so that document content patterns can be matched in a more refined way – see Sect. 4 for details.

To evaluate the effectiveness of the novel text clustering procedure, it has been compared to the aiNet and SOM algorithms in Sect. 5. In the experimental Sects. 5.6–5.8 we have also investigated issues such as the evaluation of immune network structure and the influence of the chosen antibody/antigen representation on the resulting immune memory model. Final conclusions are given in Sect. 7.

1.1 Document Maps

Before going into details, let us devote a little attention to the concept of a document map as such. Formally, a document map can be understood as a two-dimensional rectangle (or any other geometrical figure) split into disjoint areas, usually squares or hexagons3, called "cells". To each cell a set of documents is assigned; thus a single cell may be viewed as a kind of document cluster. The cells are frequently clustered into so-called regions on the grounds of similarity of their content. The cells (and regions) are labeled by the keywords best describing the cell/region content, where "best describing" is intended to mean characterizing the whole cell/region while distinguishing it from the surrounding cells/regions. A document map is visualized in such a way that cell colors (or textures) represent the number of documents a cell contains, the degree of similarity to the surrounding cells, the importance of documents (e.g. PageRank), the count of documents retrieved in the most recent query, or any other feature significant from the point of view of the user. The labels of some cells/regions are also displayed, but with a "density" that does not impair the overall readability. Optionally, labels may be displayed in a "mouse-over" fashion.

2 Contextual Local Networks

In our approach – like in many traditional IR systems – documents are mapped into a T-dimensional term vector space. The points (documents) in this space are of the form (w_{1,d}, . . . , w_{T,d}), where T stands for the number of terms, and

3 For non-Euclidean geometries other possibilities exist – cf. [18].



each w_{t,d} is a weight for term t in document d, the so-called term frequency/inverse document frequency (tfidf) weight:

w_{t,d} = w(t, d) = f_{td} · log(N / f_t),   (1)

where f_{td} is the number of occurrences of term t in document d, f_t is the number of documents containing term t, and N is the total number of documents.
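As an illustration, a compact Python sketch of the tfidf weighting of equation (1); the document representation (lists of tokens) is an assumption made for the example.

    import math
    from collections import Counter

    def tfidf(docs):
        # docs: list of token lists; returns, per document, a dict
        # term -> w(t, d) = f_td * log(N / f_t), as in equation (1)
        N = len(docs)
        f_t = Counter(t for d in docs for t in set(d))   # document frequencies
        return [{t: f_td * math.log(N / f_t[t]) for t, f_td in Counter(d).items()}
                for d in docs]

    # toy usage
    docs = [["immune", "network"], ["immune", "map", "map"], ["web", "map"]]
    weights = tfidf(docs)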

The vector space model has been criticized for some disadvantages, among others polysemy and synonymy [3]. To overcome these disadvantages a contextual approach has been proposed [18], relying upon dividing the set of documents into a number of homogeneous and disjoint subgroups (clusters). During the dimensionality reduction process, each of the clusters, also called "contexts" (for reasons that will become obvious later), will be described by a unique subset of terms.

In the sequel we distinguish between the hierarchical and the contextual model of document treatment. In the former, the dimensionality reduction process is run globally, for the whole collection of documents, so that the terms used for document description are identical for each subgroup of documents, and the computation of the tfidf weights defined in equation (1) is based on the whole document collection. In the latter model, the dimensionality reduction process is run separately for each subgroup, so that each subgroup may be described by a different subset of terms, weighted in accordance with equation (4). Finally, whenever we do not carry out a clustering of documents and we construct a single, "flat" representation for the entire collection, we speak of a global model.4

The contextual approach consists of two main stages. In the first stage a hierarchical model is built, i.e. a collection D of documents is recurrently divided – using the Fuzzy ISODATA algorithm [4] – into homogeneous groups consisting of approximately identical numbers of elements. Such a procedure results in a hierarchy represented by a tree of clusters. The process of partitioning halts when the number of documents inside each group meets predefined criteria5. To compute the distance dist(d, v) of a document d from a cluster centroid v, the cosine distance is used:

dist(d, v) = 1 − ⟨d/‖d‖, v/‖v‖⟩ = 1 − (d/‖d‖)^T (v/‖v‖),   (2)

where the symbol ⟨·, ·⟩ stands for the dot product of two vectors. Given m_{dG}, the degree of membership of a document d in a group G (obtained via the Fuzzy ISODATA algorithm), the document is assigned to the group with the highest value of m_{dG}.

4 The principal difference between the "hierarchical" and the "global" models is that in the hierarchical model we distinguish a number of clusters, while in the global model we treat the whole collection as a single cluster.

5 Currently a single criterion is used, saying that the cardinality c_i of the ith cluster cannot exceed given boundaries [c_min, c_max]. This way the maps created for each group at the same level of a given hierarchy will contain similar numbers of documents.



The second phase of contextual document processing relies upon division of the term space (dictionary) into – possibly overlapping – subspaces of terms specific to each context (i.e. each group extracted in the previous stage). The fuzzy membership level m_{tG}, representing the importance of a particular term t in a given context G, is computed as:

m_{tG} = ∑_{d∈G} (f_{td} · m_{dG}) / (f_G · ∑_{d∈G} m_{dG}),   (3)

where f_G is the number of documents in the cluster G, m_{dG} is the degree of membership of document d in group G, and f_{td} is the number of occurrences of term t in document d. We assume that a term t is relevant for a given context G if m_{tG} > ε, where ε is a parameter.

Removing non-relevant terms leads to a topic-sensitive reduction of the dimension of the term space. This reduction results in a new vector representation of documents; each component of the vector is computed according to the equation:

w_{tdG} = f_{td} · m_{tG} · log(f_G / (f_{tG} · m_{tG})),   (4)

where f_{tG} is the number of documents in the group G containing term t.
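As an illustration, the following Python sketch computes the contextual term memberships of equation (3) and the contextual weights of equation (4); the input format (term-count dictionaries plus fuzzy memberships) and the handling of the relevance threshold ε are assumptions made for the example, not the BEATCA implementation.

    import math

    def contextual_weights(group, membership, eps=0.01):
        # group: term-count dicts of the documents in context G;
        # membership[i] = m_dG, the fuzzy membership of the i-th document
        f_G = len(group)
        sum_m = sum(membership)
        m_tG = {}                                        # equation (3)
        for counts, m_dG in zip(group, membership):
            for t, f_td in counts.items():
                m_tG[t] = m_tG.get(t, 0.0) + f_td * m_dG
        m_tG = {t: v / (f_G * sum_m) for t, v in m_tG.items()}
        f_tG = {t: sum(1 for c in group if t in c) for t in m_tG}
        # equation (4), keeping only the relevant terms (m_tG > eps)
        return [{t: f_td * m_tG[t] * math.log(f_G / (f_tG[t] * m_tG[t]))
                 for t, f_td in counts.items() if m_tG[t] > eps}
                for counts in group]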

To depict the similarity relation between contexts (represented by a set of contextual models), an additional "global" map is required. Such a model becomes the root of the contextual map hierarchy. The main map is created in a manner similar to the previously created maps, with one distinction: an example in the training data is a weighted centroid of the referential vectors of the corresponding contextual model, v_i = ∑_{c∈C_i} (|c| · v_c), where C_i is the set of antibodies6 in the ith contextual model (obtained from Fuzzy ISODATA), |c| is the density of the antibody, i.e. the number of assigned documents, and v_c is its referential vector.

The whole process of learning the contextual model (summarized in Algorithm 1) is to some extent similar to hierarchical learning [13]. However, in our approach each constituent model, and the corresponding contextual map, can be processed independently (in particular, in parallel). Also, a partial incremental update of such a model appears to be much easier to perform, in terms of model quality, stability and time complexity. The possibility of incremental learning stems from the fact that the very nature of the learning process is iterative. So if new documents come, we can consider the learning process as having been stopped at some stage and resumed with all the documents. We claim that it is not necessary to start the learning process from scratch, neither in the case that the new documents "fit" the distribution of the previous ones nor when their term distribution is significantly different. This claim is supported by experimental results presented, e.g., in [18].

6 This notion is explained in Sect. 3.1.



Algorithm 1 Scheme of the meta-algorithm of contextual processing

1. Index the whole set of documents and collect global frequency statistics for terms
2. Create a global vector representation of documents and identify globally significant terms (global reduction of dimensionality)
3. Identify major themes in the document collection
4. Based on the global representation and major themes, carry out fuzzy splitting of the document collection and reduce the term space
5. Create initial contextual groups and compute contextual statistics for the terms
6. Identify locally significant terms and create contextual vector representations for the individual groups
7. Create the contextual model (a hierarchy of network models, based on local vector representations)
8. Create a map-like visualization of the contextual model and find labels for document groups (network nodes and map cells)
9. Adapt the existing model in response to changes of objective factors (data changes) or subjective factors (personalization, response to changes in the user profile):
   a) Modify local statistics of individual contexts and modify vector representations taking into account the significance of terms
   b) Modify the existing split into contexts
   c) Start incremental learning of the existing contextual models
   d) Create a new map-like visualization of the modified contextual model and update the group and cell labels

3 Immune Approach to Text Data Clustering

One of the main goals of the BEATCA project was to create multidimensional document maps in which geometrical vicinity reflects conceptual closeness of documents in a given document set. Additional navigational information (based on hyperlinks between documents) can be introduced to visualize directions and strengths of between-group topical connections.

Clustering and content labeling are crucial issues for a user's understanding of the two-dimensional map. We started our research with the WEBSOM approach, which appeared to be unsatisfactory: neither its speed nor its clustering stability were very encouraging.

In the SOM algorithm [19], each unit of a K × K grid contains a so-called reference vector v_i, whose dimension agrees with the dimension of the training examples. The training examples are repeatedly presented to the network until a termination criterion is satisfied. When an example x(t) is presented at time t to the network, the reference vectors are updated according to the rule

v_i(t + 1) = v_i(t) + α_i(t) · (x(t) − v_i(t)),   i = 1, . . . , K·K,   (5)



where α_i(t) is the so-called learning rate, varying according to the equation:

α_i(t) = ε(t) · exp(−dist(i, w) / σ²(t)).   (6)

Here ε(t) and σ(t) are two user-defined monotonically decreasing functions of time called, respectively, the step size (or cooling schedule) and the neighborhood radius. The symbol dist(i, w) stands for the distance (usually the Manhattan distance) between the ith unit and the so-called winner unit (i.e. the unit whose reference vector is most similar to the example x(t)).
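For concreteness, one SOM update step per equations (5)-(6) might look as follows in Python; the exponential decay shapes chosen for ε(t) and σ(t) are an assumption, since the text leaves their exact form to the user.

    import numpy as np

    def som_step(grid, x, t, T, eps0=0.5, eps1=0.01, sigma1=0.5):
        # grid: (K, K, dim) array of reference vectors; one update per (5)-(6)
        K = grid.shape[0]
        eps = eps0 * (eps1 / eps0) ** (t / T)                # step size eps(t)
        sigma = (K / 2.0) * (sigma1 / (K / 2.0)) ** (t / T)  # neighborhood radius
        # the winner: the unit whose reference vector is most similar to x
        w = np.unravel_index(np.argmin(((grid - x) ** 2).sum(-1)), (K, K))
        rows, cols = np.indices((K, K))
        dist = np.abs(rows - w[0]) + np.abs(cols - w[1])     # Manhattan distance to winner
        alpha = eps * np.exp(-dist / sigma ** 2)             # equation (6)
        grid += alpha[..., None] * (x - grid)                # equation (5)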

The main deficiencies of SOM are (cf. [1]): (a) it is order dependent, i.e. the components of the final weight vectors are affected by the order in which training examples are presented, (b) the components of these vectors may be severely affected by noise and outliers, (c) the size of the grid, the step size and the size of the neighborhood must be tuned individually for each data set to achieve useful results, and (d) its computational complexity is high.

GNG [11] uses the same equation (5) to update reference vectors, but with a fixed learning rate α. Furthermore, its output is a graph rather than a grid. The main idea is that, starting from very few nodes (typically two), one new node is inserted every λ iterations near the node featuring the largest local error measurement. There is also a possibility to remove nodes: every λ iterations the node with the lowest utility for error reduction is removed. The main disadvantages of GNG are (cf. [1]): (a) in comparison with SOM it requires a larger number of control parameters to be tuned, (b) because of the fixed learning rate it lacks stability, and (c) a rather elaborate technique for visualizing the resulting graph must be invented.

An immune algorithm is able to generate the reference vectors (called antibodies), each of which summarizes basic properties of a small group of documents treated here as antigens7. This way, the clusters in the immune network spanned over the set of antibodies will serve as internal images, responsible for mapping clusters existing in the document collection into network clusters. In essence, this approach can be viewed as a successful instance of exemplar-based learning, giving an answer to the question "what examples to store for use during generalization, in order to avoid excessive storage and time complexity, and possibly to improve generalization accuracy by avoiding noise and overfitting" [26].

3.1 aiNet Algorithm for Data Clustering

The artificial immune system aiNet [10] mimics the processes of clonal selection, maturation and apoptosis [9] observed in the natural immune system. Its aim is to produce a set of antibodies binding a given set of antigens (i.e.

7 Intuitively, by antigens we understand any substance threatening proper functioning of the host organism, while antibodies are protein molecules produced to bind antigens. A detailed description of these concepts can be found in [9].



documents). The efficient antibodies form a kind of immune memory capable of binding new antigens sufficiently similar to those from the training set.

As in SOM and GNG, the antigens are repeatedly presented to the memory cells (which are matured antibodies) until a termination criterion is satisfied. More precisely, a memory structure M consisting of matured antibodies is initiated randomly with a few cells. When an antigen ag_i is presented to the system, its affinity aff(ag_i, ab_j) to all the memory cells is computed. The value of aff(ag_i, ab_j) expresses how strongly the antibody ab_j binds the antigen ag_i. From a practical point of view, aff(ag_i, ab_j) can be treated as a degree of similarity between these two cells8. The greater the affinity aff(ag_i, ab_j), the more stimulated ab_j is.

The idea of clonal selection and maturation translates into the next steps (here σ_d and σ_s are parameters). The cells which are most stimulated by the antigen are subjected to clonal selection (i.e. each cell produces a number of copies proportional to the degree of its stimulation), and each clone is subjected to mutation (the intensity of mutation is inversely proportional to the degree of stimulation of the mother cell). Only clones cl which can cope successfully with the antigen (i.e. aff(ag_i, cl) > σ_d) survive. They are added to a tentative memory M_t, and the process of clonal suppression starts: an antibody ab_j too similar to another antibody ab_k (i.e. aff(ab_j, ab_k) > σ_s) is removed from M_t. The remaining cells are added to the global memory M.

These steps are repeated until all antigens have been presented to the system. Next, the degree of affinity between all pairs ab_j, ab_k ∈ M is computed, and again too similar – in fact redundant – cells are removed from the memory. This step represents network suppression of the immune cells. Lastly, r% (one more parameter) of the worst individuals in M are replaced by freshly generated cells. This ends one epoch, and the next epoch begins, until a termination condition is met.

Among all the parameters mentioned above, the crucial one seems to be σ_s, as it critically influences the size of the global memory. Each memory cell can be viewed as an exemplar which summarizes important features of a "bundle" of antigens stimulating it.
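The sketch below condenses one aiNet epoch as described above into Python; the affinity convention (a similarity in [0, 1]), the mutation rule and all default parameter values are illustrative assumptions rather than the settings used in the experiments.

    import numpy as np

    def ainet_epoch(M, antigens, aff, n_best=3, sigma_d=0.5, sigma_s=0.9, r=0.05,
                    seed=0):
        # M: (cells, dim) array of antibodies; aff(a, b) is a similarity in [0, 1]
        rng = np.random.default_rng(seed)
        for ag in antigens:
            stim = np.array([aff(ag, ab) for ab in M])
            clones = []
            for i in np.argsort(stim)[-n_best:]:          # clonal selection
                k = rng.random() * (1.0 - stim[i])        # mutation inversely prop. to stimulation
                clones.append(k * ag + (1.0 - k) * M[i])  # directed mutation toward the antigen
            Mt = [c for c in clones if aff(ag, c) > sigma_d]   # efficient clones survive
            kept = []                                     # clonal suppression within Mt
            for c in Mt:
                if all(aff(c, other) <= sigma_s for other in kept):
                    kept.append(c)
            if kept:
                M = np.vstack([M, kept])
        keep = [i for i in range(len(M))                  # network suppression
                if all(aff(M[i], M[j]) <= sigma_s for j in range(i))]
        M = M[keep]
        mean_stim = np.array([np.mean([aff(ag, ab) for ag in antigens]) for ab in M])
        n_new = max(1, int(r * len(M)))                   # replace the r% worst cells
        M[np.argsort(mean_stim)[:n_new]] = rng.random((n_new, M.shape[1]))
        return M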

3.2 Identification of Redundant Antibodies

The clonal suppression stage requires |M_t|·(|M_t|−1)/2 computations of the affinity between all pairs of cells in M_t. To reduce the time complexity of this step we refer to the agglomerative clustering approach. The crucial concept here is to manage the matrix of distances in a smart way and to update only those

8 In practical applications this measure can be derived from any metric dissimilarity measure dist as aff(ag_i, ab_j) = (d_max − dist(ag_i, ab_j)) / d_max, where d_max stands for the maximal dissimilarity between two cells. Another possibility – used in our approach – is to assume that the affinity is inversely proportional to the distance between the corresponding molecules.



distances which have really changed after merging two clusters. Among many possible solutions, we have applied the so-called partial similarity matrix and the update algorithm presented in [14]. The authors have shown that the expected complexity of a single-step update is of the order of O(2·N·G·g), where N is the number of objects, G is the maximum number of clusters, and g << G is the maximal number of column rescannings and modifications after a cluster merging step. This is significantly less than the O(N³) complexity of a naive approach. Finally, the reduced antibodies are replaced by a single cell being the center of gravity of the set of removed antibodies. Thus, we not only reduce the size of the immune network, but presumably compress the information contained in a set of specialized antibodies into a new, universal antibody.

3.3 Robust Construction of Mutated Antibodies

In the case of high-dimensional data, such as text data represented in a vector space, the calculation of the stimulation level is quite costly (proportional to the number of different terms in the dictionary). Thus, the complexity of an immune algorithm can be significantly reduced if we can restrict the number of required expensive recalculations of the stimulation level. The direct, high-dimensional calculations can be replaced by operations on scalar values, on the basis of the simple geometrical observation that the stimulation of a mutated antibody clone can be expressed in terms of the original antibody's stimulation.

Such an optimization is based on the generalized Pythagoras theorem: if v1, v2, v3 are the sides of a triangle (v1 + v2 + v3 = 0), then |v3|² = |v1|² + |v2|² + 2·|v1|·|v2|·cos(v1, v2). We define a mutated clone m as m = κ·d + (1 − κ)·c, where c is the cloned antibody, d is the antigen (document), and κ is the random mutation level.

Taking advantage of this definition and of the Pythagoras theorem (with v1 := d′ = κ·d, v2 := c′ = (1 − κ)·c, v3 := −m), and having calculated the original antibody stimulation aff(c, d), we can calculate the mutated clone stimulation level aff(m, d) as follows. Let

P = cos(c′, d′) = cos(c, d) = 1 − aff(c, d)   (7)

and the scalar product

⟨c, d⟩ = P · |c| · |d|.   (8)

Then the norm of the mutated antibody is

|m|² = |d′|² + |c′|² + 2·P·|c′|·|d′| = κ²·|d|² + (1−κ)²·|c|² + 2·κ·(1−κ)·P·|c|·|d|.   (9)

Let us further define

s = κ·|d|² + (1 − κ)·⟨c, d⟩ = κ·|d|² + (1 − κ)·P·|c|·|d|.   (10)



Finally,

aff(m, d) = s / (|m| · |d|).   (11)
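In code, the scalar computation of equations (7)-(11) could look as follows; only the vector norms, the previously computed affinity and the mutation level κ are needed, with no high-dimensional vector operation (the function name is illustrative).

    import math

    def mutated_clone_affinity(norm_c, norm_d, aff_cd, kappa):
        # Scalar version of (7)-(11); follows the text's convention P = 1 - aff(c, d)
        P = 1.0 - aff_cd                                             # equation (7)
        m2 = (kappa ** 2 * norm_d ** 2                               # equation (9)
              + (1 - kappa) ** 2 * norm_c ** 2
              + 2 * kappa * (1 - kappa) * P * norm_c * norm_d)
        s = kappa * norm_d ** 2 + (1 - kappa) * P * norm_c * norm_d  # equation (10)
        return s / (math.sqrt(m2) * norm_d)                          # equation (11)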

Dually, we can find a mutation threshold κ such that the mutated antibody clone stimulation satisfies aff(m, d) < σ_d. Precisely, we look for the constant value K such that aff(m, d) = σ_d. Subsequently, K can be used to create a mutated antibody for a random mutation level κ ∈ (0, K). The advantage of such an approach is the reduction of the number of inefficient (too specific) antibodies which would otherwise be created and immediately removed from the clonal memory.

Analogously to the previous inference, if we define

p = aff(c, d),

x = −p·|d| + p²·|c| + σ_d²·(p·|d| − |c|),

y = |d|² − 2·p·|c|·|d| + p²·|c|² − σ_d²·(|d|² − |c|² + 2·p·|c|·|d|),

z = σ_d·|d|·√((p² − 1)·(σ_d² − 1)),

then the sought value of the mutation threshold K is

K = |c|·(x + z) / y.   (12)

3.4 Stabilization Via Time-Dependent Parameters

A typical problem with immune-based algorithms is stabilization of the size of the set of memory cells. This explains why we decided to use time-dependent parameters. For each parameter p we defined its initial value p0 and final value p1, as well as a time-dependent function f(t) such that p(t) = f(t), with p(0) = p0 and p(T) = p1, where T is the number of learning iterations.

In particular, both σ_s(t) and σ_d(t) are increased reciprocally, while m_b(t) – the number of clones produced by a cell – is linearly decreased with time:

σ(t) = σ_0 + (σ_1 − σ_0) · (t·(T+1)) / (T·(t+1))   and   m_b(t) = m_0 + ((m_1 − m_0)/T) · t,

where σ_0 = 0.05, σ_1 = 0.25 for σ_s(t); σ_0 = 0.1, σ_1 = 0.4 for σ_d(t); and m_0 = 3, m_1 = 1 for m_b(t).
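A direct transcription of these schedules into Python, for completeness (function names are illustrative):

    def sigma_schedule(sigma_0, sigma_1, t, T):
        # reciprocal schedule used for sigma_s(t) and sigma_d(t);
        # gives sigma(0) = sigma_0 and sigma(T) = sigma_1
        return sigma_0 + (sigma_1 - sigma_0) * t * (T + 1) / (T * (t + 1))

    def mb_schedule(m_0, m_1, t, T):
        # linearly decreasing number of clones produced by a cell
        return m_0 + (m_1 - m_0) * t / T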

3.5 Robust Antibody Search in Immune Network

One of the most computationally demanding parts of any AIS algorithm is the search for the best-fitted (most stimulated) antibodies. Especially in applications to web documents, where both the text corpus size and the number of cells in the immune (also called idiotypic) network are huge, the cost of even a single global search phase over the network is prohibitive.

Unfortunately, experiments showed that neither the local search method (i.e. searching through the graph edges of the idiotypic network from the last iteration's starting cell) nor the joint-winner search method (our own approach devised for SOM learning [16]) are directly applicable to idiotypic networks.



We propose replacing the global search approach with a modified local search9. The modification relies upon remembering the most stimulated cell for more than one connected component of the idiotypic network and conducting in parallel a single local-winner search thread for each component. Obviously, this requires a once-per-iteration recalculation of connected components, but this is not very expensive: the complexity of this process is of order O(V + E), where V is the number of cells and E is the number of connections (graph edges).

A special case is the possibility of an antibody removal during the tth learning iteration. When the previous iteration's most stimulated antibody for a given document (antigen) has been removed from the model (i.e. the current system's memory), we activate search processes (in parallel threads) from each of its direct neighbors in the previous iteration's graph.

We have also developed another, slightly more complicated but more accurate method. It exploits the well-known Clustering Feature Tree (CF-Tree, [27]) to group similar network cells into dense clusters. Antibody clusters are arranged in a hierarchy and stored in a balanced search tree. Thus, finding the most stimulated (most similar) antibody for a document requires O(log_t V) comparisons, where t is the tree branching factor (refer to [27] for details). The amortized tree structure maintenance cost (insertion and removal) is also proportional to O(log_t V).

3.6 Adaptive Visualization of the AIS Network

Despite many advantages over the SOM approach, AIS networks have one serious drawback: high-dimensional networks cannot be easily visualized. In our approach, the immune cells are projected onto a regular Kohonen grid. To initialize such a grid properly, a given group of documents is divided into a small number of disjoint groups (main topics) using the fast ETC Bayesian tree [15]. The centers of the main topics are uniformly spread over the map surface, and the remaining cells of the grid are initialized with intermediate topics, calculated as the weighted average of the main topics, with weights proportional to the Euclidean distance from the corresponding cells representing main topics. This way, geographical neighborhood on the grid corresponds to graphical neighborhood in the immune network.

After initialization, the map is trained with the standard SOM algorithm [19]. Finally, we adopt the attraction-repelling algorithm [23] to adjust the positions of AIS antibodies on the SOM projection map, so that distance on the map reflects as closely as possible the similarity of the adjacent cells. The topical initialization of the map is crucial here to assure the stability of

9 In such a procedure, searching for the most stimulated antibody in the tth iteration starts from the most stimulated antibody identified in the (t−1)th iteration, and then the graph edges of the idiotypic network are traversed appropriately.



the final visualization [16]. The resulting map visualizes the AIS network with a resolution depending on the SOM size (a single SOM cell can gather more than one AIS antibody).

4 Histograms in Vector Spaces

As stated in the previous sections, the coordinate value referring to term t_i in the vector d_j representing a whole document is equal to the value of the pondering (term-weighting) function f(t_i, d_j). This function may ponder the term globally, like the tfidf weight defined in (1), or locally, like the contextual function f_G(t_i, d_j) = w_{tdG} defined in (4).

In the following subsection we extend this representation with information about the distribution of pondering function values for individual dimensions in the vector space. Subsequently we describe possible applications of this information.

4.1 Distributions of the Function Pondering the Terms

Properties of each term can be considered individually (for a single document), or in the broader context of a given (sub)set of documents D. In the latter case we can consider the values of the pondering function for a given term t over all documents d ∈ D as observed values of a random variable with an underlying continuous probability distribution. In practical cases the continuous distribution is approximated by a discrete one, so that the information about the random variable distribution for the term t can be summarized as a histogram H_{t,D}.

Let us consider a document d ∈ D and the pondering function f. We represent this document by a normalized vector d = [f′_{0,d}, . . . , f′_{T,d}], where f′_{t,d} = ‖d‖⁻¹ · f_{t,d} for t = 0, . . . , T. After normalization, all the documents are located within the unit hypercube [0, 1]^T.

For a fixed number Q_{t,D} of intervals of the histogram H_{t,D}, we define the discretization ∆_{t,D} : [0, 1] → {0, . . . , Q_{t,D} − 1}, i.e. the transformation of the normalized pondering function value into the index of an interval.

In the simplest case it can be a uniform split of the interval [0, 1] into segments of equal length, ∆_{t,D}(f′_{t,d}) = ⌊(Q_{t,D} − 1) · f′_{t,d}⌋. An efficient discretization, however, should take into account the fact that the pondering function for a fixed term takes values in only a subset of the unit interval (as in the case of splitting the set of documents into homogeneous subsets, as done in the contextual approach). An optimal discretization should also approximate a quantile-based split of the underlying pondering function distribution.

Having defined the discretization ∆_{t,D} and a fixed interval q, let us define the characteristic function:

χ(∆_{t,D}(f′_{t,d}), q) = 1 if ∆_{t,D}(f′_{t,d}) = q, and 0 otherwise. (13)


Then we compute the value assigned to the interval q of the histogram H_{t,D} for the term t in the document collection D as:

H_{t,D}(q) = ∑_{d∈D} χ(∆_{t,D}(f′_{t,d}), q). (14)

So individual intervals of the histogram H_{t,D} represent the number of occurrences of a discretized value of the pondering function f for the term t in the document collection D. The interval values can in turn be transformed to relative frequencies via the plain normalization H′_{t,D}(q) = H_{t,D}(q)/T_{t,D}, where T_{t,D} = ∑_{q∈Θ} H_{t,D}(q) is the total number of documents d ∈ D containing the term t. The frequency distribution approximates the probability distribution of the unknown variable describing the weight of occurrence of the term t in a randomly chosen document d ∈ D. A term not occurring in any document of the collection will be represented by an "empty histogram", having zero frequencies assigned to all intervals.
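To make the construction concrete, the following minimal Python sketch builds the histograms H_{t,D} and their normalized variants H′_{t,D} for a document collection, using the uniform discretization described above. The helper names and the representation of documents as term dictionaries are our assumptions, not the authors' implementation:

import math

def build_histograms(docs, vocab, ponder, n_bins=10):
    # docs   : list of documents, each a dict {term: raw data for ponder}
    # vocab  : iterable of terms spanning the vector space
    # ponder : pondering function f(term, doc) -> non-negative float (e.g. tfidf)
    H = {t: [0] * n_bins for t in vocab}
    for doc in docs:
        weights = {t: ponder(t, doc) for t in doc}
        # normalized coordinates f'_{t,d} = f_{t,d} / ||d||, all in [0, 1]
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        for t, w in weights.items():
            q = math.ceil((n_bins - 1) * (w / norm))  # uniform discretization
            H[t][q] += 1  # one unit count per document, as in (13)-(14)
    # relative frequencies H'_{t,D}(q) = H_{t,D}(q) / T_{t,D}
    H_norm = {}
    for t, hist in H.items():
        total = sum(hist)
        H_norm[t] = [v / total for v in hist] if total else list(hist)
    return H, H_norm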

Note that the definition of the characteristic function (13) can be generalized to a more elaborate variant, in which adding a document to a context is associated with:

• The degree of similarity of the document to the context (cf. Sect. 4.4)
• The document quality (as an aggregation of quality values for the terms contained in the document)
• Both of the above-mentioned factors

In this general case, instead of unit counts, the respective counts related to the intervals of individual histograms would be increased by a quantity proportional to the document quality and the degree of membership to a given context. Such an approach also makes it possible to take into account the fuzzy membership of documents which cannot be assigned to a unique context.

4.2 Practical Aspects of Usage of Histograms

The collection of histograms H_{t,D} for t ∈ T_D, obtained from the document collection D via the previously described transformation, can be treated as aggregated information on the (sub)space in which the documents reside. Subsequently we show how this information can be exploited.

First we shall stress that the maintenance of the histograms is not a big burden for computer memory, even for large document collections, due to:

• The initial reduction of term space dimensionality [17]
• The additional reduction and clustering of terms in the case of contextual processing
• The compact carrier (set of non-zero valued histogram elements) for the majority of terms


This implies that histograms may be represented by sparse matrices or (in a still more efficient way) by cyclic tables indexing all intervals with non-zero frequencies (carrier compactness assumption). For example, a full histogram description (histograms with the dictionary) for 20,000 documents from the 20 NewsGroups collection divided into 20 contextual groups requires as little as 6 MB RAM (about 300 KB for a single context; the number of terms spanning the vector space: 7,030).

4.3 Significance of a Term in a Context

With the rapid growth of the dictionary size T, the most important task is the identification of the most significant terms, those most strongly affecting the clustering and classification of documents as well as the description of the resulting clusters (keyword extraction). Also the impact of irrelevant terms needs to be bounded, since their number grows much more rapidly than the number of significant terms.

The first stage in differentiating term significance is the dictionary reduction process. It is a kind of "binary" differentiation: non-significant terms are simply removed from further stages of document processing. The dictionary reduction can be conducted in two phases: a global and a contextual one.

Besides dictionary reduction (removal of the least important terms), the introduction of the contextual pondering function (see (4) in Sect. 2) also leads to a diversification of the significance of the remaining terms. We are also interested in a similar diversification expressed as a function of features of term histograms. The advantage of such a term description, as we will see in subsequent sections, is not only its effective application in tasks like document classification or the identification of keywords for clusters of documents and contexts, but also the possibility of dynamic adaptation, understood here as the change of the pondering function (hence also of the vector representation of documents) in parallel with the process of incremental clustering.

Intuitively, less significant terms are represented by histograms with the following features:

• A high value of kurtosis (a histogram with high peaks), which is especially visible for terms that are frequent and occur uniformly in the document collection, hence are less characteristic
• The domain (the carrier) of the histogram is relatively "compact", having few intervals with non-zero coordinates, meaning a low variability of pondering function values
• Non-zero values occur only for intervals with low indices (corresponding to low values of the pondering function)
• The term appears in just a few documents or in almost every document (e.g. function words)

Dually, the significant terms are those that are not too common (but also not too rare), have a high variability of the values of the pondering function, with many non-zero intervals, and at the same time the pondering function values are high (non-zero values of intervals with high indices).

Therefore, we define the significance of a term t in a given context, determined by the set of documents D, as follows:

w_{t,D} = (1/Q_{t,D}) ∑_{q∈Θ} (q + 1) · log(H_{t,D}(q)). (15)

The weight w_{t,D} takes its values in the interval [0, ∑_{q∈Θ} log(H_{t,D}(q))], being a subset of [0, T_{t,D}].

The above measure has the additional advantage that it can be computed at low computational cost and can be updated in O(|d|) time (where |d| is the number of distinct terms in the document d) when a document appears in or disappears from a given subspace or context. This is a very important property in the case of incremental clustering of dynamically changing text/web data (see also Sect. 4.5).
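As an illustration, a direct computation of (15) in Python might look as follows (a minimal sketch; we assume that intervals with zero counts are skipped so that the logarithm stays defined, which the formula does not state explicitly):

import math

def term_significance(hist):
    # hist: list of interval counts H_{t,D}(q), q = 0, ..., Q-1
    # Intervals with zero counts are skipped (our assumption: log(0) is undefined).
    Q = len(hist)
    if Q == 0:
        return 0.0
    return sum((q + 1) * math.log(h) for q, h in enumerate(hist) if h > 0) / Q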

4.4 Determining the Degree of Membership of a Document to a Context

A document fits well to a given contextual subspace if the distribution of some measurable features of term occurrence is typical for the "majority" of documents in this space. Generally, we can look here at features like correlations or co-occurrences of some terms, or location-based statistics (e.g. the deviation of distances between repeated occurrences of a term in the document content from the expected number of occurrences under the assumption of term indifference to a given context; key term occurrence patterns should differ from those of functional words).

Qualitative features could potentially be taken into account, like style characteristics (dominant usage of a synonym or non-typical inflection) or even features not directly related to the textual content (e.g. the link structure between hypertext documents). In this paper we restrict ourselves to the analysis of the frequency of term occurrence and to a definition of "typicality" based on histograms of the pondering function for individual terms in a given context. Hence we can talk about an approach similar to statistical maximum likelihood estimation, in which, based on observed values, we construct a parametric function approximating the parameters of an unknown conditional probability distribution f ∝ P(D|Θ). The likelihood function should maximize the probability of the observed data, and on the other hand it should assign high values to unseen data similar to the ones in the training sample (the "generalization" capability of the model). We proceed in the same way in our case. A document is considered "typical" if the values of the pondering function for the majority of its terms are frequent ones in the given context.

Additionally, to avoid domination of the aggregated term-based function evaluating document "typicality" by less important (but more numerous) terms, the aggregation should take into account the formerly defined term significance in a given context. Therefore, the similarity (degree of membership) of the document d to the context determined by the document collection D is defined as follows:

m_f(d′, D) = ( ∑_{t∈d′} w_{t,D} · H′_{t,D}(q) ) / ( ∑_{t∈d′} w_{t,D} ), (16)

where w_{t,D} is the significance of a term (15), H′_{t,D} is the normalized histogram for the term t (see Sect. 4.1), and q = ∆_{t,D}(f_D(t, d′)) is the index of the interval, determined for a fixed normalized pondering function f_D and the discretization ∆_{t,D} (which transforms the value f_D(t, d′) into the index q). The function m_f(d′, D) takes its values in [0, 1].

It should be noted that the cost of computing the similarity function m_f(d′, D) is O(|d′|): it is proportional to the number of distinct terms in the document and equal to the complexity of the cosine measure calculation.
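A minimal sketch of (16) in Python follows; the argument names and the discretize callback are our assumptions:

def membership(doc_terms, W, H_norm, discretize):
    # doc_terms  : dict {term: normalized pondering value f'_{t,d'}}
    # W          : dict {term: significance w_{t,D}}, cf. (15)
    # H_norm     : dict {term: normalized histogram H'_{t,D}}
    # discretize : function (term, value) -> interval index q
    num = den = 0.0
    for t, value in doc_terms.items():
        if t not in W:  # term unknown in this context
            continue
        q = discretize(t, value)
        num += W[t] * H_norm[t][q]
        den += W[t]
    return num / den if den else 0.0

A document is then assigned to the context maximizing this value, e.g. max(contexts, key=lambda D: membership(doc, W[D], H_norm[D], disc[D])).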

Having determined the similarity of a document to the individual contexts in the contextual model, we obtain a vector of fuzzy memberships of the document to the contexts, similarly to known methods of fuzzy clustering (e.g. Fuzzy-ISODATA). In the next section we explain how such a vector is used to achieve incremental updates of the contextual clustering model.

4.5 Incremental Adaptation of Contexts

As the topic distribution within the stream of documents changes dynamically in time (e.g. some Internet or intranet documents appear, disappear or have their content modified), the contextual clustering models have to be adapted correspondingly. Such adaptation is performed both at the level of individual documents and at the level of document clusters, represented by antibodies (or GNG cells, in the case of the GNG-based model). So a new document can be assigned to a context, and within it to an antibody. A modified document may be moved from one antibody to another, in the same or in another context. As a result, antibodies may no longer fit their original contextual immune model and it may be necessary to move them elsewhere, as a side effect of the so-called reclassification.

When a single new document appears, its similarity to every existing context is calculated via equation (16) and the document is assigned to its "most similar" context10. Whenever a document is modified, it may eventually be removed from its previous context and assigned to a new one.

An important aspect of context adaptation is that the measure of contextual term importance (cf. (15)) can be efficiently updated as documents are added to or removed from a given context. A constant-time update of the importance of each term t which appears in document d requires only keeping the numerator and denominator of equation (15) separately and updating them adequately. The denominator is increased or decreased by one, while the numerator by i + 1, where i is the index of the updated interval in the histogram H_{t,D}. The index i is computed by the discretization ∆_{t,D}(f′_{t,d}) (cf. Sect. 4.1).

10 One could also consider the assignment of a single document to more than one context, i.e. fuzzy assignment.
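Taking the description above literally, the constant-time update can be sketched as a running numerator/denominator pair (this follows the update rule as stated, i.e. a log-free variant of the components of (15); a sketch, not the authors' code):

class TermImportance:
    def __init__(self):
        self.num = 0.0  # running numerator
        self.den = 0.0  # running denominator

    def add(self, i):
        # a document containing the term enters the context;
        # i is the index of the affected histogram interval
        self.num += i + 1
        self.den += 1

    def remove(self, i):
        # a document containing the term leaves the context
        self.num -= i + 1
        self.den -= 1

    def value(self):
        return self.num / self.den if self.den else 0.0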

After any of the contextual models has converged to a stable state, the reclassification procedure is applied. Each document group is represented by a reference vector within an antibody, which can be treated as a pseudo-document d_{v_i}. The similarity of d_{v_i} to every other (temporarily fixed) context is calculated via (16). If the "most similar" context is different from the current context, then the antibody (with its assigned documents) is relocated to the corresponding contextual model. The relocated antibody is connected to the most stimulated antibody in the new immune model (and eventually merged with it in the subsequent phases).

There is no room to go into details, so we only mention that a whole context can also eventually be incorporated into some other context, on the basis of our between-context similarity measure based on the Hellinger divergence. Finally, we obtain an incremental text-data meta-clustering model, based both on the adaptive properties of the modified clustering model (within-context adaptation, [8]) and on dynamically modified contexts, which allows for clustering scalability and adaptation also at the inter-context level.
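The per-term Hellinger distance can be computed directly from the normalized histograms; a minimal sketch is given below. The averaging over the union of the two vocabularies is our assumption, as the chapter does not spell out the aggregation:

import math

def hellinger(p, q):
    # Hellinger distance between two discrete distributions of equal length
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def context_distance(H1, H2, n_bins):
    # H1, H2: dict {term: normalized histogram}; a term absent from a
    # context is treated as an empty (all-zero) histogram
    empty = [0.0] * n_bins
    terms = set(H1) | set(H2)
    if not terms:
        return 0.0
    return sum(hellinger(H1.get(t, empty), H2.get(t, empty))
               for t in terms) / len(terms)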

5 Experimental Results

In the following sections, the overall experimental design as well as the quality measures are described. Since an immune network can be treated both as a clustering and a meta-clustering (clusters of clusters) model, besides commonly used clustering quality measures (unsupervised and supervised) we have also investigated the immune network structure. The discussion of the results is given in Sects. 5.4–5.8.

5.1 Quality Measures of the Clustering

Various measures of quality have been developed in the literature, covering diverse aspects of the clustering process. The clustering process is frequently referred to as "learning without a teacher", or "unsupervised learning", and is driven by some kind of similarity measure.

The optimized criterion is intended to reflect some aesthetic preferences, like a uniform split into groups (topological continuity) or an appropriate split of documents with a known a priori categorization. As the criterion is somehow hidden, we need to test whether the clustering process really fits the expectations. In particular, we have accommodated for our purposes and investigated the following well-known clustering quality measures [6, 28]:

Average document quantization. Average cosine distance (dissimilarity) for the learning set between a document and the cell it was classified into. The goal is to measure the quality of clustering at the level of a single cell:

AvgDocQ = (1/|C|) ∑_{c∈C} ( (1/|D_c|) ∑_{d∈D_c} dist(d, c) ),

where D_c is the set of documents assigned to the cell c.

This measure takes values in the [0, 1] interval; lower values correspond to smoother inter-cluster transitions and more compact clusters. The two subsequent measures evaluate the agreement between the clustering and the a priori categorization of documents (i.e. the particular newsgroup in the case of newsgroup messages).

Average weighted cluster purity. Average "category purity" of a cell (the cell weight is equal to its density, i.e. the number of assigned documents):

AvgPurity = (1/|D|) ∑_{c∈C} max_k |D_{k,c}|,

where D is the set of all documents in the corpus and D_{k,c} is the set of documents from category k assigned to the cell c. Similarly, an Average Weighted Cluster Entropy measure can be calculated, where the |D_{k,c}| term is replaced with the entropy of the category frequency distribution.

Normalized mutual information. The quotient of the entropy with respect to the category and cluster frequencies to the square root of the product of the category and cluster entropies for individual clusters [6]:

NMI = ( ∑_{c∈C} ∑_{k∈K} |D_{k,c}| log( |D_{k,c}| |D| / (|D_c| |D_k|) ) ) / sqrt( ( ∑_{c∈C} |D_c| log(|D_c|/|D|) ) ( ∑_{k∈K} |D_k| log(|D_k|/|D|) ) ), (17)

where C is the set of graph cells, D is the set of all documents in the corpus, D_c is the set of documents assigned to the cell c, D_k is the set of all documents from category k, and D_{k,c} is the set of documents from category k assigned to the cell c.

Again, both measures take values in the [0, 1] interval. The higher the value, the better the agreement between the clusters and the a priori given categories.
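Both measures can be computed from a single cluster-category contingency table; a plain-Python rendering of Average Weighted Cluster Purity and of (17) (a sketch, not the authors' code) is:

import math

def purity_and_nmi(counts):
    # counts[c][k] = number of documents of category k in cluster c
    N = sum(sum(row) for row in counts)
    Dc = [sum(row) for row in counts]        # cluster sizes |D_c|
    Dk = [sum(col) for col in zip(*counts)]  # category sizes |D_k|
    purity = sum(max(row) for row in counts) / N
    mi = sum(n * math.log(n * N / (Dc[c] * Dk[k]))
             for c, row in enumerate(counts)
             for k, n in enumerate(row) if n > 0)
    hc = sum(d * math.log(d / N) for d in Dc if d > 0)
    hk = sum(d * math.log(d / N) for d in Dk if d > 0)
    nmi = mi / math.sqrt(hc * hk) if hc and hk else 0.0
    return purity, nmi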

5.2 Quality of the Immune Network

Besides the clustering structure represented by cells, the idiotypic network should also be treated as a meta-clustering model. Similarity between individual clusters is expressed by graph edges linking the referential vectors in antibodies. Thus, there is a need to evaluate the quality of the edge structure.

There are a number of ways to evaluate the idiotypic model structure. In this paper we present the one which we have found the clearest for interpretation. This approach is based on the analysis of the edge lengths of the minimal spanning tree (MST) constructed over the set of antibodies in each iteration of the learning process.


5.3 Histogram-Based Reclassification Measures

Each contextual model (as well as each subgraph or map area) represents some topically consistent (meta-)cluster of documents. Traditionally, such a cluster is represented by a single element (e.g. a centroid, a medoid, a reference vector in GNG/SOM, an antibody in the immune model). An alternative representation of a group of documents has been presented in Sect. 4. It has been shown in Sect. 4.2 that both the computational and the space complexity of such a representation are low. It has numerous advantages, such as abandoning the assumption of spherical cluster shapes and efficient adaptation during incremental learning on dynamically modified data sets. It also allows for the construction of various measures for the evaluation of subspace clusters. Here we focus on only one such measure, evaluating the reclassification properties of contextual groups.

Reclassification aims at measuring the stability of the existing structure of the clustering model (both on the meta-level of contexts and on the level of document groups in some subgraphs and map areas). Reclassification also measures the consistency of the histogram-based subspace description with the model-based clustering. For a fixed clustering structure (e.g. some split of the document collection into contexts) we can describe each cluster by a set of histograms, as in Sect. 4.1. Having such histograms built, we can classify each document to its "most similar" histogram-based space, as in Sect. 4.4. Finally, we can investigate the level of agreement between the original (model-based) and the new (histogram-based) grouping.

In the ideal case we expect both groupings to be equal. To assess how far from the ideal agreement the two groupings are, we construct a contingency table. Since the group indexes in the original and the new grouping are left unchanged, correctly reclassified objects appear on the diagonal of the contingency matrix. Finally, we can calculate the measures traditionally used for the evaluation of classifier performance (precision, recall, F-statistics, etc.).

In the general case, to take the meta-clustering information into account, we discriminate between different kinds of misclassifications. Since contexts and subspaces are also similar on the meta-level, and this similarity is reflected by the graph edges, we exploit a shortest-path algorithm for weighted graphs with non-negative weights (Dijkstra's algorithm). As in the binary agreement approach, a proper reclassification is assigned a distance equal to 0. Each improper reclassification gets a distance equal to the sum of the edge weights on the shortest path between the model-based and the histogram-based cluster (i.e. context or cell in the graph).
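A sketch of this graph-aware reclassification cost, with Dijkstra's algorithm implemented over an adjacency-list graph (the data layout is our assumption):

import heapq

def dijkstra(adj, src):
    # adj[u] = list of (v, weight) pairs, weights non-negative
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def reclassification_cost(pairs, adj):
    # pairs: (model_based_cluster, histogram_based_cluster) per document;
    # a proper reclassification (equal clusters) contributes 0
    total = 0.0
    for model_c, hist_c in pairs:
        if model_c != hist_c:
            total += dijkstra(adj, model_c).get(hist_c, float('inf'))
    return total / len(pairs) if pairs else 0.0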

5.4 Experimental Settings

The architecture of the BEATCA system supports comparative studies of clustering methods at various stages of the process (i.e. initial document grouping, initial topic identification, incremental clustering, graph model projection to a 2D map and visualization, identification of topical areas on the map and their labeling) – consult [18] for details. In particular, we conducted a series of experiments to compare the quality and stability of the AIS, GNG and SOM models for various model initialization methods, cell/antibody search methods and learning parameters [18]. In this paper we focus only on the evaluation and comparison of the immune models.

This study required manually labelled documents, so the experiments were executed on the widely-used 20 Newsgroups document collection11 of approximately 20 thousand newsgroup messages, partitioned into 20 different newsgroups (about 1,000 messages each). As a data preprocessing step in the BEATCA system, entropy-based dimensionality reduction techniques are applied [16], so the training data dimensionality (the number of distinct terms used) was 4,419.

Each immune model has been trained for 100 iterations, using the previously described algorithms and methods: context extraction (Sect. 2), agglomerative identification of redundant antibodies [18], robust construction of the mutated antibodies (Sect. 3.3), time-dependent parameters (Sect. 3.4) and the CF-tree based antibody search method [7].

5.5 Impact of the Time-Dependent Parameters

In the first two series of experiments, we compared models built with the time-dependent parameters σs(t) and σd(t) with constant, a priori defined values of σs and σd. As a reference case we took a model where σs(t) was changed from the initial value 0.05 up to 0.25 and σd(t) from 0.1 up to 0.4 (cf. Sect. 3.4).

First, we compare the reference model and four models with constant σd. The parameter σs has been changed identically as in the reference model. The values of σd varied from the starting value in the reference model (0.1) up to the final value (0.4) in steps of 0.1. The results12 are presented in Fig. 1.

Figure 1a presents the variance of the edge length in the minimal spanning tree built over the set of antibodies in the immune memory in the ith iteration of the learning process. At first glance one can notice the instability of this measure for high values of σd. Comparing stable values, we notice that the variance for the reference network has the highest value. It means that the idiotypic network contains both short edges, connecting clusters of more similar antibodies, and longer edges, linking more distant antibodies, probably stimulated by different subsets of documents (antigens). Such a meta-clustering structure is desirable and preferred over networks with equidistant antibodies (and, thus, low edge length variance).

Fig. 1. Time-dependent σd: (a) edge length variance, (b) network size, (c) quantization error, (d) learning time

11 http://people.csail.mit.edu/jrennie/20Newsgroups/.
12 All figures present average values of the respective measures over the 20 contextual nets.

Comparing the network sizes (Fig. 1b) and the quantization error (Fig. 1c), we observe that for the highest values of σd the set of antibodies reduces to just a few entities; on the other hand, for the lowest values almost all antibodies (universal and over-specialized) are retained in the system's memory. It is not surprising that the quantization error for a huge network (e.g. σd = 0.1) is much lower than for smaller nets. Still, the time-dependent σd(t) gives a similarly low quantization error for a moderate network size. Also, both measures stabilize quickly during the learning process. The learning time (Fig. 1d) is – to some extent – a function of the network size. Thus, for the reference model, it is not only low but also very stable over all iterations.

In the next experiment – dually – we compare the reference model and another five models with constant σs (and varying σd). Analogously to the first case, the values of σs varied from the initial value 0.05 up to the final value in the reference model (0.25) in steps of 0.05. The results are presented in Fig. 2. Due to space limitations, we restrict the discussion of the results to the conclusion that also in this case the time-dependent parameter σs(t) had a strong, positive influence on the resulting immune model.

A weakness of the approach seems to be the difficulty in selecting appropriate values of the parameters for a given dataset. We investigated changes to the values of both parameters independently, but it turns out that they should be changed "consistently"; that is, the antibodies should be neither removed too quickly nor aggregated too quickly. However, once found, there is a justified hope that for an incrementally growing collection of documents the parameters do not need to be sought anew, but rather gradually adapted.

Fig. 2. Time-dependent σs: (a) edge length variance, (b) network size, (c) quantization error, (d) learning time


5.6 Scalability and Comparison with Global Models

Comparing the hierarchical and contextual models described in Sect. 2 with a "flat", global model, the most noticeable difference is the learning time13. The total time for the 20 contextual networks accounted for about 10 min, against over 50 min for the hierarchical network and almost 20 h (sic!) for the global network. Another disadvantage of the global model is the high variance of the learning time of a single iteration as well as of the size of the network. The learning time varied from 150 to 1,500 s (10 times more!) and the final network consisted of 1,927 antibodies (two times more than for the contextual model). It should also be noted that in our experimental setting each model (local and global) has been trained for 100 iterations, but it can be seen (e.g. Fig. 5) that the local model stabilizes much faster. Recalling that each local network in the hierarchy can be processed independently and in parallel, this makes the contextual approach a robust and scalable14 alternative to the global immune model.

One of the reasons for such differences in the learning time is the representation of antibodies in the immune model. The referential vector in an antibody is represented as a balanced red-black tree of term weights. If a single cell tries to occupy "too big" a portion of the document-term vector space (i.e. it covers documents belonging to different topics), many terms which rarely co-occur in a single document have to be represented by a single red-black tree. Thus, it becomes less sparse and – simply – bigger. On the other hand, better separation of terms which are likely to appear in various topics and the increasing "crispness" of topical areas during model training lead to faster convergence and better models, in terms of the previously defined quality measures. While the quantization error is similar for the global and the contextual model (0.149 vs. 0.145, respectively), both supervised measures – showing the correspondence between document labels (categories) and the clustering structure – are in favor of the contextual model. The final value of the Normalized Mutual Information was 0.605 for the global model and 0.855 for the contextual model, and the Average Weighted Cluster Purity: 0.71 vs. 0.882, respectively.

One can also observe the positive impact of the homogeneity of the distribution of term frequencies in documents grouped to a single antibody. Such homogeneity is – to some extent – acquired by the initial split of a document collection into contexts. Another cause of the learning time reduction is the contextual reduction of the vector representation dimensionality, described in Sect. 2.

It can be seen that the model stabilizes quite fast; actually, most models converged to the final state in less than 20 iterations. The fast convergence is mainly due to the topical initialization. It should also be noted here that a proper topical initialization can be obtained for well-defined topics, which is the case in the contextual model.

13 By learning time we understand the time needed to create an immune memory consisting of the set of antibodies representing the set of antigens (documents).
14 Especially with respect to the growing dimensionality of data, which – empirically – seems to be the most difficult problem for the immune-based approach.


Fig. 3. Global model: (a) edge length distribution, (b) clique average edge length, (c) quantization error, (d) learning time

Fig. 4. Immune model vs. SOM: (a) quantization error, (b) SOM (MST on SOM grid) edge length distribution, (c) average edge length

We have also executed experiments comparing the presented immune approach with SOM models: a flat one (i.e. the standard, global Kohonen map) and our own variant of the contextual approach – the hierarchy of contextual maps (C-SOM). To compare the immune network structure with the static grid of the SOM model, we have built a minimal spanning tree on the SOM grid. A summary of the results can be seen in Fig. 4. Again, the global model turned out to be of lower quality than both the contextual SOM and the contextual AIS model. Similarly to the global immune model, also in this case the learning time (over 2 h) was significantly higher than for the contextual models. Surprisingly, the average edge in the contextual SOM model was much longer than in the case of the contextual immune network and the standard SOM, which may be the result of the limitations of the rigid model topology (the 2D grid). The discussion of the edge length distribution (Fig. 4b) is deferred to Sect. 5.8.

5.7 Contextual Vs. Hierarchical Model

The next series of experiments compared the contextual model with the hierarchical model. Figures 5a,b present the network sizes and the convergence (w.r.t. the Average Document Quantization measure) of the contextual model (represented by the black line) and the hierarchical model (grey line).

Although convergence to the stable state is fast in both cases and the quantization error is similar, it should be noted that this error is achieved with a noticeably smaller network in the contextual case (and in a shorter time, as mentioned in the previous section).

Fig. 5. Contextual vs. hierarchical model: (a) network size, (b) quantization error

Fig. 6. Edge length distribution: (a) complete, (b) contextual, (c) hierarchical, (d) global net

However, the most significant difference is the generalization capability of the two models. For this experiment, we partitioned each context (group of documents) into training and test subsets (in the proportion 10:1). The training documents were used during the learning process only, while the quantization error was computed for both subsets. The results are shown in Fig. 5b – the respective learning data sets are depicted with black lines, the test data sets with grey lines. Although the quantization errors for the learning document sets are similar, the difference lies in the test sets, and the hierarchical network is clearly overfitted. Again, there is no room to go into a detailed study here, but it can be shown that this undesirable behavior is the result of the noisy information brought in by additional terms, which finally appear not to be meaningful in the particular context (and thus are disregarded in the contextual weights w_{dtG}).

5.8 Immune Network Structure Investigation

To compare the robustness of the different variants of immune-based models, in each learning iteration, for each of the immune networks – contextual (Fig. 6b), hierarchical (Fig. 6c), global (Fig. 6d) and the MST built on the SOM grid (Fig. 4c) – the distributions of the edge lengths have been computed. Next, the average length u and the standard deviation s of the length have been calculated, and the edges have been classified into five categories depending on their length l: shortest edges with l ≤ u−s, short with l ∈ (u−s, u−0.5s], medium with l ∈ (u−0.5s, u+0.5s], long with l ∈ (u+0.5s, u+s] and very long edges with l > u+s (a sketch of this categorization follows).
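A minimal Python rendering of this five-way categorization (standard library only; the function name is ours):

from statistics import mean, stdev

def edge_categories(lengths):
    # returns counts for (shortest, short, medium, long, very long)
    u, s = mean(lengths), stdev(lengths)
    bounds = [u - s, u - 0.5 * s, u + 0.5 * s, u + s]
    counts = [0] * 5
    for l in lengths:
        # index of the first upper bound that l does not exceed; 4 if l > u+s
        idx = next((i for i, b in enumerate(bounds) if l <= b), 4)
        counts[idx] += 1
    return counts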

Additionally, in Fig. 6a, we can see the average length of the edges for the hierarchical and contextual immune networks (dashed and solid black lines, respectively) and for the complete graphs on both models' antibodies (cliques, depicted with grey lines). Actually, in both cases a clustering structure has emerged and the average length of an edge in the immune network is much lower than in the complete graph. However, the average length for the contextual network is lower, whereas the variance of this length is higher. This signifies a more explicit clustering structure.

There are quite a few differences in the edge length distributions. One can notice that in all models the number of shortest edges diminishes with time. This is coherent with the intention of gradual elimination of the redundant antibodies from the model. However, such elimination is much slower in the case of the global model, which is another reason for its slow convergence and high learning time. Also in the case of the SOM model, which has a static topology where no removal of inefficient cells is possible, we can see that the model slowly reduces the number of redundancies, represented by too similar referential vectors.

On the extreme side, the dynamics of the longest edges' distribution is similar in the case of the contextual and the global model, but distinct in the case of the hierarchical model. The last contains many more very long edges. Recalling that the variance of the edge lengths has been low for this model and the average length has been high, we can conclude that the hierarchical model is generally more discontinuous. The same is true for the SOM model, which is another indication of the imperfection of the static grid topology.

6 Related Work

The application of artificial immune systems, especially of the aiNet algorithm, to document clustering is not new. For example, [22] integrates principal component analysis (PCA) with the original aiNet to reduce the time complexity, with results preferential to hierarchical agglomerative clustering and K-means. A novel hierarchical method of immune system application to text processing was suggested in [5]. In our opinion, our method offers a radical improvement, reducing the space dimensionality not only formally, but also intrinsically – in terms of the complexity of the document description (we do not generate denser description vectors like the PCA). An important difference to both of these approaches is our histogram representation of document space dimensions. It allows us to represent not only the document affinity, but also the diversity.


The aiNet approach to document clustering belongs to a wider family of clustering techniques that can be called "networked clustering", as the clusters are not independent, but rather form a network of more or less related items. Other prominent technologies in this class include self-organizing maps [19], growing neural gas [11] and similar techniques. Like aiNet, this category of systems also suffers from problems of scalability. Our idea of contextual clustering can serve as a remedy for the performance of these systems as well.

The networked clustering methods generally lack good measures of quality. Neither the supervised (external) quality measures nor the unsupervised (internal) ones derived for independent clusters reflect the quality of the link structure between the clusters. Kohonen therefore proposed map continuity measures, based on differences between neighboring map points. In this chapter we presented a generalization of his proposals, taking into account structures other than the pure planar map.

Let us note that some document classification methods, like the ones based on Quinlan's ID3, use in some sense "contextual" information when selecting the next attribute (term) for a split of the document collection. Our contextual method exploits the split much more thoroughly, extracting more valuable local information about the term collection.

Many document clustering techniques suffer from the tendency of forming spherical clusters. The histogram-based characterization of clusters allows for overcoming this shortcoming, and not only in the case of artificial immune systems.

7 Concluding Remarks

The contextual model described in this paper admits a number of interesting and valuable features in comparison with the global and hierarchical models used traditionally to represent a given collection of documents. Further, when applying the immune algorithm to clustering the collection of documents, a number of improvements was proposed. These improvements comprise:

• Identification of redundant antibodies by means of the fast agglomerative clustering algorithm [18].
• Fast generation of mutated clones without computation of their stimulation by the currently presented antigen. These mutants can be characterized by a presumed ability of generalization (cf. Sect. 3.3).
• Time-dependent parameters σd and σs. In general we have no recipe allowing to tune both parameters to a given dataset. In the original approach [10] a trial-and-error method was suggested. We observed that in a highly dimensional space the value of σd is almost as critical as the value of σs. Hence we propose a "consistent" tuning of these parameters – cf. Sect. 3.4. The general recipe is: carefully (i.e. not too fast) remove weakly stimulated and too specific antibodies, and carefully splice redundant (too similar) antibodies.


• Application of the CF-trees [27] for fast identification of winners (most stimulated memory cells) [7].

With these improvements we proposed a new approach to mining high-dimensional datasets. The contextual approach described in Sect. 2 appears to be fast, of good quality (in terms of the indices introduced in Sects. 5.1 and 5.2) and scalable (with the data size and dimension).

Clustering high-dimensional data is of practical importance and at the same time a big challenge, in particular for large collections of text documents. The paper presents a novel approach, based on artificial immune systems, within the broad stream of map-type clustering methods. Such an approach leads to many interesting research issues, such as context-dependent dictionary reduction and keyword identification, topic-sensitive document summarization, subjective model visualization based on a particular user's information requirements, dynamic adaptation of the document representation and local similarity measure computation. We plan to tackle these problems in our future work. It has to be stressed that not only textual, but also any other high-dimensional data may be clustered using the presented method.

Acknowledgements

This research was partially supported by the European Commission and by the Swiss Federal Office for Education and Science with the Sixth Framework Programme project REWERSE no. 506779 "Reasoning on the Web with Rules and Semantics" (cf. http://rewerse.net). Krzysztof Ciesielski was partially supported by MNiSW grant no. N516 005 31/0646.

References

1. A. Baraldi and P. Blonda. A survey of fuzzy clustering algorithms for pattern recognition. IEEE Trans. on Systems, Man and Cybernetics, 29B:786–801, 1999.
2. A. Becks. Visual knowledge management with adaptable document maps. GMD research series, 15, 2001.
3. M.W. Berry, Z. Drmac, and E.R. Jessup. Matrices, vector spaces and information retrieval. SIAM Review, 41(2):335–362, 1999.
4. J.C. Bezdek and S.K. Pal. Fuzzy Models for Pattern Recognition: Methods that Search for Structures in Data. IEEE, New York, 1992.
5. G.B.P. Bezerra, T.V. Barra, M.F. Hamilton, and F.J. von Zuben. A hierarchical immune-inspired approach for text clustering. In Proc. Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2006), volume 1, pages 2530–2537, 2006.
6. C. Boulis and M. Ostendorf. Combining multiple clustering systems. In Proc. of 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2004), pages 63–74. Springer-Verlag, LNAI 3202, 2004.
7. K. Ciesielski, M. Draminski, M. Kłopotek, D. Czerski, and S.T. Wierzchon. Adaptive document maps. In Proceedings of the Intelligent Advances in Soft Computing 5, pages 109–120. Springer-Verlag, 2006.
8. K. Ciesielski and M. Kłopotek. Text data clustering by contextual graphs. In L. Todorovski, N. Lavrac, and K.P. Jantke, editors, Discovery Science, pages 65–76. Springer-Verlag, LNAI 4265, 2006.
9. L.N. de Castro and J. Timmis. Artificial Immune Systems: A New Computational Intelligence Approach. Springer, 2002.
10. L.N. de Castro and F.J. von Zuben. An evolutionary immune network for data clustering. In SBRN'2000, pages 84–89. IEEE Computer Society Press, 2000.
11. B. Fritzke. Some competitive learning methods, 1997. http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/research/gsn/JavaPaper.
12. M. Gilchrist. Taxonomies for business: Description of a research project. In 11 Nordic Conference on Information and Documentation, Reykjavik, Iceland, May 30 – June 1, 2001. http://www.bokis.is/iod2001/papers/Gilchrist paper.doc.
13. C. Hung and S. Wermter. A constructive and hierarchical self-organising model in a non-stationary environment. In Int. Joint Conference on Neural Networks, pages 2948–2953, 2005.
14. S.Y. Jung and K. Taek-Soo. An incremental similarity computation method in agglomerative hierarchical clustering. Journal of Fuzzy Logic and Intelligent Systems, December 2001.
15. M. Kłopotek. A new Bayesian tree learning method with reduced time and space complexity. Fundamenta Informaticae, 49(4):349–367, 2002.
16. M. Kłopotek, M. Draminski, K. Ciesielski, M. Kujawiak, and S.T. Wierzchon. Mining document maps. In M. Gori, M. Ceci, and M. Nanni, editors, Proc. of Statistical Approaches to Web Mining Workshop (SAWM) at PKDD'04, pages 87–98, Pisa, Italy, 2004.
17. M. Kłopotek, S. Wierzchon, K. Ciesielski, M. Draminski, and D. Czerski. E-Service Intelligence – Methodologies, Technologies and Applications. Part II: Methodologies, Technologies and Systems, volume 37 of Studies in Computational Intelligence, chapter Techniques and technologies behind maps of Internet and Intranet document collections, pages 169–190. Springer, 2007.
18. M. Kłopotek, S. Wierzchon, K. Ciesielski, M. Draminski, and D. Czerski. Conceptual Maps of Document Collections in Internet and Intranet. Coping with the Technological Challenge. IPI PAN Publishing House, Warszawa, 2007.
19. T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, Heidelberg, New York, 2001.
20. K. Lagus, S. Kaski, and T. Kohonen. Mining massive document collections by the WEBSOM method. Information Sciences, 163/1-3:135–156, 2004.
21. G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Upper Saddle River, NJ, USA, 1971.
22. N. Tang and V.R. Vemuri. An artificial immune system approach to document clustering. In Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, New Mexico, pages 918–922, 2005.
23. J. Timmis. aiVIS: Artificial immune network visualization. In Proceedings of EuroGraphics UK 2001 Conference, pages 61–69. University College, London, 2001.
24. C.J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.html.
25. S.T. Wierzchon. Artificial immune systems. Theory and applications (in Polish). Akademicka Oficyna Wydawnicza EXIT Publishing, Warszawa, 2001.
26. D.R. Wilson and T.R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38:257–286, 2000.
27. T. Zhang, R. Ramakrishan, and M. Livny. BIRCH: Efficient data clustering method for large databases. In Proc. ACM SIGMOD Int. Conf. on Data Management, pages 103–114, 1997.
28. Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. http://www-users.cs.umn.edu/~karypis/publications/ir.html.


Critical Feature Detection in Cockpits – Application of AI in Sensor Networks

S. Srivathsan1, N. Balakrishnan2 and S.S. Iyengar1

1 Computer Science Department, Louisiana State University, Baton Rouge, LA 70803, [email protected], [email protected]

2 Associate Director of IISc and Professor of Supercomputer Education andResearch Center, Indian Institute of Science, Bangalore, [email protected]

Summary. We highlight some safety issues in commercial planes, particularly focusing on hazards in the cockpit area. This chapter discusses a few methodologies to detect critical features and provide unambiguous information about the possible sources of hazards to the end user in near real time. We explore the application of Bayesian probability, the Iyengar–Krishnamachari method, Probabilistic Reasoning, Reasoning under Uncertainty and Dempster–Shafer Theory, and analyze how these theories could help in the analysis of data gathered from wireless sensor networks deployed in the cockpit area.

1 Introduction and Motivation

Incidents of in-flight fires and smoke are rare events. But they are to be treated seriously because the consequences could be severe. A look at the history of aircraft emergencies and disasters due to unknown sources of fire and smoke tells clearly about the pressing need for applying state-of-the-art technologies to help the aircraft crew with real-time detection, analysis and reporting of anomalies in aircraft cabins such as passenger cabins, toilets, cockpit and cargo-bay areas. Reliable detection, monitoring and suppression of in-flight smoldering fire/flaming fire/smoke is essential because of the inability to quickly land. Furthermore, for long-haul and trans-continental flights, the failure of such detection mechanisms could be severe.

1.1 Studies by Transport Authorities

The International Air Transport Association (IATA) has estimated that more than 1,000 in-flight smoke events occur annually, resulting in more than 350 unscheduled or precautionary landings [15]. In-flight smoke events are estimated at a rate of one in 5,000 flights, while in-flight smoke diversions are estimated to occur on one in 15,000 flights [3, 16].

Despite the requirement that passenger aircraft must have fire detection systems that provide a visible indication of the start of a fire to the flight crew within a minute of its start [2], the systems in place for detection, localization, analysis and monitoring are unsatisfactory. Some of the reasons for this are:

• Current use of unsophisticated fire detectors (photoelectric and ionization smoke detectors).
• Fires often start at inaccessible places in the aircraft.
• Unreliable fire and smoke detectors (many false alarms).
• Lack of standardization regarding exactly what the detection system is supposed to detect within 1 min.
• Most of the fire/smoke detection systems are live-tested in cargo compartments only. Cockpit testing has been minimal.
• Lack of certification guidelines for using the better fire and smoke detectors available today.
• Lack of robust coupling among the sensors and the avionics/visual display to help the crew.

A combination of the above factors can make a simple, easy-to-detect hazard manifest itself in a highly complex manner. This is where the motivation lies for exploring better mathematical tools and models for segregating the possible causes and providing a quantifiable value for the possible cases.

Multimodal sensors have not been used previously, but they hold the promise of providing more accurate measurements while reducing the rate of false alarms. False alarms are caused by mist, dust, oil particles, temperature variations, air-pressure variations and more. Despite the presence of better technologies available today, they still have to undergo rigorous standardization and testing processes. And, for that, the FAA and other regulatory bodies should come up with acceptable threshold alarm levels, the number of sensors that should trigger or suppress alarms, etc. The installed system must be capable of detecting anomalies at a temperature significantly below that at which the avionics and the structural integrity are compromised. A physics-based CFD simulation tool could be used for predicting the flow of smoke, heat and gases and for identifying potential locations to install sensors and threshold levels. Tiny, light-weight wireless nodes with appropriate sensors could be installed at inaccessible places of an aircraft and integrated with the avionics and display system for early warning and prediction.

Promising technologies are available today for a spectrum of applications to better detect, predict, monitor and report smoke/fire related anomalies and to provide better and time-critical decision making capabilities to the crew by discriminating between varying conditions of threats. The technologies involved are the growing field of wireless sensor networks (a class of ad hoc wireless distributed systems sprung from advancement and miniaturization in technology), sophisticated signal processing mechanisms, a better understanding of fire physics and improvements in CFD (Computational Fluid Dynamics) simulations.

1.2 Computational Intelligence

Computational Intelligence is described as follows [4, 11, 12]: "Computationally intelligent system should be characterized with the capability of computational adaptation, fault-tolerance, high computational speed and less error-prone to noisy information sources. Computational adaptation means that the system should be capable of changing its parameters following some guidelines (of optimizing criteria) and depending on the temporal changes in its input and output instances. Most of the ANN models satisfy this characteristic. Computational fault tolerance is a general characteristic of a parallel and distributed system. In a parallel and/or distributed system, computational resources like variables, procedures and software tools are usually replicated at the distributed units of the computing system. So, a damage of a few units usually does not cause malfunctioning of the entire system, as the same resources are available at other units. While ANN and fuzzy logic have their inherent characteristics of fault tolerance, GA and belief networks too can be configured in a parallel fashion to provide users the benefit of fault tolerance."

Computational Intelligence has various tools such as fuzzy logic, rough set theory, neural networks, genetic algorithms, belief networks, chaos theory, computational learning theory and artificial life. The synergistic behavior of the above tools on many occasions far exceeds their individual performance [4]. We focus on probabilistic reasoning and Dempster–Shafer Theory in this chapter.

1.3 Contribution

This chapter is about how distributed wireless sensor networks equipped with sophisticated intelligence can effectively monitor, diagnose, track, analyze and predict in real time in-flight smoke and fires (particularly in the cockpit). The primary focus of this article is to provide possible methodologies to determine system/component degradation and damage early enough to prevent, or gracefully recover from, in-flight failures in air-transport systems. We present our research studies (with our simulation results) and other theoretical concepts which would prove useful in such applications. In Sect. 2 we discuss the cockpit threats with some real-world examples, in Sect. 3 we talk about distributed sensor networks, and in Sect. 4 we discuss a few techniques, such as the Iyengar–Krishnamachari method, reasoning under uncertainty, and data analysis using Dempster–Shafer theory, that could be used along with sensor networks to reliably and efficiently detect and monitor in-flight cockpit hazards.


2 Related Work

NASA has done extensive work on this aspect of vehicle health management [5]. For the reusable launch vehicle X-37, the IVHM (Integrated Vehicle Health Management) software was built to perform real-time fault detection and isolation for the X-37's electrical power system and electro-mechanical actuators. This experiment used software called Livingstone (developed at NASA Ames Research Center), which is capable of diagnosis using a qualitative, model-based reasoning approach. The methodologies used were very promising, but the experiment was suspended in 2002 due to lack of funding. The IVHM system (for the X-37) was capable of monitoring the in-flight RLV using sensor data from selected subsystems, performing real-time fault detection and isolation, and identifying potential recovery actions. NASA has a bigger IVHM project [6] which is underway now and projected to be completed by 2011. Northrop Grumman has a similar project [7] which uses MBR (Model Based Reasoning) to analyze large, complex systems.

In [8], a distributed detection scheme is presented which uses data fusion techniques for monitoring the function of redundant processing channels of a flight-critical control system during operation. The system is subjected to closed-loop High-Intensity Radiated Fields (HIRF) effects together with other conditions such as heavy clear-air turbulence. The authors provide a distributed control-law-calculation malfunction detection scheme for monitoring the integrity of a flight control computer with redundant processing elements. NASA's Aviation Safety Program has developed, and continues to develop, several technologies to reduce aircraft accidents due to vehicle loss of control and system failures. Reference [9] gives an overview of those technologies, focusing on technologies for SAAP (Single Aircraft Accident Project).

3 Cockpit Threats

The Federal Aviation Administration is beginning to address an issue that has not been very high on its priority list in earlier years: the in-flight danger that occurs very often, smoke in the cockpit. This is usually caused by dense intertwined electrical wires, fluid leaks and other unknown factors. Such events have caused forced landings and redirections nearly once a day on average [13]. Despite knowing the frequency of such occurrences and the consequences, pilots still lack an effective way to deal with smoke and fire in the cockpit. One famous example where smoke and fire caused the crash of a plane is that of Swissair Flight 111, off the coast of Peggy's Cove, Nova Scotia, on September 2, 1998. Figure 1 (borrowed from [14]) shows the fire damage on the insulation blankets of Swissair 111 at 850 seconds, just after the Cabin Bus was switched OFF. Had there been numerous, redundant, tiny wireless sensors installed in and around the cockpit (as indicated in part (b)), the fire and smoke could have been detected very early, and the enormity of the situation could have been made known in time for proper decision making by the crew.


Fig. 1. Fire spread at (a) 850 s and (b) 1,252 s

Fire was likewise blamed for the crash of ValuJet 592 on May 11, 1996, in the Browns Farm Wildlife Management Area of the Everglades, Florida.

Other notable examples are the following (from [1]):

1. “1947, October 24th. A United Airlines DC-6 crashed while attempting to make an emergency landing at Bryce Canyon, Utah. They almost made it, but the fire burned through the controls just short of the airport, killing all 52 on board.”

2. “1973, July 11th. A Varig Boeing 707, enroute from Rio de Janeiro to Paris, was forced to land short of the runway at Orly airport, only 5 min after reporting a fire in the rear of the cabin. The smoke was so thick in the cockpit that the pilot had to look out the opened side windows to make the crash landing. He could not see his instrument panel or out the front windshield. Of the 134 on board, only the 3 pilots, 7 cabin crew and 1 passenger survived. All others were asphyxiated and burned. The accident report found the probable cause to be a fire that originated in the washbasin unit of the aft right toilet, either as a result of an electrical fault or by the carelessness of a passenger. [Editor’s translation: a passenger smoked in the blue room and then threw the lighted cigarette into the trash can.]”

3. “1973, November 3rd. A Pan American 707-321C cargoliner crashed, just short of the runway, at Boston Logan International Airport, killing the three pilots on board. Only 30 min after taking off from New York’s JFK Airport, the pilot reported smoke in the cockpit. The smoke became so thick that it “. . . seriously impaired the flight crew’s vision and ability to function effectively during the emergency”. The captain had not been notified that hazardous cargo was aboard.”

It is well known that cockpit smoke and fire do not necessarily start in the cockpit. They could start elsewhere (perhaps in a toilet, above the ceiling, in a cargo area, or in the entertainment-system area, where there are dense electronics, wires, etc.) and gradually spread through the ventilation system.


[Fig. 2 bar chart: Air Carrier Events Reporting Smoke in the Cockpit or Cabin from February to April 2006 – number of events per suspected cause (Contaminants, Deicing Fluid Contamination, Electrical, Engine, False Warning, Filter, Fire Detection, Mechanical, No Fault Found, Non-Aircraft System, Unknown, Grand Total), broken down by month (Feb-06, Mar-06, Apr-06).]

Fig. 2. Suspected causes of smoke in cockpit or cabin – figure reproduced from FAA – US/Europe International Aviation Safety Conference, by Jim Ballough, June 8, 2006

[Fig. 3 bar chart: Air Carrier Events Reporting Smoke in the Cockpit or Cabin from February to April 2006 – number of events per system (APU, Electrical, Entertainment System, Fire Detection, Hydraulics, Interior Light, Unknown, Total Count of System), broken down by month (Feb-06, Mar-06, Apr-06).]

Fig. 3. Systems causes for smoke in cockpit or cabin – figure reproduced from FAA – US/Europe International Aviation Safety Conference, by Jim Ballough, June 8, 2006

Figure 2 provides an indication of the number of systems that were potential suspects for hazardous events in the cockpit/cabin, and Fig. 3 shows the actual systems that were later found to be involved in the events.


The mere number of disparate systems that could be involved in any hazard would confuse the crew in narrowing down the issue/source. Fire and smoke are also known to spread very fast due to the presence of flammable insulation materials, conductive and non-conductive wire ties, flammable dust, grime, lint-like debris, etc. In order to reduce the weight of the wires, the manufacturers have lightened the insulation materials. These insulation materials have cracked and exposed the raw wires, which has led to arcing. This arcing, combined with combustible materials near it (e.g., the insulation material itself), causes self-sustaining fire [3]. This is one of the potential causes of fire in or near the cockpit. Typically, wires of disparate electrical equipment are bundled together when routed, and when such dense bundles of wires burn or arc, seemingly unrelated systems of the aircraft fail, which leads to confusion when the crew is analyzing the situation. Swissair Flight 111 had exactly this problem: arcing and burning of the in-flight entertainment system wires caused melting, which provided a conductive path for electrical power to other wires [3]. Extensive study has revealed that the most frequent source of fire in passenger planes is related to electrical faults. A Boeing study showed that, between November 1992 and June 2000, almost two thirds of the in-flight fires on Boeing aircraft were electrical [20].

On December 8, 2005, fire broke out in the avionics equipment below the pilot's seat on a plane (a ComAir Bombardier Canadair Regional Jet, CRJ-200) from Cincinnati, Ohio. This caused the loss of all the electronic flight instrument system (EFIS) displays.

[Fig. 4 bar chart: Air Carrier Events Reporting Smoke in the Cockpit or Cabin from February to April 2006 – number of diversions, evacuations, returns and smoke-in-cockpit-or-cabin events, broken down by month (February, March, April).]

Fig. 4. Diversion, evacuations and returns due to smoke – figure reproduced from FAA – US/Europe International Aviation Safety Conference, by Jim Ballough, June 8, 2006


A week later, a similar fire occurred on an Atlantic Southeast Airlines flight, creating confusion for the pilots as they struggled with cascading failures of equipment. It was later found that two of the fires were close to the pilot's oxygen supply. Had the fire reached the oxygen supply, the oxygen would have fed the fire and the situation would have been far more critical [23].

4 Distributed Sensor Networks

Distributed wireless sensor networks are a class of distributed systems in which tiny nodes perform semi-autonomous or autonomous tasks and interact with their neighbors in an ad hoc manner. These nodes have simple processing, storage, sensing and communication capabilities. The nodes could potentially be equipped with various types of sensors, such as photo sensors, temperature sensors, humidity sensors, vibration sensors and more (according to the application need). Collections of such nodes can communicate and coordinate via their radio antennas.

These tiny wireless nodes, when deployed, provide information, in terms of measured parameters, about their immediate surroundings. Such densely deployed sensor nodes are good building blocks for built-in intelligence. Today's technology allows the sensor fabric to be conformal; it may even be part of the inner skin/contour of the aircraft. When such embedded sensing and computation surrounds the aircraft, the notion of ambient intelligence will be realized. In such scenarios, tiny, embedded sensors collect data, apply local aggregation techniques, and interact with their neighbors for reliability and information passing. Such a deployment will (in most cases) be unobtrusive in aircraft monitoring and control.

There are several advantages of wireless sensors. These nodes are small, light-weight and do not need wires. Wiring is expensive; in particular, when lots of traditional sensors are installed in an aircraft, the wiring becomes a major problem in terms of maintenance and routing. Moreover, wired sensors cannot be mobile. Wireless operation enables the nodes to be installed close to the places where they can sense interesting phenomena. Hence, for cockpit monitoring and real-time analysis of anomalies, wireless sensor networks would be an inevitable requirement. There are significant advantages in deploying a large number of inexpensive sensors as against deploying a few expensive (but highly accurate) sensors. One can achieve, at smaller or comparable total system cost: much higher spatial resolution; higher robustness against failures through distributed operation; uniform coverage; small obtrusiveness; ease of deployment; reduced energy consumption; and, consequently, increased system lifetime. Sensors are usually positioned close to the source of the potential problem (or interest), from where the acquired data might have the greatest benefit or impact.

Sensors could be placed in many areas of the aircraft: engine nacelles, class C cargo compartments, electronics bays, lavatories, auxiliary power units, hot bleed-air ducts, and the wheel wells.


A huge number of sensors could be placed in the ventilation system, in air ducts, above the cockpit ceiling, below the pilot's floor, behind the instrument panels and in numerous other places. The only factor limiting the number of places where sensors could be installed is the physical dimension of the sensor itself.

The wireless sensor network system for aircraft safety should have the following characteristics:

• Low rate of false alarms.
• Capable of detecting a wide spectrum of problems (smoldering and flaming fires, smoke, chemical odors, etc.).
• Quick response time (low latency).
• Reliable and quick detection of noxious fumes like CO, CO2, CH4, total hydrocarbons, HCl, HF, SO2 and NO.
• Compatible with system upgrades.
• Passes adequate tests for precision sensing using multi-modal sensors.
• Passes the rigorous testing standards for the wireless nodes themselves.
• Easy to install, inspect, replace, test and standardize for aircraft safety.

Smell is usually the first indication of a fire or potential fire. Specialized sensor nodes could be installed in many places to detect various fumes and provide early warning.

Fig. 5. A typical cockpit of an MD-11. Image borrowed from the SR 111 Investigation Report by the Transportation Safety Board of Canada


This initial warning will go a long way toward saving lives. Usually, the difficult part is locating the source. Crew members are not explicitly required to be trained to identify the location of a hidden fire or to know how to gain access to the area behind interior panels. In such scenarios, tiny sensors would help localize the problem and direct the crew.

4.1 Radio Interference

In commercial aviation, though the major goal is to provide in-flight Internet connectivity to the passengers, a parallel major goal is being worked on: ensuring that present and future aircraft wireless networks (AWNs) do not interfere with the avionics. Such technology is being developed by Connexion by Boeing and OnAir [24]. Standardization processes are under way, and the military domain is currently the most mature in terms of ubiquity and standardization. Link-16, otherwise known as Tactical Digital Information Link J (TADIL J) [25], and JTRS [26] are two examples of standards toward this effort.

Electronic devices might pose problems in airplane control systems, although there are no documented cases of civilian airline crashes caused by cell-phone or other interference [27]. Any kind of radio transmitter could induce transient currents in the wires, and these could potentially be amplified by the aluminum airframe. The wireless sensor nodes also produce electromagnetic energy, which could potentially disrupt the avionics.

[Fig. 6 diagram: DSN integration with avionics – wireless sensors communicate with wireless base(s), which feed data analysis and simulation, which drives a visual display at the cockpit.]

Fig. 6. Integration of distributed sensor networks with the avionics


Most of the current sensors operate in the 2.4 GHz ISM band or use UWB. Another scenario which could disrupt the avionics is the flow of bit streams of ones and zeros: it could introduce new data into the normal data stream in the aircraft control systems. Though such data could be rejected by error-correcting algorithms, this could result in an interruption rather than a deviation from the norm. It is well known that the antennas on top of the plane are important for takeoff and landing and could be affected by radiation from laptops, cellphones and other portable electronic devices (PEDs). Both next-generation aircraft, the Boeing 7E7 and the Airbus A380, have separate certification proposals for onboard wireless LANs. Many companies are working on wireless sensor systems for aircraft safety [28, 29], manufacturing products for fire and smoke detection in aircraft.

At the FAA's request, a special committee of RTCA, Inc. (Radio Technical Commission for Aeronautics) was established in May 2003 to study the effects of wireless telecommunication devices on avionics equipment. RTCA [30] is a private, not-for-profit corporation that provides technical analysis and advice to various federal agencies (including the FAA) and many different aeronautics companies. This committee also works closely with the FCC, the European working group on PEDs and the Aging Transport Systems Rulemaking Advisory Committee (ATSRAC). The phase-1 research focuses on how current technologies such as analog and cellular phones, wireless PDAs, 802.11a/b/g and Bluetooth affect the on-board navigation systems. The phase-2 research focuses on future technologies such as Ultra-Wideband (UWB) devices and pico-cells for telephone use on aircraft.

5 Feature Detection

In the previous sections, we saw the seriousness of in-flight smoke and fire, how they have affected the aviation industry in the past, and how distributed sensor networks are organized. In this section, we explore some algorithmic techniques used by the network of wireless sensors in its various activities toward accomplishing this goal.

Even though sensing matters, the key requirement is not that a sensor be capable of sensing different environmental conditions such as molecular gases, heat conduction, condensed-phase aerosols, thermal radiation and more. Sophisticated sensors are available today that can accurately sense and disambiguate a plethora of threatening pre-fire conditions, such as partially oxidized materials. The key requirement is how these networks of communication-enabled sensors coordinate with one another; how they make local and global decisions about the event; how they propagate information in the face of lossy links, node failures, etc.; how they disambiguate false positives from the ground truth; how they localize the event region; how the information can be aggregated to enable the simulation system to analyze, visualize and predict the spread of fire and fumes in the plane; and more.


Once data starts trickling into the main data analysis and simulation center (usually a powerful on-board computer in or near the cockpit), the system should be able to do the following effectively:

• Analyze the threat pre-conditions (if available).
• Analyze the threat data and provide instant information about the possible devices/instruments at risk; provide the list even if the items are very disparate pieces of equipment, which will help keep the crew from getting confused.
• Localize the problem area and provide the immediate steps (the standard procedure) to ameliorate the situation.
• If the threat is above a certain threshold, automatically signal the ground stations for an emergency.
• Compute real-time risk analyses and simulate the effect of the smoke or fire.
• Provide a visual display of the plume temperature, velocity, concentrations of noxious fumes, particulate volume fraction, coverage, scattering coefficient, UV and IR emissions, etc., with a prediction of the situation for the next several minutes.

“Feature extraction” or “event-region detection” is the task of determining those regions of a given environment which happen to produce certain “interesting” phenomena. For example, in the case of cockpit hotspot detection, the interesting phenomena would be the detection of unusually hot areas in or around the cockpit, and the presence of smoke or any kind of pre-fire condition.

[Fig. 7 diagram: desired properties of a DSN algorithm – resilient to multiple node failures; reliable information and communication; performs at acceptable latency levels; easy to test and verify end-to-end operation; scalable and capable of future upgrades; amenable to strict standardization rules; easy to reconfigure; very low to zero false positives.]

Fig. 7. Some important characteristics of a DSN algorithm for the cockpit hotspot detection application


5.1 Iyengar–Krishnamachari Method

Iyengar and Krishnamachari [35] have proposed a neat solution for the fault-event disambiguation problem in sensor networks. This is critical to the application we have at hand: many sensors could be faulty and could trigger false alarms or produce data that differ from the ground truth. The Bayesian fault-recognition algorithm takes into account the notion that data measurement errors due to faulty sensors are likely to be uncorrelated, while the events in the physical space are spatially correlated. This can be expressed as:

P (Si = 0|Ti = 1) = P (Si = 1|Ti = 0) = p, (1)

where p is the sensor fault probability and Si is the abstract binary variable which takes the value 0 if the sensor measurement indicates a normal value and the value 1 if it is an unusual value. Similarly, Ti = 0 indicates the ground truth that the node is in a normal region and Ti = 1 that the sensor is situated in an event region.

To model the spatial correlation of the event values, the authors introduce the notion of Ei(a, k): the event that k of the neighboring nodes report the same binary reading a as node i, while the remaining (N − k) report ¬a. Then,

P(Ri = a|Ei(a, k)) = k/N, (2)

where Ri is the estimate of the true reading Ti. In order to determine a value for Ri, given the information about node i's own sensor reading Si and the evidence Ei(a, k), the following Bayesian equation provides the probability:

P(Ri = a|Si = b, Ei(a, k)) = P(Ri = a, Si = b|Ei(a, k)) / P(Si = b|Ei(a, k)). (3)

From the above, we have two cases (b = a), (b = ¬a):

Paak = P(Ri = a|Si = a, Ei(a, k)) = (1 − p)k / ((1 − p)k + p(N − k)), (4)

P(Ri = ¬a|Si = a, Ei(a, k)) = 1 − P(Ri = a|Si = a, Ei(a, k)) = p(N − k) / ((1 − p)k + p(N − k)). (5)

Equations (4) and (5) give the statistic with which the sensor node can now make a decision about whether or not to disregard its own sensor reading Si in the face of the evidence Ei(a, k) from its neighbors.

A threshold decision scheme is proposed that uses a threshold 0 < Θ < 1, conveying the idea that if


Fig. 8. Sample scenario: a distributed sensor network with uncorrelated sensor faults (denoted as “x”) deployed in an environment with a single event region (dashed circle). Image borrowed with permission from [35]

P(Ri = a|Si = a, Ei(a, k)) > Θ, (6)

then Ri is set to a, and the sensor believes that its own reading is correct. Another scheme, called the optimal threshold decision scheme, first gets the sensor readings of node i's Ni neighbors. It then determines ki, the number of node i's neighbors j with Sj = Si. If ki >= 0.5Ni, it sets Ri = Si; else it sets Ri = ¬Si.

The threshold decision scheme with Θ is equivalent to picking an integer kmin such that node i decodes to a value Ri = Si = a if and only if at least kmin of its N neighbors report the same sensor measurement a. In this scheme, Ri = a ⟺ Paak > Θ. We can rewrite (4) as follows:

Paak = (1 − p)k / (k(1 − 2p) + pN), (7)

and kmin is given by the following expression:

kmin = ⌈pNΘ / (1 − p − (1 − 2p)Θ)⌉. (8)
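As a quick illustration of (8), the short Python sketch below computes kmin for representative parameter values. The function name and the use of math.ceil for the ceiling brackets are our own rendering of the formula, not code from [35]:

```python
import math

def k_min(p: float, N: int, theta: float) -> int:
    # Eq. (8): k_min = ceil(p*N*theta / (1 - p - (1 - 2p)*theta))
    return math.ceil(p * N * theta / (1.0 - p - (1.0 - 2.0 * p) * theta))

# Example: N = 4 neighbors, fault probability p = 0.1, threshold theta = 0.5:
# numerator 0.2, denominator 0.5, ratio 0.4, so k_min = ceil(0.4) = 1.
print(k_min(0.1, 4, 0.5))
```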

The detailed steps of the scheme are presented below, along with the optimal threshold decision scheme [35]:


5.2 Decision Schemes for Fault Recognition

Randomized Decision Scheme

1. Obtain the sensor readings Sj of all Ni neighbors of node i.

2. Determine ki, the number of node i’s neighbors j with Sj = Si.

3. Calculate Paak = (1 − p)ki / ((1 − p)ki + p(Ni − ki)).

4. Generate a random number u ∈ (0, 1).

5. If u < Paak, set Ri = Si else set Ri = ¬Si.

Threshold Decision Scheme

1. Obtain the sensor readings Sj of all Ni neighbors of node i.

2. Determine ki, the number of node i’s neighbors j with Sj = Si.

3. Calculate Paak = (1 − p)ki / ((1 − p)ki + p(Ni − ki)).

4. If Paak > Θ, set Ri = Si else set Ri = ¬Si.

Optimal Threshold Decision Scheme

1. Obtain the sensor readings Sj of all Ni neighbors of node i.

2. Determine ki, the number of node i’s neighbors j with Sj = Si.

3. If ki >= 0.5Ni, set Ri = Si else set Ri = ¬Si.
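The following Python sketch renders the three schemes above in code. It is a minimal illustration under our own assumptions (neighbor readings passed as a list of 0/1 integers; function names are ours), not an implementation taken from [35]:

```python
import random

def p_aak(p, k, n):
    # Eq. (4): probability that node i's own reading is correct,
    # given that k of its n neighbors agree with it.
    return ((1 - p) * k) / ((1 - p) * k + p * (n - k))

def randomized_scheme(s_i, neighbor_readings, p):
    k = sum(1 for s_j in neighbor_readings if s_j == s_i)
    u = random.random()
    # Keep the own reading with probability Paak, otherwise flip it.
    return s_i if u < p_aak(p, k, len(neighbor_readings)) else 1 - s_i

def threshold_scheme(s_i, neighbor_readings, p, theta):
    k = sum(1 for s_j in neighbor_readings if s_j == s_i)
    return s_i if p_aak(p, k, len(neighbor_readings)) > theta else 1 - s_i

def optimal_threshold_scheme(s_i, neighbor_readings):
    # Keep the own reading iff at least half of the neighbors agree with it.
    k = sum(1 for s_j in neighbor_readings if s_j == s_i)
    return s_i if k >= 0.5 * len(neighbor_readings) else 1 - s_i
```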

5.3 Simulation Results

To test the performance of the fault-recognition algorithm, experiments were conducted in which the scenario consisted of n = 1,024 nodes placed in a 32 × 32 square grid of unit area. All sensors were binary: they report a “0” to indicate no event and a “1” to indicate that there is an event. The faults are modeled by an uncorrelated, symmetric Bernoulli random variable.

In Fig. 9, the sensor nodes are represented by dots, the bold nodes are in the event region, and an “x” shows a faulty node (before the fault-recognition algorithm runs).


Table 1. Summary of notations

n: total number of deployed nodes
nf: number of nodes in the event region
no: the number of other nodes, no = n − nf
N: the number of neighbors of each node
Ti: the binary variable indicating the ground truth at node i
Si: the binary variable indicating the sensor reading (the sensor is faulty when Si = ¬Ti)
Ri: the binary variable holding the decoded value (the decoding is correct when Ri = Ti)
Ei(a, k): the event that k of node i’s N neighbors have the same sensor reading a
Paak: the conditional probability P(Ri = a|Si = a, Ei(a, k))
p: the (symmetric) fault probability, P(Si = 1|Ti = 0) = P(Si = 0|Ti = 1)
Θ: the decision threshold

An “o” indicates a node with an erroneous reading after fault recognition. Nodes with both “x” and “o” are nodes whose errors were not corrected, nodes with just an “x” are the ones whose errors were corrected, and nodes with just an “o” are the ones where new errors have been introduced by the fault-recognition algorithm.
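A rough Python sketch of such an experiment is given below, using the optimal threshold decision at every node. The 4-connected neighborhood, the square event region and the error counting are our own illustrative choices; they mimic, but do not reproduce, the exact setup of [35]:

```python
import random

SIDE, P_FAULT = 32, 0.1          # 32 x 32 grid, 10% sensor fault probability

def neighbors(x, y):
    # 4-connected grid neighbors, clipped at the grid borders.
    return [(a, b) for a, b in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
            if 0 <= a < SIDE and 0 <= b < SIDE]

# Ground truth: "1" inside a square event region, "0" elsewhere.
truth = {(x, y): int(12 <= x < 20 and 12 <= y < 20)
         for x in range(SIDE) for y in range(SIDE)}

# Sensor readings: each node flips its true value with probability P_FAULT.
reading = {pos: t ^ (random.random() < P_FAULT) for pos, t in truth.items()}

# Optimal threshold decision: keep own reading iff >= half the neighbors agree.
decoded = {}
for pos, s_i in reading.items():
    nbrs = [reading[q] for q in neighbors(*pos)]
    k = sum(1 for s_j in nbrs if s_j == s_i)
    decoded[pos] = s_i if k >= 0.5 * len(nbrs) else 1 - s_i

before = sum(reading[q] != truth[q] for q in truth)
after = sum(decoded[q] != truth[q] for q in truth)
print(f"errors before: {before}, errors after fault recognition: {after}")
```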

5.4 Reasoning with Uncertain or Incomplete Information

In favorable conditions, the sensor nodes would provide correct, timely information, and the wireless base would pass it on to the main computer for analysis and simulation. The information would be abundant, making it easy to form correct conclusions with good inference rules. On the contrary, there are (and have been) many situations where the system needs to draw useful conclusions from an uncertain and limited amount of data.

In the previous section, we saw how a Bayesian approach was used to disambiguate faults and events. For this approach, two major requirements must be met: first, all the relationships and their probabilities should be known, as well as the probabilistic relationships among the pieces of evidence; second, all evidence and hypotheses must be independent of each other.


Fig. 9. A snapshot of the simulator showing the errors before and after fault recognition with optimal threshold (p = 0.1). Image borrowed with permission from [35]

This is a difficult requirement to establish. In many cases such requirements are difficult to meet, and hence heuristics are used.

Figure 10 shows how a set of symptoms might be related to a set of causes. Due to the complexity faced in applying the principles of probability theory to various real-world problems, other theories were founded in which the complexity could be relaxed and the problem domain shrunk to a smaller set of relevant events and evidence. Bayesian belief networks do exactly this. The dependency constraints are relaxed, and the joint-probability table need not be fully specified with all possible combinations. This provides a probabilistic graphical model which represents a set of variables and their causes, where the probabilities among them are known. A directed acyclic graph is used to show how events influence the likelihood of each other. It is acyclic so that no reasoning depends on its own cause(s). Such a model can be used to determine the probabilities of various possible causes for an event.


[Fig. 10 diagram: symptoms (fire, smoke, NO2, failure of entertainment system, cockpit panel lights off) connected to Cause 1, Cause 2 and Cause 3.]

Fig. 10. A simple diagram showing the relationships between a set of symptoms and causes

5.5 Sensor Data Analysis Using Dempster–Shafer Theory

Computers are pushed to understand real-world problems using various context-aware computing methods. There are now ways to decompose complex information into simpler forms of information sets. Very often, uncertainty results from a combination of missing evidence, limitations of our knowledge and heuristics. The previously discussed techniques of Bayesian probability and belief networks do provide some solutions for reasoning under uncertainty, but they call for the use of quantifiers/measurements for any complex situation.

There are different kinds of evidence involved during decision making: consonant evidence, consistent evidence, arbitrary evidence and disjoint evidence. Consonant evidence is the case where the readings of one sensor are a subset of the readings of another one, and so on, as depicted in Fig. 11. Suppose there are four sensors with varying capacities, S1, S2, S3 and S4: S1 detects smoke in its region W; S2 detects both smoke and fire, in regions W and X, respectively; S3 detects smoke, fire and NO, in regions W, X and Y, respectively; and S4 detects fire, smoke, NO and CO2 in regions W, X, Y and Z. In consistent evidence, there is at least one piece of evidence that is common to all the sensor readings. In arbitrary evidence, there is no reading that is common to all subsets of readings, but some subsets may have some common readings. Finally, in disjoint evidence there are no readings common to any sensor readings.
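A tiny Python sketch can make these four categories concrete. The classification function and the representation of each sensor's readings as a set are our own illustration of the definitions above, not a standard algorithm:

```python
def evidence_type(readings):
    # Classify a list of sensor reading sets into the four categories.
    sets = [frozenset(r) for r in readings]
    ordered = sorted(sets, key=len)
    if all(a <= b for a, b in zip(ordered, ordered[1:])):
        return "consonant"      # each reading set nested in the next
    if frozenset.intersection(*sets):
        return "consistent"     # at least one reading common to all
    if any(a & b for a in sets for b in sets if a is not b):
        return "arbitrary"      # some pairwise overlap, none global
    return "disjoint"           # no overlap at all

# The four-sensor example from the text is consonant: {W} in {W,X} in ...
print(evidence_type([{"W"}, {"W", "X"}, {"W", "X", "Y"}, {"W", "X", "Y", "Z"}]))
```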

The Dempster–Shafer theory of evidence [37, 38] uses sets of ranges

[Belief, Plausibility], (9)


[Fig. 11 diagram: two panels of regions labeled W, X, Y, Z – nested regions illustrating consonant evidence (left) and partially overlapping regions illustrating arbitrary evidence (right).]

Fig. 11. An example of consonant evidence and arbitrary evidence types

where the degree of belief for each proposition must lie. A value of zero indicates that there is no evidence to support a set of propositions, and a value of one suggests certainty.

bl(A) = Σ_{B⊆A, B≠φ} m(B) and pl(A) = Σ_{B∩A≠φ} m(B). (10)
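Representing focal elements as Python frozensets, (10) can be computed directly. This is a minimal sketch under our own representation choice (a dictionary m mapping each focal element to its mass), not an implementation prescribed by [37, 38]:

```python
def belief(m, A):
    # bl(A): total mass of the non-empty focal elements contained in A.
    return sum(mass for B, mass in m.items() if B and B <= A)

def plausibility(m, A):
    # pl(A): total mass of the focal elements that intersect A.
    return sum(mass for B, mass in m.items() if B & A)
```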

Suppose the various events we are interested in are

Θ = {Fire, Smoke, HCl, NO, CO2, H2O, CH4, SO2}, (11)

and suppose a sensor Si senses “Fire”; the sensor Si will then assign a belief over Θ known as the probability mass function mi. The beliefs are computed based on the density function m, called the mass function (or the basic probability assignment, BPA):

m(φ) = 0 and Σ_{A⊆U} m(A) = 1, (12)

where m(A) denotes the evidence that supports exactly the claim A; it says nothing more about any particular subset of A. A subset A ⊆ U with m(A) > 0 is called a focal element of m.

According to the Dempster–Shafer theory of evidence, the probability of the observed value is represented by the interval

[Beliefi(Fire), Plausibilityi(Fire)]. (13)


The belief measure bl is the lower bound, and the plausibility measure pl is the upper bound, defined as

pli(Fire) = 1 − bl(¬Fire). (14)

Dempster’s combination rule pools together various pieces of evidence from different sources, which are independent of each other, and combines them into one body of evidence. If m1 and m2 are the mass functions of two independent bodies of evidence defined on a frame of discernment U, then the new evidence is defined by a new mass function m on the same frame U by:

m(A) = (m1 ⊕ m2)(A) = Σ_{B∩C=A} m1(B)m2(C) / (1 − Σ_{B∩C=φ} m1(B)m2(C)). (15)
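A direct Python transcription of (15) is sketched below, reusing the frozenset representation of focal elements from the previous sketch (again our own choice of data structure, not a prescribed implementation):

```python
def combine(m1, m2):
    # Dempster's rule, Eq. (15): intersect focal elements pairwise,
    # discard the conflicting (empty) intersections, and renormalize.
    combined, conflict = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            A = B & C
            if A:
                combined[A] = combined.get(A, 0.0) + mB * mC
            else:
                conflict += mB * mC
    if conflict >= 1.0:
        raise ValueError("totally conflicting bodies of evidence")
    return {A: mass / (1.0 - conflict) for A, mass in combined.items()}
```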

5.6 An Example Application of the Dempster–Shafer Theory

We shall look at a simple case of reasoning [36] using this theory when there are inputs from sensors. Suppose a sensor Si reports “smoke” in the cockpit, and let us say that our subjective probability that this sensor Si is reliable is 0.9, and that it is unreliable, 0.1. It is important to note that the report that there is smoke in the cockpit is true if Si is reliable, but not necessarily false if Si is unreliable. So, Si’s data alone justify a degree of belief of 0.9 that there is indeed smoke in the cockpit and a 0.0 belief that there is no smoke in the cockpit. This belief of 0.0 does not mean that we are sure that there is no smoke in the cockpit, as a probability measure of 0.0 would indicate. It merely means that the data reported by Si give us no reason to believe that there is no smoke in the cockpit. The plausibility measure in this case would be

pl(smoke) = 1 − bl(¬smoke) = 1 − 0.0, (16)

which is 1.0, and the belief function for Si would be [0.9, 1.0]. This also says that we still do not have any evidence that there is no smoke in the cockpit.

Let us now consider Dempster’s rule for combining evidence and see how we can reason with more data from more sensors. Suppose another (geographically close) sensor Sj also reports that there is smoke in the cockpit, and that we know the probability that Sj is reliable is 0.8 and that it provides unreliable data with a probability of 0.2. Now, it is important to consider the reports of Si and Sj to be independent of each other; that is, they have sensed and acted independently and have not used any sort of algorithm in which Si overhears Sj’s report (and vice versa) and thereby correlates and aggregates the data before transmitting it. The reliability of Sj is also independent of that of Si. The probability that both Si and Sj are reliable is the product of their reliabilities, which is 0.72; the probability that they both are unreliable is the product 0.02. The probability that at least one of the two is reliable is 1 − 0.02, which is 0.98. Now, from this information, we can say that


there is a probability of 0.98 that there is smoke in the cockpit, since both sensors reported the presence of smoke and at least one of them is reliable. Therefore, we can now assign to the event that there is smoke in the cockpit a [0.98, 1.0] degree of belief.

Now, let us consider a different situation. Suppose that Si and Sj report conflicting data: Si reports the presence of smoke in the cockpit and Sj reports the absence of smoke in the cockpit. In this situation, we know that both sensors cannot be right, so both cannot be reliable. Either both Si and Sj are unreliable or only one is reliable. The probability that only Si is reliable is

0.9 × (1 − 0.8) = 0.18, (17)

the probability that only Sj is reliable is 0.8 × (1 − 0.9) = 0.08, and the probability that neither is reliable is

0.2 × 0.1 = 0.02. (18)

Ruling out the contradictory case in which both are reliable, the total probability mass of the remaining scenarios is

(0.18 + 0.08 + 0.02) = 0.28, (19)

and we can compute the posterior probability that only Si is reliable as

0.18/0.28 ≈ 0.643, (20)

in which case there is smoke in the cockpit, or the posterior probability that only Sj was right,

0.08/0.28 ≈ 0.286, (21)

in which case there is no smoke in the cockpit.

The above computation is an example where Dempster’s rule was used to

combine beliefs. When both sensors reported that there is smoke in the cockpit, we took three hypothetical situations that supported smoke in the cockpit: Si and Sj are both reliable; Sj is reliable and Si is not; and Si is reliable and Sj is not. The belief, 0.98, was the sum of these possible supporting hypothetical scenarios. In the second use of Dempster’s rule, the sensors reported conflicting data, and in that situation we again took three possible scenarios; the only impossible situation was that both were reliable. Hence, the possible scenarios we considered were: Si reliable and Sj not, Sj reliable and Si not, and neither reliable. Normalizing over these three scenarios (total mass 0.28) gives a belief in smoke in the cockpit of 0.643. The belief that there is no smoke (Sj’s report) is 0.286, and since the plausibility of smoke is 1 − bl(¬smoke), or 0.714, the belief interval for smoke is [0.643, 0.714].
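The conflicting-sensor case can be replayed with the combine sketch from Sect. 5.5. Encoding each sensor's reliability as mass on its report, with the remainder on the whole frame Θ, is our own (if standard) modeling choice, not a construction given in [36]:

```python
SMOKE = frozenset({"smoke"})
NO_SMOKE = frozenset({"no smoke"})
THETA = SMOKE | NO_SMOKE          # the frame of discernment

# Si reports smoke with reliability 0.9; Sj reports no smoke, reliability 0.8.
m_i = {SMOKE: 0.9, THETA: 0.1}
m_j = {NO_SMOKE: 0.8, THETA: 0.2}

m = combine(m_i, m_j)
print(round(m[SMOKE], 3))             # 0.643 = bl(smoke)
print(round(m[NO_SMOKE], 3))          # 0.286 = bl(no smoke)
print(round(m[SMOKE] + m[THETA], 3))  # 0.714 = pl(smoke)
```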

This combination rule focuses on those propositions that both bodies of evidence support. The rule can be applied as different data from various sensors flow in, and the decision-making algorithm can apply it to arrive at a logical conclusion.


Moreover, the combination rule can be applied iteratively, by taking one of the m’s to be an already combined (using the Dempster–Shafer combination rule) observation of two other sensors Sj and Sk. The Dempster–Shafer theory is closer to the way humans process information and reason based on the available set of data. The capability to assign uncertainty or ignorance to propositions is a powerful and important idea when dealing with a large spectrum of real-world problems that would otherwise seem intractable.

To recapitulate, this theory addresses the problem of measuring certainty by fundamentally distinguishing between the absence of evidence (uncertainty) and ignorance. On the contrary, in probability theory one is expected to quantify any kind of knowledge or hypothesis h in the form of a single number p(h). But one cannot always calculate the probabilities of all the variables that contribute to a certain phenomenon, or always know the values of the prior probabilities.

6 Discussion and Conclusion

Detecting hazards in the cockpit, and in the aircraft as a whole, is a challenging task. The NTSB, the FAA and other organizations, including the manufacturers of planes and other equipment, have been constantly upgrading potentially hazardous parts of aircraft. The aging of an aircraft, the many kilometers of wiring, swings in temperature and humidity, and many other factors lead to fire and smoke hazards. Faults once thought to be rare and benign are now being noticed in many passenger planes. The need for efficient and robust unsupervised detection, classification and localization of hazards using multiple sensors is of primary importance. Though a system is in place today, there is still a need for improvement, as recent incidents show.

Solutions to in-flight smoke and fire, particularly in the cockpit area, are not straightforward. Prognosis and prevention are better in this case than real-time diagnosis of the situation. Much research is going on in several areas of aircraft safety. Different computational tools are required to solve different engineering problems, like the problems of intelligent decision making and of inference or deduction from a set of facts. The technology readiness level (TRL) of development efforts in prognostics for rotating machinery, electronic power supplies and digital avionics is being studied now. Advanced diagnostics use model-based reasoning (MBR) and data mining. There are statistical methods to ensure data qualification, such as tests for stationarity of the data, presence of periodicities in the data, and normality of the data. These characteristics may or may not be exhibited by any or all of the hazard symptoms.

We discussed detecting hotspots in the cockpit in the light of the Swissair 111 and ValuJet 592 disasters. We focused on how distributed sensor networks could help in near real-time event detection, disambiguate faults and events, and use AI techniques such as Dempster–Shafer theory to evaluate the situation when data are multiple, missing and/or uncertain.


Fault-event disambiguation using the Iyengar–Krishnamachari method was discussed, which provides techniques to distinguish between faulty sensor measurements and interesting events in the deployed area; the optimized scheme was based on Bayesian probability. The use of Dempster–Shafer theory was discussed with an example to show its efficacy as a potential solution to this class of problems. Even though it is theoretically attractive, it should be noted that there are some disadvantages and outstanding issues with this theory, which have been addressed heavily in the recent past. Distributed sensor networks have the advantage of detecting hotspots in hard-to-reach places of the cockpit and other areas, and of providing the exact location of the symptoms. The use of AI techniques would further enable the crew to analyze the situation and make wise decisions in an emergency. Despite the existence of many other techniques, such as artificial neural networks (ANN), the transferable belief model (TBM) and more, we have focused on Dempster–Shafer theory as a possible solution to the problem of detecting cockpit hotspots. As pointed out earlier in the chapter, we reiterate that no matter what the solution to improve the monitoring and control of in-flight hazards is, it should be easy to test and should pass the stringent standardization process and many other practical hurdles, besides solving the engineering problem itself.

References

1. http://www.airlinesafety.com/faq/faq8.html
2. Code of Federal Regulations, 14 CFR Part 25.858.
3. Captain John M. Cox (2006), “Reducing the risk of smoke and fire in transport airplanes: past history, current risk and recommended mitigations”, The 23rd Annual International Aircraft Cabin Safety Symposium, Oklahoma City, 13–16 February 2006.
4. A. Konar (2005), “Computational Intelligence: Principles, Techniques and Applications”, Springer-Verlag, Berlin. ISBN 3-540-20898-4.
5. M. Schwabacher, J. Samuels and L. Brownston (2002), “The NASA integrated vehicle health management technology experiment for X-37”, in Proceedings of the SPIE AeroSense 2002 Symposium.
6. Dr. Celeste M. Belcastro and Cheryl L. Allen, “Aviation Safety Program, Integrated Vehicle Health Management, Technical Plan Summary”.
7. Joseph A. Castrigno, Stephen J. Engel and Barbara J. Gilmartin (Fall/Winter 2006), “Vehicle Health Management: Architecture and Technologies”, Technology Review Journal.
8. C.M. Belcastro, F. Chowdhury, Q. Cheng, J. Michels and P. Varshney (2005), “Distributed detection with data fusion for aircraft flight control computer malfunction monitoring”, AIAA Guidance, Navigation, and Control Conference and Exhibit, San Francisco, CA.
9. Christine M. Belcastro and Celeste M. Belcastro (2001), “Application of failure detection, identification and accommodation methods for improved aircraft safety”, Proceedings of the American Control Conference, Arlington, VA, June 25–27, 2001.
10. http://www.impact-tek.com/
11. J.C. Bezdek (1994), “What is computational intelligence?”, in Computational Intelligence Imitating Life, Zurada, J.M., Marks, R.J. and Robinson, C.J. (Eds.), IEEE Press, New York, pp. 1–12.
12. R.J. Marks (1993), “Intelligence: computational versus artificial”, IEEE Transactions on Neural Networks, 4:737–739.
13. J. Shaw (2000), A review of smoke and potential in-flight fire events in 1999. Washington, DC: Society of Automotive Engineers. Doc 185.
14. F. Jia, M.K. Patel and E.R. Galea (2004), “Simulating the Swissair Flight 111 in-flight fire using the CFD fire simulation software SMARTFIRE”, The Fourth Triennial International Fire and Cabin Safety Research Conference, Lisbon, Portugal.
15. International Air Transport Association (IATA) (2005), On-board fire analysis: From January 2002 to December 2004 inclusive. Quebec, Canada: Author. Doc 176.
16. P. Halfpenny (2002), IFSD probability analysis. Washington, DC: Author. Doc 6.
17. NTSB (1974, December 2), Aircraft accident report: Pan American World Airways, Inc., November 3, 1973 (NTSB-AAR-74-16). Washington, DC: NTSB. Doc 27.
18. Commission of Enquiry (1977, March), Aircraft accident: Cubana de Aviacion, DC8-43, October 6, 1976. Bridgetown, Barbados: Commission of Enquiry. Doc 136.
19. FAA (2005, November 23), NPRM: Reduction of fuel tank flammability in transport category airplanes; Proposed rule, 70(225), Federal Register, pp. 70922–70962. Doc 257.
20. Boeing Aero No. 14 (2000), In-flight smoke. Retrieved May 18, 2005, from http://www.boeing.com/commercial/aeromagazine/aero_14/inflight_story.html Doc 28.
21. TSBC (2003, March 27), Aviation investigation report: In-flight fire leading to collision with water, Swissair Flight 111, September 2, 1998. Quebec, Canada: TSBC. Doc 188.
22. International Air Transport Association (IATA) (2005), On-board fire analysis: From January 2002 to December 2004 inclusive. Quebec, Canada: Author. Doc 176.
23. Washington Post: In-Flight Fires an Unresolved Safety Threat, October 17, 2006.
24. http://www.onair.aero/
25. http://www.fas.org/irp/program/disseminate/tadil.html
26. http://www.boeing.com/defense-space/ic/jtrs/index.html
27. J. Hannifin, “Hazards Aloft”, Time, Feb. 22, 1993, p. 61.
28. http://www.securaplane.com/
29. http://www.raesystems.com/
30. http://www.rtca.org/
31. R. Shorey, A. Ananda, M. Choon Chan and W. Tsang Ooi (2006), “Mobile, Wireless and Sensor Networks: Technology, Applications and Future Directions”, IEEE Press/Wiley, New York. ISBN-10 0-471-71816-5.
32. B. Krishnamachari (2005), “Networking Wireless Sensors”, Cambridge University Press. ISBN-10 0-521-83847-9.
33. A. Hac (2003), “Wireless Sensor Network Designs”, Wiley, New York.
34. H. Karl and A. Willig (2005), “Protocols and Architectures for Wireless Sensor Networks”, Wiley, New York. ISBN 0-470-09510-5.
35. B. Krishnamachari and S. Iyengar (2004), “Distributed Bayesian algorithms for fault-tolerant event region detection in wireless sensor networks”, IEEE Transactions on Computers 53(3).
36. George F. Luger and William A. Stubblefield (1998), “Artificial Intelligence - Structures and Strategies for Complex Problem Solving”, 3rd Edition, Addison Wesley, Reading, MA. ISBN 0-805-31196-3.
37. A.P. Dempster (1968), A generalization of Bayesian inference, Journal of the Royal Statistical Society, Series B 30, 205–247.
38. G. Shafer (1976), A Mathematical Theory of Evidence. Princeton University Press.
39. K. Sentz and S. Ferson (April 2002), “Combination of Evidence in Dempster–Shafer Theory”, Sandia National Laboratories, SAND 2002-0835.
40. H. Wu, M. Siegel, R. Stiefelhagen and J. Yang (2002), “Sensor Fusion Using Dempster–Shafer Theory”, IEEE Instrumentation and Measurement Technology Conference, Anchorage, AK, USA, 21–23 May 2002.


Part V

Computational Intelligence in Video Processing


Anthropocentric Semantic Information Extraction from Movies

Nicholas Vretos, Vassilios Solachidis, and Ioannis Pitas

Department of Informatics, Aristotle University of Thessaloniki, P.O. Box 451, 54124 Thessaloniki, Greece
[email protected], [email protected], [email protected]

Summary. In this chapter we describe new methods for anthropocentric semantic video analysis, and concentrate our efforts on providing a uniform framework by which media analysis can be rendered more useful for retrieval applications as well as for applications based on human–computer interaction. The main idea behind anthropocentric video analysis is that a film is to be viewed as an artwork and not as a mere sequence of frames following one another. We show that this kind of analysis, which is a straightforward counterpart of human perception of a movie, can produce interesting results for the overall annotation of video content. “Anthropos”, the Greek word for “human”, shows the intent of our proposition to concentrate on the humans in a movie. Humans are the most essential part of a movie, and thus we track down all the important features that we can get from low-level and mid-level feature algorithms, such as face detection, face tracking, eye detection, visual speech recognition, 3D face reconstruction, face clustering, face verification and facial expression extraction. All these algorithms produce results which are stored in an MPEG-7-inspired description scheme set that implements the way humans connect those features. As a result, we have structured information on all the features that can be found for a specific human (e.g. an actor). As will be shown in this chapter, this approach, mirroring human perception, provides a new way of analyzing media at the semantic level.

1 Introduction

Humans (actors) are the most important entity in most movies. This chapter describes techniques for extracting semantic information regarding human actors for movie content description. Image and video processing algorithms have attained a certain maturity, so that we are able to attempt semantic information extraction from movies. On one side, many algorithms have been developed that extract low-level features and achieve very good performance. On the other side, the semantic gap between low- and high-level features has not yet been bridged, despite the efforts towards that end. Therefore, during the last years, significant effort has been concentrated on movie content description using automatically created high-level information.



The main argument for extracting high-level features is that a movie is not just a video file; it should be considered a work of art that has been created according to a narrative structure and certain cinematographic rules. Novel media technologies require movie semantic information retrieval in order to provide new products and services. Examples are IPTV, indexing and retrieval applications, personalized multimedia services, home videos, interactive/flexible television, 3DTV and non-linear movie narrative creation.

Semantic video information extraction can be subdivided into three main levels: high, medium and low. These three levels of abstraction serve to characterize the different amounts of information that can be extracted from different supports. At the low level, purely technical characteristics of a video are extracted, like the dominant color of a frame, the histogram of a frame, the frequency spectrum of a frame, the frame ratio, the fps (frames per second), etc. The medium level contains information which is better interpretable by humans but still lacks contextual information. Such information comes from, e.g., face detection, face tracking, shot cut detection and other algorithms which operate on video and/or low-level features. Finally, high-level information can summarize the video narrative by employing contextual information. It uses the results of, e.g., face clustering, face recognition and scene boundary detection to describe human actions, status and interaction (e.g. dialog), towards describing the movie narrative in a formal way.

In all these cases, the ultimate goal of movie content analysis and description is to describe human status, actions and interactions with other humans and with the context (e.g. the physical scene). We call this approach to (semi-)automatic video analysis anthropocentric (human-centered). The term comes from the Greek word “Anthropos” (human being). Anthropocentric approaches have been given serious attention during the last years, under the consideration that humans are the most important “object” in a movie. The main algorithms which provide anthropocentric data, as well as the way they connect to each other, are shown in Fig. 1.

Face detection and tracking (FD & FT) are essential tasks in anthropocentric semantics extraction. They are used at the intermediate level, and their results are the basis for almost all other anthropocentric semantics extraction tasks. They discover and track human faces within a video. A very good review of these tasks can be found in [1–5].

Facial feature extraction (FFE) is a very useful tool in semantic extraction, due to the fact that facial features like eyes, lips and mouth can contribute to many algorithms that provide semantic information, such as facial expression analysis, dialog detection or FR. Recent research efforts tackle this problem either in a holistic way, where all possible facial features are extracted, as in [6], or on a per-facial-feature basis, where each time a different feature is detected [7, 8].


Fig. 1. Anthropocentric analysis framework. Basic modules interconnection scheme

Visual speech recognition (VSR) is a domain of research in which we try to visually understand whether an individual is talking, by employing video information only. This method is usually combined with audio speech recognition (ASR), in order to make the latter robust in noisy environments. It uses mouth tracking and viseme detection to infer what the speaker is talking about. Many attempts have been undertaken to tackle this problem [9, 10].

3D face reconstruction (3DFR) from uncalibrated images is an important problem of video processing, due to the fact that it is very useful in many other tasks, such as face verification, FFE, FC and pose estimation. In [11, 12] there are many interesting algorithms which try to tackle the problem using different approaches. As will be detailed later, though, 3DFR results are still quite modest, due to the poor quality of the input images.

Face clustering (FC) is a young, yet prominent, approach in semantic extraction and aims at the categorization of faces before actual face recognition, since in some cases we are interested in knowing the number of the actors in a scene rather than their identities. It is a method which clusters actor appearances across different video frames.


There are recent attempts to tackle the problem with some very interesting results, as we will see later on [13–16].

Face recognition (FR) is a processing task which fulfills the need for a semantic interpretation of movie content. Numerous attempts have been undertaken in recent years [17–20]. Two classes of FR methods exist: the appearance-based and the model-based.

Facial expression analysis (FEA) is a domain where the goal is to be able to characterize the expressions of a human face. Towards that end, psychologists defined a set of six basic expressions (anger, disgust, fear, happiness, sadness and surprise), whose combinations produce any “other” facial expression [21]. In order to make the recognition procedure more standardized, a set of muscle movements, known as action units, was created by psychologists, thus forming the so-called facial action coding system (FACS) [22]. These action units are combined into facial expressions according to the rules proposed in [23].

All of the above methods mostly reside at the medium level of semantic categorization. In order to extract high-level information from these semantics, we have to combine them in a contextual way. Furthermore, we need a way of storing the results in a form perceptible by humans. We shall present in this chapter a way to do so, called the anthropocentric video description scheme (AVCD), which aims at storing low- and mid-level information in such a way that high-level information can be constructed. High-level information queries can then be applied in a retrieval system, and this structure can be used to answer them with the appropriate video file. The main objective is the way of combining low-level features in order to retrieve high-level information. In this chapter, image and video processing algorithms, as well as novel data structures for semantic information extraction from movies, are discussed. The state of the art in this domain is presented, with emphasis on video feature extraction algorithms and related applications.

2 Face Detection and Tracking

Video-based tracking of the motion of the human body has been a challenging research topic with applications in many domains such as human–computer interaction, surveillance, hand gesture recognition and 3D reconstruction. Such a task is usually preceded by an initialization step that aims at detecting the presence of people, notably faces. The latter has often been tackled by face detection. However, pose variations (frontal, profile and intermediate poses), skin-color variations, facial structural components (moustache, beards and glasses), occlusion and poor or variable imaging conditions make this task a rather difficult one. For details on face detection methods, the reader is referred to [1].

Tracking techniques can be divided into active and passive tracking. For a review of the former, [3] is recommended. Computer vision researchers have long been trying to achieve results comparable to active tracking using passive techniques, in an effort to produce generally applicable motion tracking systems for uncontrolled (indoor or outdoor) environments. For a comprehensive review of passive tracking methods, the reader is referred to [4, 5].

We shall present a system that aims at robust face detection and tracking, as well as object tracking. More details on this face detector/tracker can be found in [3]. This approach for face detection was motivated by [24] and [25] and involves fusion of the information available from two separate detectors, in order to produce more accurate results than each detector alone, as well as to complement each other in case of failures. The tracking algorithm of this system is a variant of the Kanade–Lucas–Tomasi tracker [26], capable of dealing with still or slowly moving video objects. This system can operate in two different modes (automatic and semi-automatic) and is capable of tracking either automatically detected faces or any other manually selected object(s) of interest. In the semi-automatic mode, user intervention is required to initialize the regions to be tracked in the first frame of the video sequence. Manual intervention is also allowed in other cases, such as the initialization of the tracking algorithm for new faces entering the scene, re-initialization if any of the tracked faces is lost and correction of erroneous tracking results. The latter refers to stopping the tracking of erroneously detected objects, as well as correcting the tracked region, so as not to contain portions of the background. Obviously, in case of manual initialization, the system can be used to track any object(s) of interest, other than faces. In its default configuration, it can cope with a range of different environments. However, a number of parameters can be fine-tuned. An overview of the system is illustrated in Fig. 2a.

Novel contributions of this method include the addition of a color-based thresholding step to the frontal face detector presented in [25], in order to reduce false detections in complex scenes. Additional geometrical criteria, as well as a facial feature extraction step, are also employed in order to make a color-based face detection algorithm similar to the one presented in [24] more robust to false detections. Moreover, a fusion scheme that combines the results of the two separate detectors is developed, aiming at reliable detection of faces in various poses (frontal, profile, intermediate) and orientations.

2.1 Face Detection Based on Fusion

The face detection module of this system employs two different face detection algorithms, based on color [24] and Haar-like features [25], respectively. A fusion scheme that combines the two algorithms and employs additional decision criteria to improve the detection rate and reduce false detections is incorporated, in order to handle as many different detection scenarios as possible. Fusion is essential, because an automatic system for face detection, especially when applied as an initialization step in a system for tracking people, should be able to cope with frontal to profile face poses, as well as different orientations.


Fig. 2. Schematic diagrams: (a) overall system; (b) detection module; and (c) tracking module


However, the computational efficiency should be high enough to allow for fast detection and not limit its applicability in real-world environments.

2.2 Color-Based Face Detection

Using color as the primary source of information for skin detection has been a favorable choice among researchers. Consequently, there have been a number of attempts to determine the optimum color space for skin segmentation. Researchers have concluded that the skin color distribution forms a cluster (the so-called skin locus) in various color spaces [27, 28], which is, however, camera-specific. For a comprehensive discussion of skin color detection techniques, the reader is referred to [29].

The color-based algorithm used is similar to the one in [24]. Skin segmentation is performed in the hue-saturation-value (HSV) color space, which has been popular due to its inherent relation to the human perception of color. Moreover, the V component (intensity) is ignored, in order to obtain at least partial robustness against illumination changes, resulting in a 2D color space. Instead of modelling the skin color distribution using non-parametric methods, such as lookup tables (LUT), Bayesian classifiers or self-organizing maps, or parametric methods (single Gaussian, mixture of Gaussians or even multiple Gaussian clusters), the presented method employs a skin classifier that explicitly defines the boundaries of the skin cluster in the HS(V) color space.

The input image is first converted into the HSV color space. The H, S values of all the individual pixels are tested against appropriate thresholds (these thresholds are used similarly to the ones in [24]):

f(h) = \begin{cases} 1, & 0 < h < 0.15 \\ 0, & \text{otherwise} \end{cases} \qquad (1)

and

g(s) = \begin{cases} 1, & 0.2 < s < 0.6 \\ 0, & \text{otherwise} \end{cases} \qquad (2)

with h and s values in the interval [0, 1]. A pixel will be classified as skin-like only if f(h) · g(s) = 1. Such a method is attractive because of its simplicity and the ability to construct very fast skin color classifiers. Since the detection method involves a combination of two detectors, it is essential that the computational burden is kept low.
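As an illustration, a minimal sketch of this explicit-boundary classifier in Python, assuming OpenCV's 8-bit HSV encoding (H in [0, 179], S in [0, 255]) rescaled to [0, 1] before applying the thresholds of (1) and (2):

```python
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Explicit-boundary skin classifier of (1) and (2), as a sketch.
    OpenCV stores 8-bit H in [0, 179] and S in [0, 255], so both are
    rescaled to [0, 1] first; V is ignored for illumination robustness."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h = hsv[:, :, 0].astype(np.float32) / 179.0
    s = hsv[:, :, 1].astype(np.float32) / 255.0
    f = (h > 0.0) & (h < 0.15)    # hue test f(h), (1)
    g = (s > 0.2) & (s < 0.6)     # saturation test g(s), (2)
    return f & g                  # skin-like iff f(h) * g(s) = 1
```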

The skin segmentation results are morphologically processed [24]. Connected component analysis is the next step. The number of contour points of each connected component is tested against a threshold, to ensure that the subsequent ellipse fitting process is applied only to large enough regions. The shape of each connected component is then examined by an ellipse fitting algorithm to further reduce the number of candidate regions. The best-fit ellipse is computed using the general conic-fitting method presented in [30], with additional constraints to fit an ellipse to scattered data. Additional decision criteria (ellipse orientation, ratio of the ellipse axes, area occupied by the ellipse) are incorporated to ensure that invalid ellipses will not be fit. The thresholds for the criteria, determined by experimentation, are the following: N > 10·scale, 1.6 < b/a < 2.5, A > 36·scale, 45° < θ < 135°, where N is the number of contour points of the connected component, a and b denote the lengths of the minor and major axes of the ellipse, respectively, A is the area occupied by the ellipse, θ is the angle between the horizontal axis and the major ellipse axis (i.e. the orientation of the ellipse), in degrees, and scale is a parameter associated with the size of the input images.
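The sketch below mirrors this candidate filtering stage; OpenCV's least-squares cv2.fitEllipse stands in for the conic-fitting method of [30], and its angle convention only approximates the θ defined above, so the function is illustrative rather than a reimplementation:

```python
import math
import cv2

def candidate_face_regions(skin_mask, scale=1.0):
    """Filter skin-like connected components with the thresholds quoted
    above. A sketch only: cv2.fitEllipse replaces the method of [30]."""
    contours, _ = cv2.findContours(skin_mask.astype('uint8'),       # OpenCV 4
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    accepted = []
    for c in contours:
        n = len(c)                           # contour points, N > 10*scale
        if n <= 10 * scale or n < 5:         # fitEllipse needs >= 5 points
            continue
        (cx, cy), (d1, d2), theta = cv2.fitEllipse(c)
        a, b = min(d1, d2) / 2.0, max(d1, d2) / 2.0  # semi-minor / semi-major
        area = math.pi * a * b                       # A, area of the ellipse
        if 1.6 < b / a < 2.5 and area > 36 * scale and 45 < theta < 135:
            accepted.append(((cx, cy), (2 * a, 2 * b), theta))
    return accepted
```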

Color-based detectors suffer from false detections, due to the presence of other foreground or even background objects that exhibit similar color and shape properties with, e.g. faces or hands. For this reason, the resulting candidate regions are then subjected to a facial feature extraction process to reduce false detections. The first order derivative of the input image I with respect to the vertical axis is calculated by applying an extended Sobel operator.

The resulting image J is then thresholded to produce a binary image B, according to:

B(i, j) = \begin{cases} 1, & J(i, j) > \bar{J} \\ 0, & \text{otherwise,} \end{cases} \qquad (3)

where \bar{J} denotes the average grayscale value of all image pixels. The algorithm can correctly detect frontal faces. However, skin-like areas irrelevant to the subsequent tracking process can often be included in the detected faces (i.e. the neck of the subjects), as can be seen in Fig. 3a. This can cause problems to the tracking module. The algorithm will fail in rare cases (e.g. if the subject wears clothes with skin-like colors, folds in the clothes can potentially confuse the detector), as illustrated in Fig. 3b,c.

Face Detection Based on Haar-Like Features

The second detector used is the frontal face detector of [25], which achieves very good results on frontal test datasets. Exposure to real-world conditions might produce false detections, as illustrated in Fig. 3d. To overcome false detections, the algorithm is modified so as to include a color-based thresholding step, identical to the initial skin-like segmentation step of the color-based detection algorithm, as specified by (1) and (2), but applied to each detected face region instead of the whole image. Since a face in any pose or orientation should contain a large portion of skin, thresholding on the number of skin-like pixels is also employed. This eliminates false detections associated with the background, while maintaining all correctly detected faces, as can easily be seen in Fig. 3e. The algorithm can correctly detect frontal faces, but irrelevant areas (portions of the background) might be included in the detected faces.

Fusion of Color-Based and Feature-Based Detectors

The problem of detection is essentially split into two separate tasks: frontal and non-frontal face detection. The frontal case is mainly handled by the frontal face detector used in [25], modified by incorporating the color-based thresholding step described earlier. The color-based face detection scheme described earlier is responsible for detecting faces in different poses and orientations, as well as for supplementing the results of the frontal face detector. The combined algorithm proceeds as follows. Both algorithms are applied to the input image. The intersections of the frontal face regions detected by both detectors are the ones accepted as frontal faces. However, there are cases when either of the two detectors will detect frontal faces that the other one has missed. These additional faces are also accepted. Finally, the color-based detector is responsible for detecting faces in poses and orientations other than frontal and upright. The result of "fusing" the two detectors is illustrated in Fig. 3f, where it can be clearly seen that the original "erroneous" facial regions of both detectors that contained background or "irrelevant" pixels (Fig. 3a,e) have been corrected. Results are very good, as illustrated in Fig. 4. A schematic description of the overall detection module is depicted in Fig. 2b.

Fig. 3. Face detection. (a) Erroneous detection regions (including the subject's neck), produced by the color-based detector, (b–c) false detections produced by the color-based detector, (d) false detections produced by the feature-based detector, (e) elimination of false detections by means of a skin-like threshold, (f) results of fusing the two detectors

Fig. 4. Correct detections produced by the fusion of the two detectors in sample frames of a video sequence
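A sketch of the fusion rule just described; the box representation and the overlap (IoU) threshold used to decide that the two detectors agree are assumptions, not values from the chapter:

```python
def fuse_detections(frontal_boxes, color_boxes, iou_thresh=0.3):
    """Combine the Haar-based frontal detector with the color-based one:
    overlapping detections are intersected and accepted as frontal faces,
    unmatched detections from either detector are kept as well.
    Boxes are (x1, y1, x2, y2) tuples."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter + 1e-9)

    fused, matched = [], set()
    for fb in frontal_boxes:
        best = max(color_boxes, key=lambda cb: iou(fb, cb), default=None)
        if best is not None and iou(fb, best) > iou_thresh:
            matched.add(tuple(best))
            # accept the intersection of the two regions as a frontal face
            fused.append((max(fb[0], best[0]), max(fb[1], best[1]),
                          min(fb[2], best[2]), min(fb[3], best[3])))
        else:
            fused.append(fb)       # frontal face missed by the color detector
    # color-only detections cover non-frontal poses and orientations
    fused += [cb for cb in color_boxes if tuple(cb) not in matched]
    return fused
```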

2.3 Region-Based Feature Tracking

The algorithm used for tracking faces (or other regions of interest) is based on selecting a large number of point features in the tracked region, which are subsequently tracked in the next frames. Tracking is initialized either manually or with the output of the detection module, i.e. the bounding box(es) of the area(s) corresponding to the detected face(s). The result of the tracking algorithm is specified as the bounding rectangle of all the tracked features. Point features are tracked using the Kanade–Lucas–Tomasi (KLT) algorithm [26]. The displacement d = [dx dy]^T between two feature windows on video frames I and J is obtained by minimizing:


\varepsilon = \iint_W \left[ J\left(\mathbf{x} + \frac{\mathbf{d}}{2}\right) - I\left(\mathbf{x} - \frac{\mathbf{d}}{2}\right) \right]^2 w(\mathbf{x})\, d\mathbf{x}, \qquad (4)

where x = [x, y]^T, W is the region of the window and w(x) is a weighting function. In order to perform one iteration of the minimization procedure of (4), the equation Zd = e must be solved, where [26]:

Z = \iint_W \mathbf{g}(\mathbf{x})\, \mathbf{g}^T(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x}, \qquad (5)

e = 2 \iint_W [I(\mathbf{x}) - J(\mathbf{x})]\, \mathbf{g}(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x}, \qquad (6)

and

\mathbf{g} = \left[ \frac{\partial(I+J)}{\partial x} \;\; \frac{\partial(I+J)}{\partial y} \right]^T. \qquad (7)

To eliminate background features from the tracking process, a clustering procedure is applied [31]. Let (µx, µy), (σx, σy) be the mean and standard deviation of the feature coordinates for all features in frame t and [x, y]^T the coordinates of some feature. This feature is retained in frame t + 1 if x ∈ [µx − σx, µx + σx] and y ∈ [µy − σy, µy + σy]; otherwise it is rejected. Assuming that the tracked object features have similar motion patterns, this enables the algorithm to reject stationary or slowly moving background features after a number of frames. This is particularly useful if the region used for tracking initialization contains a portion of background, as can be seen in Fig. 3e.
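A minimal one-step sketch of this tracker using OpenCV's pyramidal KLT implementation, including the one-standard-deviation rejection rule; feature initialization (e.g. cv2.goodFeaturesToTrack) is assumed to have happened inside the detected face box:

```python
import cv2
import numpy as np

def track_step(prev_gray, next_gray, points):
    """One KLT tracking step with background rejection, as a sketch.
    `points` is an (N, 2) array of feature locations in `prev_gray`."""
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray,
        points.reshape(-1, 1, 2).astype(np.float32), None)
    nxt = nxt.reshape(-1, 2)[status.ravel() == 1]
    # clustering rule of [31]: keep features within one standard deviation
    # of the mean feature position, rejecting background stragglers
    mu, sigma = nxt.mean(axis=0), nxt.std(axis=0)
    nxt = nxt[np.all(np.abs(nxt - mu) <= sigma, axis=1)]
    x1, y1 = nxt.min(axis=0)          # tracking result: bounding rectangle
    x2, y2 = nxt.max(axis=0)          # of the surviving point features
    return nxt, (x1, y1, x2, y2)
```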

Feature generation is based on the algorithm used for point feature tracking [26], where a good feature is defined as one whose matrix Z has two large eigenvalues that do not differ by several orders of magnitude. Such a feature assures that the equation Zd = e is well conditioned. It can be shown that the large eigenvalue prerequisite implies that the partial derivatives ∂(I+J)/∂x and ∂(I+J)/∂y are large [26].

To overcome the problem of feature loss, especially when the amount of motion between two subsequent frames is above average, the number of features in each tracked region is checked in each frame against a specified threshold. If the number falls below the threshold, features are regenerated. Feature regeneration also takes place at regular intervals, in an effort to further enhance the tracking process.

There are cases, however, when tracking failure will occur, i.e. when a face is lost in a frame. To cope with such problems, re-detection is employed using the combined face detection algorithm presented earlier. However, if any of the detected faces coincides with any of the faces already being tracked, the latter are kept, while the former are discarded from any further processing. Re-detection is also periodically applied to account for new faces entering the camera's field-of-view. The schematic description of the tracking module is illustrated in Fig. 2c.


3 Eye Detection

The field of eye detection has been very active in recent years and a lot of different approaches have been proposed. Zhou et al. [32] use the idea of generalized projection functions (GPF) to locate the eye centers in an approximate area found using the algorithm presented in [33]. Jesorsky et al. [34] use a three stage technique: first, the face is initially localized using the Hausdorff distance [35]; second, a refinement is performed, taking into account the estimated area of the face; and, third, a multi-layer perceptron (MLP) is applied for a more exact localization of the pupils. Cristinacce et al. [36] used a multi-stage approach to detect 17 features on the human face, including the eyes. First, a face detector is applied, then the pairwise reinforcement of feature responses (PRFR) is applied to detect features. Refinement is made using a version of the active appearance model (AAM) search.

We will present a new method [37] for eye detection, which detects the eye regions on a face based on geometric information from the eye and the surrounding area. This method performs better because pixel intensity information might prove unreliable, due to varying illumination conditions as well as eye details.

3.1 Eye Region Detection

Method Overview

In [38], the standard PCA method was applied on the intensity of facial images to derive the so-called eigenfaces for face recognition purposes. Here, the same idea is used for eye region detection, though on data of a different nature. The Canny edge detector [39] is used, because it can be adjusted so that it finds edges of varying intensity. For each pixel, a vector pointing to the closest edge pixel is calculated. The magnitude (length) and the slope (angle) of each vector are the two values assigned to each pixel. Thus, instead of the intensity values, the vector length and angle maps of an image are produced, as shown in Fig. 5. PCA is then applied on a set of training eye images to derive eigenvectors for these maps. Subsequently, in order to detect the eyes on a facial region, the length and angle maps of candidate regions are projected on the subspaces spanned by the eigenvectors found during training, and the similarity of the projection weights to those of model eyes is used to declare an eye presence.
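A sketch of how the two maps can be computed; using SciPy's Euclidean distance transform to locate the closest edge pixel is an implementation choice made here, not necessarily the authors', and the Canny thresholds are the face-region values quoted later in the text:

```python
import cv2
import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_vector_maps(gray, low=25, high=50):
    """Per-pixel length and angle of the vector to the closest Canny edge
    pixel: the two maps on which PCA is applied. A sketch."""
    edges = cv2.Canny(gray, low, high) > 0
    # distance to the nearest edge pixel and that pixel's coordinates
    dist, inds = distance_transform_edt(~edges, return_indices=True)
    iy, ix = inds
    yy, xx = np.indices(gray.shape)
    angle = np.arctan2(iy - yy, ix - xx)    # slope (angle) of each vector
    return dist, angle                      # length map and angle map
```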

Extraction of Eigenvectors Using Training Data

The training images are scaled to the dimensions N × M and the Canny edge detector is then applied. The corresponding vector length and vector angle maps form two separate N × M matrices, which can alternatively be considered as two one-dimensional vectors of dimension L = NM.

Fig. 5. Training data. (a) The left eye training images. (b) The right eye training images. (c) The length maps for the left eye image. (d) The length maps for the right eye image. (e) The angle maps for the left eye image. (f) The angle maps for the right eye image

Therefore, a normalization of these matrices is applied, by subtracting from each of them the respective average matrix. Then, PCA is applied on the length and angle maps of the training images, resulting in eigenvectors U_{Ra,i}, U_{Rl,i}, U_{La,i}, U_{Ll,i}, 1 ≤ i ≤ K (where K is the cardinality of the training set) that correspond to the angle and length maps of the right and left eye. The dimension of each eigenvector is NM. Despite the fact that the use of standard PCA on angular data is not very well grounded, due to their periodic nature, the obtained results are very good and much better than using only length information.

Eye Region Detection Framework

Prior to eye detection, a face detector has to be applied on the video frame. The detected face area is then scaled to a certain dimension (in pixels). Subsequently, edge detection is performed. Since the edges related to eyes and eyebrows are among the most prominent in a face, the parameter values of the Canny detector can be set as follows: high threshold = 50, low threshold = 25, sigma = 1 [40]. This way only the most significant edges will be detected, as shown in Fig. 6. A visual representation of the vector length and angle maps of the face can be seen in Fig. 6c,d. All the areas of size N × M are examined and the vectors containing the weights that project the length and angle maps Φ_{Ra}, Φ_{Rl}, Φ_{La}, Φ_{Ll} of the area on the corresponding spaces spanned by the eigenvectors calculated during the training stage are found. The projection vector elements w_{Ra,i}, w_{Rl,i}, w_{La,i}, w_{Ll,i} for a given N × M area are found as follows:

w_{Ra,i} = U_{Ra,i}^T Φ_{Ra}, \qquad (8)

w_{Rl,i} = U_{Rl,i}^T Φ_{Rl}, \qquad (9)

w_{La,i} = U_{La,i}^T Φ_{La}, \qquad (10)

w_{Ll,i} = U_{Ll,i}^T Φ_{Ll}. \qquad (11)

Fig. 6. (a) Detected face, (b) thresholded Canny edge detector output, (c) vector magnitude map, (d) vector slope map

At the end, each area is represented by two K-dimensional vectors, composed of the weights needed to project the NM-dimensional vectors of angles and lengths on the respective K-dimensional space. To proceed with the detection, artificial eye-model templates of dimensions N × M pixels can be used. The same procedure is followed for each of these two templates, i.e. the vector length map and vector angle map are derived and the respective projection weights w_{Ra,Model}, w_{Rl,Model}, w_{La,Model}, w_{Ll,Model} are calculated. For each region of size N × M pixels within the facial area, the projection weights w_{Ra,i}, w_{Rl,i}, w_{La,i}, w_{Ll,i} are then compared to those of the model eyes using the L2 norm:

L_{2,R} = ‖w_{Ra} − w_{Ra,Model}‖ + ‖w_{Rl} − w_{Rl,Model}‖, \qquad (12)

L_{2,L} = ‖w_{La} − w_{La,Model}‖ + ‖w_{Ll} − w_{Ll,Model}‖. \qquad (13)

The facial areas with the smallest distance from the model eyes are the ones at which the eyes are located. To make the algorithm faster, we utilized knowledge of the approximate positions of the eyes on a face. Thus, we searched for the eyes only in a zone in the upper part of the detected face.
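A sketch of the matching step for one candidate region, following (8)-(13); the dictionary containers (keyed by 'len' and 'ang') and argument names are hypothetical conveniences, not the authors' interfaces:

```python
import numpy as np

def eye_score(region_maps, eigvecs, mean_maps, model_weights):
    """Score one N x M candidate region against a model eye: project its
    mean-subtracted length and angle maps onto the trained eigenvectors
    and sum the L2 distances to the model's projection weights. The
    region with the smallest score locates the eye."""
    score = 0.0
    for key in ('len', 'ang'):
        phi = (region_maps[key] - mean_maps[key]).ravel()  # Phi, length NM
        w = eigvecs[key].T @ phi                           # weights, (8)-(11)
        score += np.linalg.norm(w - model_weights[key])    # as in (12)-(13)
    return score
```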

3.2 Eye Center Localization

In order to localize the eye center, we first apply the Canny edge detector with parameters high threshold = 13, low threshold = 3, and sigma = 1, which result in the detection of even weaker edges. This is done in order to handle cases of people wearing glasses. In such a case, severe reflections on the glasses might make eye characteristics less visible.

Within this area, we search for the three pairs of lines (three horizontal and three vertical) having the most intersections with edges (Fig. 7). The intersection of the horizontal line that has the medium number of edge intersections among the three selected horizontal lines with the vertical line with the same characteristic (black lines in Fig. 7) was found to give a very good approximation of the eye center. We can further refine this result by applying the same search method (using vertical and horizontal lines) around the found point within a smaller area. For even more refined results, the information that the iris of the eye appears to be the darkest area near the point found at the previous step can be used in order to place the eye center exactly in the iris center. Furthermore, we can use the position of the eye center found in the right eye area to locate the left eye center. Due to facial symmetries, the horizontal position of the left eye with respect to the upper and lower boundary of the eye area should be similar to that of the right eye. Using this observation, we can search for the left eye's center within a region, centered with respect to the vertical dimension, around the vertical position of the right eye in its area.

Fig. 7. Eye center localization
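A compact sketch of this line-search rule for one eye region, with the Canny thresholds quoted above; tie handling and the subsequent iris-based refinement are omitted:

```python
import cv2
import numpy as np

def eye_center(eye_gray):
    """Approximate the eye center: among the three rows and three columns
    crossing the most edge pixels, intersect the row and column with the
    medium count. A sketch of the rule described in the text."""
    edges = cv2.Canny(eye_gray, 3, 13) > 0
    rows = np.argsort(edges.sum(axis=1))[-3:]   # three best horizontal lines
    cols = np.argsort(edges.sum(axis=0))[-3:]   # three best vertical lines
    # argsort is ascending, so index 1 of each triple is the medium-count line
    return int(cols[1]), int(rows[1])           # (x, y) of the estimated center
```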

4 Visual Speech Recognition

Speech analysis systems have attracted increased attention in recent research efforts. At first, the focus was solely on the audio information. However, visual cues are currently being incorporated in speech analysis systems, providing supplementary information in the analysis process. In [9], the authors argue that a major improvement can be obtained by using joint audio-visual processing, compared to the sole processing of the audio information. Indeed, seeing the face of a speaking person facilitates the intelligibility of the speech, particularly in noisy environments. Laboratory studies have shown that visual information allows a tolerance of an extra 4 dB of noise in the acoustic signal [10]. This is a significant improvement, considering that each dB of signal-to-noise ratio translates into a 10–15% error reduction in the intelligibility of complete sentences [41].

The main research topic in this area is automatic visual or audio–visual speech recognition (ASR) [42]. Methods for speech intent detection for human–computer interaction [43] and multi-modal determination of speaker location and focus [44] have also been proposed.

In human-to-human interaction, lip-reading performance depends on a number of factors [9]. Viewing conditions affect the quality of the visual information. For instance, poor lighting causes difficulties in determining the mouth shape. Furthermore, as the speaker and the listener move further apart, it becomes more difficult to observe important visual cues. Finally, the viewing angle has a major effect on the recognition performance. Inevitably, these limitations are inherited by automatic visual speech analysis systems.

In this paragraph, we present a statistical approach for visual speech detection, using mouth region intensities. Our method employs face and mouth region detectors, applying signal detection algorithms to determine lip activity. The proposed system can be used for speech intent detection and speaker determination in human–computer interaction applications, as well as in video telephony and video conferencing systems. It can also be used as a component in a dialogue detection system for movies and TV programs. Such a system can be useful in multimedia data management or semantic video annotation applications.

4.1 Motivation

In [45], a method based on the significant variation of the intensity values of the mouth region that a speaking person demonstrates is presented. Specifically, as can be seen in Fig. 8, the opening of the mouth produces a radical increase in the number of pixels with low intensity values, due to the shade in the oral cavity. Therefore, we argue that a large number of mouth region pixels exhibiting low intensity values can indicate lip activity and that this fact can be used for the visual detection of speech.

We denote by x the number of the low intensity pixels of the mouth region at a single video frame. In particular, x is the total number of the pixels in the mouth region whose grayscale value is below an intensity threshold T_I. Since video excerpts from different movies, TV programs, or personal cameras are acquired in diverse lighting conditions, we do not apply a global threshold for all videos, but a video specific threshold, computed prior to the analysis of each video sequence. In order to normalize the value of x for different sizes of the bounding box of the mouth region, we divide its value by the area of the bounding box. Thus, for a video sequence that consists of M frames, we create a discrete sequence x[n], 0 ≤ n ≤ M − 1.

Fig. 8. Increase in the number of low intensity pixels in the mouth region when the mouth is open. (a) Closed mouth. (b) Open mouth. (c) Closed mouth histogram. (d) Open mouth histogram

Fig. 9. Distribution of the number of low grayscale intensity pixels of a video sequence. The rectangle encompasses the frames where the person is speaking
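A sketch of how x[n] can be computed from per-frame grayscale mouth crops; the factor of one half follows the system description in Sect. 4.2, while the iterative threshold adjustment mentioned there is omitted:

```python
import numpy as np

def low_intensity_sequence(mouth_rois, t_factor=0.5):
    """Build the discrete sequence x[n]: count pixels below the
    video-specific threshold T_I and normalize by the bounding-box area.
    T_I is initialized as half the average intensity of the first
    frame's mouth region, per the system overview."""
    t_i = t_factor * mouth_rois[0].mean()
    x = [np.count_nonzero(roi < t_i) / float(roi.size) for roi in mouth_rois]
    return np.asarray(x)
```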

In Fig. 9 we depict x[n] for a video sequence displaying a person that is silent at first, speaks for a number of frames (the frames included in the rectangle drawn in Fig. 9) and then is silent again. It is obvious that x[n] obtains much higher values when the person is speaking. Moreover, x[n] exhibits a larger deviation of its values in frames where a person is speaking, due to the moving lips that affect the visible area of the mouth cavity. For instance, at frame 39, x[n] takes a very small value, even smaller than the values of some of the silent frames. This is because at this particular instance the person speaking has his lips joined together to produce the letter "m". In the silent frames, the values are much lower (on average) and exhibit a small deviation from their mean value. A statistical approach for the efficient detection of visual speech can thus exploit the attributes that a video sequence of a speaking person exhibits, in particular:

• The increased values of x[n]
• The large deviation of x[n]

which are present at the video frames where a person is speaking.

4.2 System Overview For Visual Speech Recognition (VSR)

A typical VSR system consists of three main parts:

• Face detection
• Mouth region detection
• Visual speech detection.


Before applying our visual speech detection algorithm, we first have to detect the face in the video sequence under examination, and then assign to each frame a bounding box encompassing the mouth region of the detected face. A face detector based on the techniques presented in [25, 46, 47] can be used for this task.

For the detection of the mouth region, we use the technique described in [48] for eye detection, modified to detect mouth regions in facial images. In [48], each pixel is assigned the slope and the magnitude of the vector from the pixel to the closest edge point. Thus, a slope and a magnitude map are formed for each candidate region. Eye detection is performed by comparing these maps against the corresponding maps of an eye model, in a suitable space derived through PCA. A similar approach, employing a mouth model, is applied for mouth region detection.

The visual speech detection system is based on statistical algorithms used in signal detection applications. At first, the intensity threshold is determined as half the average intensity of the mouth region in the first frame, and the distribution of the number of pixels below it is computed. The intensity threshold is increased iteratively when it cannot provide sufficient information about the intensity values of interest, i.e. when the threshold is low and the number of the selected pixels is inadequate. The speaking and non-speaking intervals are determined by applying an energy detector and an averager to a sliding window, which moves frame-by-frame, spanning the whole video sequence. The combined outcome of the detectors (for every window) is compared to a threshold in order to determine the presence of visual speech. This threshold is computed according to the Neyman–Pearson lemma for each video sequence and it depends on the distribution of the silent frames.

4.3 Visual Speech Detection Algorithm

The efficient determination of speaking and non-speaking intervals is based on statistical signal processing principles, incorporating detection theory algorithms. Our aim is to decide between two possible hypotheses: visual speech presence versus visual speech absence. We can translate our hypotheses into a problem of signal detection within noise. We consider as noise the value of x when the mouth is closed, i.e. the value corresponding to the area of the lips, and as signal the contribution to the value of x of the area of the oral cavity that is revealed when a person is speaking. Hence, in both hypotheses there is noise present (the pixels of the lip area), whereas, when the person is speaking, there is signal present as well. Consequently, our hypotheses can be stated as follows:

H0: Noise only (visual silence)
H1: Signal and noise (visual speech)

Both the signal and the noise samples are obtained as the sum of a number of pixels whose intensity is below T_I. Thus, according to the central limit theorem, we can consider that the data samples x[n] follow Gaussian distributions under both hypotheses. Therefore, in order to discern between visual speech and silence, we can apply the detection theory principles for detecting a Gaussian random signal in white Gaussian noise. We assume that the signal s[n] is a Gaussian process with variance σ_s² and mean µ_s and that the noise w[n] is zero mean white Gaussian, with variance σ². We have to note that actually the distribution of w[n] is not zero mean. However, we can convert it to zero mean by estimating the mean value of the noise samples, as presented in the following subsection. Consequently, our detection problem can be described as

H0 : x[n] = w[n], n = 0, 1, . . . , N − 1,

H1 : x[n] = s[n] + w[n], n = 0, 1, . . . , N − 1,

where w[n] ∼ N(0, σ²), s[n] ∼ N(µ_s, σ_s²) and s[n] and w[n] are independent. Hence, the signal can be discriminated from the noise based on its mean and covariance differences.

We therefore define the N × 1 random vector x, consisting of the random variables [x[0], x[1], . . . , x[N−1]]^T. The Neyman–Pearson lemma states that, in order to maximize the probability of signal detection P_D for a given probability of false alarm P_FA, we decide for H1 if the likelihood ratio L(x) is larger than a threshold γ:

L(\mathbf{x}) = \frac{p(\mathbf{x}; H_1)}{p(\mathbf{x}; H_0)} > γ, \qquad (14)

where p(x; H0), p(x; H1) are the multivariate probability density functions for the respective hypotheses.

From our modelling assumptions, x ∼ N(0, σ²I) under H0 and x ∼ N(µ_s 1, (σ_s² + σ²)I) under H1, where 0 and 1 denote the all-zero and all-one vectors, respectively. Thus, substituting the density functions in (14), manipulating the likelihood ratio and incorporating the non-data terms in the threshold, we have [49]:

T(\mathbf{x}) = N µ_s \cdot \frac{1}{N} \sum_{n=0}^{N-1} x[n] + \frac{σ_s^2}{2σ^2} \sum_{n=0}^{N-1} x[n]^2, \qquad (15)

which is the weighted sum of an averager, (1/N) \sum_{n=0}^{N-1} x[n], which attempts to discriminate between the two hypotheses on the basis of the sample mean, and an energy detector, \sum_{n=0}^{N-1} x^2[n], which attempts to discriminate on the basis of the variance. Hence, by applying these two detectors, we can detect visual speech by exploiting the attributes that a speaking person demonstrates.

The averager is used to detect a DC level in the presence of zero mean Gaussian noise. The detector compares the sample mean to a threshold. The value of the threshold γ′ is found by constraining P_FA. The probability of false alarm of the averager is given by


P_{FA} = \Pr\{T(\mathbf{x}) > γ'; H_0\} = Q\left( \frac{γ'}{\sqrt{σ^2/N}} \right), \qquad (16)

where Q is the right-tail probability of a Gaussian random variable. Hence, the threshold can be found from:

γ' = \sqrt{\frac{σ^2}{N}}\, Q^{-1}(P_{FA}), \qquad (17)

where Q^{-1} is the inverse right-tail probability.

The energy detector, T(\mathbf{x}) = \sum_{n=0}^{N-1} x^2[n], is used to detect a random Gaussian signal in zero mean Gaussian noise. The detector computes the energy of the data samples and compares it to a threshold. If the signal is present, the energy of the data is large. Again, the value of the threshold is found by constraining P_FA. The probability of false alarm can be found by noting that, under H0, T(x)/σ² is distributed according to a chi-squared distribution. The right-tail probability function of a chi-squared random variable is expressed as Q_{χ²_N}(x). Therefore, the probability of false alarm is:

P_{FA} = \Pr\{T(\mathbf{x}) > γ'; H_0\} = \Pr\left\{ \frac{T(\mathbf{x})}{σ^2} > \frac{γ'}{σ^2}; H_0 \right\} = Q_{χ^2_N}\left( \frac{γ'}{σ^2} \right).

Thus, the threshold is given by

γ' = σ^2\, Q_{χ^2_N}^{-1}(P_{FA}). \qquad (18)

Consequently, our aim to detect visual speech based on the increased values and large variance of x[n] can be accomplished by employing an averager and an energy detector. We apply the two detectors to a sliding window, consisting of N frames, which moves frame-by-frame, spanning the whole video sequence. The combined outcome of the detectors (for every window) is compared to their combined threshold and a decision on the presence of visual speech is obtained.
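A sketch of the resulting detector, using SciPy's inverse right-tail functions for (17) and (18); for simplicity the two statistics are thresholded separately and OR-ed rather than forming the combined statistic of (15), and the window length and P_FA values are illustrative, not taken from the chapter:

```python
import numpy as np
from scipy.stats import chi2, norm

def detect_visual_speech(x, mu_noise, sigma_noise, n_win=25, p_fa=1e-3):
    """Sliding-window averager + energy detector, as a sketch. The noise
    mean is subtracted so w[n] is zero mean, then every N-frame window is
    compared to the Neyman-Pearson thresholds."""
    z = np.asarray(x, dtype=float) - mu_noise
    var = sigma_noise ** 2
    thr_avg = np.sqrt(var / n_win) * norm.isf(p_fa)   # (17): Q^{-1}(P_FA)
    thr_eng = var * chi2.isf(p_fa, df=n_win)          # (18): chi-square inverse
    speech = np.zeros(len(z), dtype=bool)
    for i in range(len(z) - n_win + 1):
        w = z[i:i + n_win]
        if w.mean() > thr_avg or np.sum(w ** 2) > thr_eng:
            speech[i:i + n_win] = True
    return speech
```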

However, we have not completely resolved the problem yet, since in the aforementioned case the noise standard deviation, which is involved in threshold determination, and the noise mean, required to convert the noise into a zero mean process, are not known a priori.

Noise Estimation

In the preceding analysis we have assumed zero mean Gaussian noise and we have concluded that the noise standard deviation is a prerequisite for the computation of our threshold. In order to find the actual values of the noise statistics, we apply an estimation algorithm based on the detection theory principles we have presented.


The philosophy of the estimation algorithm focuses on efficiently distinguishing the signal-plus-noise samples from the noise only samples, and then on calculating the noise µ and σ from the noise samples. This is achieved iteratively, by applying the averager and the energy detector to our data sequence, each time with refined estimates of the noise statistics, until they converge to their final values. The better the distinction between the noise and the signal samples, the better the results that the noise statistics estimation should yield. This approach, referred to as an estimate and plug detector [49], suffers from the possibility that the estimation will be biased if a signal is present.

The algorithm first computes initial estimates of µ and σ, in order to apply the detectors. The initial estimates are computed from the smallest 10% of the data set values, assuming that these values belong to the noise samples. Thereafter, we apply the detectors to our data set, employing the noise characteristics we have computed. The detectors distinguish the noise only samples from the signal-plus-noise samples and new noise characteristics are computed. This process is repeated until the difference between two consecutive estimations of σ is smaller than 10⁻².
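A sketch of the estimate-and-plug iteration; a per-sample three-sigma rule stands in for the full windowed detectors, an assumption made here for brevity:

```python
import numpy as np

def estimate_noise(x, frac=0.10, tol=1e-2, max_iter=50):
    """Iterative noise estimation: initialize mu and sigma from the
    smallest 10% of the samples, classify samples as noise, re-estimate,
    and stop when sigma changes by less than 1e-2. A sketch."""
    x = np.sort(np.asarray(x, dtype=float))
    noise = x[: max(1, int(frac * len(x)))]
    mu, sigma = noise.mean(), noise.std()
    for _ in range(max_iter):
        noise = x[np.abs(x - mu) < 3.0 * sigma]  # samples explained by noise
        new_sigma = noise.std()
        if abs(new_sigma - sigma) < tol:
            break
        mu, sigma = noise.mean(), new_sigma
    return mu, sigma
```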

The stages of the noise statistics estimation algorithm for a video sequence are displayed in Fig. 10. It is obvious that the initial values of the noise statistics result in a modest estimation of the noise (Fig. 10b) and only a portion of the noise samples is identified. These noise samples, however, are used to obtain a better estimation of the noise characteristics. After two more iterations of the algorithm (Fig. 10c,d), where every time more noise samples are identified and better estimations of the noise characteristics are obtained, the noise only samples are efficiently distinguished. Hence, in the final step, an accurate estimation of the noise statistics is available.

Fig. 10. Noise statistics estimation steps. Dark values: signal and noise presence, bright values: only noise presence. (a) Data set. (b) Noise estimation: first iteration. (c) Noise estimation: second iteration. (d) Final noise estimation – signal detection

It should be noted here that the visual speech detection procedure outlined in this section involves certain assumptions, as well as small deviations from the statistical detection theory formulae.

5 3D Face Reconstruction and Facial Pose Estimation from Uncalibrated Video

The task of reconstructing an object in 3D space from its images (projections) is one of the most demanding in computer vision. In recent years, most attention was given to the calibrated reconstruction case (i.e. the case where the position of the camera relative to the object and the camera intrinsic parameters are known beforehand), whereas nowadays researchers try to tackle the uncalibrated reconstruction problem, where the input images are taken with a camera at random position and orientation with respect to the human face.

It is well known [50] that, utilizing the epipolar geometry, one can yield depth estimates for an object from just two of its images. Unfortunately, the obtained coordinates do not lie in the Euclidean space [12], which makes this representation not very useful. In order to upgrade the representation, extra information is required. This extra information can be obtained either from the camera position or from the camera intrinsic parameters. The latter can be calculated either from the use of special calibration patterns or from the images of our input set. The procedure of utilizing the images that we have in order to calculate the camera intrinsic parameters is called self-calibration, as opposed to calibration, where specific calibration patterns are used in order to calculate the camera calibration matrix.

There are numerous approaches to the uncalibrated 3D reconstruction problem in the literature, the most characteristic of which are the works of Faugeras [51], Beardsley et al. [52], Hartley [53] and Pollefeys, who wrote an excellent tutorial on the subject [12].

In this paragraph, the 3D reconstruction algorithm presented by Pollefeys in [12] is used to calculate the 3D coordinates of some salient feature points of the face, based on a small number of facial images where the feature points are manually marked. We have chosen this approach because of its flexibility, due to the fact that the input images can be taken with an off-the-shelf camera placed at random positions. Basically, such images can correspond to video frames of a face taken from different view angles, provided that the face neither changes expression nor speaks. The intrinsic camera parameters can be calculated from the input image set. Once the camera has been calibrated vs. the face coordinate system, it is very easy to estimate the face pose vs. the camera.


We further incorporate a generic face model (the Candide face model) [54] and deform it, using a finite element method (FEM), based on the point cloud obtained from the first step. On top of that, to further improve the model, one can reproject it back onto the initial images and fine-tune it manually. The resulting face model can be used, along with the corresponding texture, in biometric applications such as face recognition and face verification.

5.1 3D Reconstruction

The algorithm proposed by Pollefeys in [12] can be used to calculate the 3D coordinates of some salient features of the face. We will briefly explain the steps of the algorithm for the sake of completeness of this section. Readers interested in obtaining additional information can consult [12].

First, we have adopted the ideal pinhole/perspective camera model [11], where no radial distortion is present. In such a camera, the projection of an object point on an image plane is described by the following equation

m = P · M, (19)

where m = [x, y, 1]^T are the point coordinates on the image plane, P is the 3 × 4 projection matrix and M = [X, Y, Z, 1]^T are the object point coordinates in 3D space. We use homogeneous coordinates, where the = sign indicates an equality up to a non-zero scale factor.
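A sketch of (19) in code, with an assumed intrinsic matrix K used purely for illustration:

```python
import numpy as np

def project(P, M):
    """Pinhole projection m = P * M of (19): map a homogeneous 3D point
    through the 3 x 4 projection matrix and normalize out the scale."""
    m = P @ M                  # homogeneous image point, defined up to scale
    return m[:2] / m[2]        # (x, y) on the image plane

# illustrative projection matrix P = K [R | t] with assumed intrinsics
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
print(project(P, np.array([0.1, 0.0, 2.0, 1.0])))   # point 2 m in front
```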

A manual selection of some salient feature points of the face in the input images and the definition of their correspondences is then applied. The coordinates of these feature points over the input images constitute the input to the 3D reconstruction algorithm. It has to be noted that we always use some easily recognizable and distinct feature points of the face, such as the eye and mouth corners and the tip of the nose. Unfortunately, it is very difficult to define a big number of feature points on the human face, due to its lack of texture and of characteristic points that can be uniquely identified over a number of images.

Therefore, we calculate the fundamental matrix [50] based on the first two images of the set. These two images must be selected efficiently, so that they correspond to viewpoints that are as far apart as possible but at the same time have all the feature points visible on both of them. The overall performance of the algorithm relies heavily on the efficient selection of these first two frames.
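A sketch of this step; with only a handful of manually marked facial points, OpenCV's 8-point estimator is the realistic choice (it needs at least eight correspondences):

```python
import cv2
import numpy as np

# pts1, pts2: corresponding feature points (eye/mouth corners, nose tip, ...)
# marked in the two most widely separated frames, as N x 2 arrays.
def fundamental_from_correspondences(pts1, pts2):
    """Estimate the fundamental matrix from the manually marked
    correspondences of the first two views. A sketch."""
    F, mask = cv2.findFundamentalMat(np.float32(pts1), np.float32(pts2),
                                     cv2.FM_8POINT)
    return F
```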

After the calculation of the fundamental matrix, it is possible to obtain a reference frame which will eventually help us get an initial estimate of the depth for the selected feature points. Unfortunately, this representation does not lie in the metric space and, thus, additional procedures should be followed in order to upgrade it to a metric one. The rest of the images of the input set are incorporated in the algorithm and the projection matrices that describe the projection of the face in each image of the set are evaluated.


In the subsequent step, the algorithm performs an optimization which is based on all the images of the input set and, thus, refines the representation. This is called bundle adjustment [55] and it is the most computationally intensive part of the algorithm.

Finally, the algorithm uses a self-calibration technique in order to calculate the camera intrinsic parameters. These parameters are subsequently used to upgrade the representation to the metric space and yield the final cloud of 3D face points.

5.2 Generic Model Deformation

The incorporation of a generic face model, namely the Candide face model, into the reconstruction procedure is used. The Candide face model has been developed by Linköping University [54] and in its current version has 104 nodes, distributed all around the human face, and 184 triangles that connect those nodes, creating a wire frame. The nodes of the model correspond to characteristic points of the human face, e.g. nose tip, outline of the eyes, mouth, etc. The feature points selected on the facial images are described in Sect. 5.1 and should correspond to Candide nodes. A procedure for defining the correspondences between the 3D reconstruction of the selected feature points and the Candide model nodes was followed.

FEM Deformation

A mass spring finite element method was employed to deform the generic Candide model. The deformation process takes as input a list of pivotal points (the 3D reconstructed points from the first part of the algorithm), the Candide model and a list which contains the correspondences between the pivotal points and the Candide nodes, and produces a deformed model.

The FEM deformation can be outlined as follows: at first, the Candide model undergoes global rotation, translation and scaling, so that it is roughly aligned with the cloud of 3D points. In order to determine the scale factor, the mean distances between the two corners of the eyes and the two corners of the mouth were evaluated both in the point cloud and in the Candide model, and their ratio was used as the scale factor. Then the model was translated so that the center of mass of the point cloud coincides with the center of mass of the corresponding model nodes.

Furthermore, the Candide model has to be appropriately rotated. To achieve this, a triangle whose vertices are the outer tips of both eyes and the tip of the nose was defined. The same triangle was defined for the corresponding nodes of the Candide model, and the model was rotated so that the outwards pointing normal vectors of the two triangles are aligned. The deformation process moves the corresponding nodes of the Candide model so that they coincide with the points of the cloud and deforms the rest of the nodes. As is obvious from the latter, the pivotal points must span the entire face, otherwise the deformation process will produce poor results.
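A sketch of the global scale-and-translate part of this alignment; the arrays are assumed to hold corresponding points in the same order, and the normal-based rotation step is omitted:

```python
import numpy as np

def rigid_align(model_pts, cloud_pts, eye_pair, mouth_pair):
    """Global scaling and translation of the Candide nodes toward the
    reconstructed cloud, as a sketch. eye_pair / mouth_pair are index
    pairs of the eye and mouth corners used to derive the scale."""
    def span(pts, pair):
        return np.linalg.norm(pts[pair[0]] - pts[pair[1]])

    # ratio of mean corner distances gives the global scale factor
    s = (span(cloud_pts, eye_pair) + span(cloud_pts, mouth_pair)) / \
        (span(model_pts, eye_pair) + span(model_pts, mouth_pair))
    out = model_pts * s
    # translate so the centers of mass coincide
    out += cloud_pts.mean(axis=0) - out.mean(axis=0)
    return out
```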


Manual Refinement

After the application of the deformation, we obtain a model that fits the individual's face depicted in the input set of images. Unfortunately, due to limitations of the 3D reconstruction algorithm and the deformation process, and to errors in the selection of the feature point coordinates, the output model may not be ideal, in the sense that some nodes may not have the correct position in 3D space. Therefore, a manual refinement procedure is adopted.

According to this procedure, we reproject the deformed face model in every image of the input set and manually change the location of certain model nodes in the 2D domain. In order to return to the 3D domain from the manually refined projections, a triangulation process is used [12]. This was facilitated by the fact that the projection matrices for each frame were available from the 3D reconstruction algorithm. In order to be able to use the triangulation method to estimate the 3D coordinates of a model's node, we must specify manually the new positions of the node in two frames. By doing so, we can yield new, improved coordinates in the 3D space. When the manual adjustment of the selected nodes is finished, the deformation process is applied once again, but this time with an extended set of pivotal points: the initial cloud of points produced from the 3D reconstruction algorithm, along with the additional 3D coordinates of the points that have been manually refined. The effect of the manual refinement procedure can be assessed by viewing the projection of the deformed model onto an image of the input set prior to and after the manual refinement. It is evident that with the manual refinement the generic model can fit the individual's face more efficiently.

6 Face Clustering

Face clustering is an important application for semantics extraction from video and can be used in a multitude of video analysis applications, like determining the number of actors in a video or dialog detection. Until now, some interesting algorithms have been proposed in [15, 16, 56], but most of them are based on calibrated face images from news or face recognition databases, like [16]. In [13] we have proposed a new method for face clustering which uses the mutual information as a criterion and also makes use of a larger, thus more informative, feature vector associated with each face image.

6.1 Mutual Information for Face Clustering

Mutual information is defined as the information shared between two distributions X and Y. Let us define their joint entropy as:

H(X, Y) = -\sum_{x, y} p(x, y) \log(p(x, y)), \qquad (20)


where p(x, y) is the normalized (summing to one) probability density function of the common information of distributions X and Y. In the same way, we define Shannon's entropy for X and Y as:

H(X) = -\sum_x p(x) \log(p(x)), \qquad (21)

H(Y) = -\sum_y p(y) \log(p(y)). \qquad (22)

Then we can define the mutual information as:

I(X;Y ) = H(X) + H(Y ) − H(X,Y ), (23)

or equivalently:

I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \qquad (24)

I(X; Y) is a quantity that measures the mutual dependency of two random variables. If we use a logarithm with base 2, then it is measured in bits. This quantity needs to be somehow normalized in order to create a uniform metric between different images and can therefore be used as a similarity measure. For this reason, we use the normalized MI (NMI), which is defined as the quotient of the sum of the entropies of X and Y by their joint entropy:

NMI(X; Y) = \frac{H(X) + H(Y)}{H(X, Y)}. \qquad (25)

It is also useful to notice that:

NMI(X; Y) = \frac{H(X) + H(Y)}{H(X, Y)}, \qquad (26)

NMI(Y; X) = \frac{H(X) + H(Y)}{H(Y, X)}. \qquad (27)

As we know from (20):

H(X, Y) = H(Y, X). \qquad (28)

Thus,

NMI(X; Y) = NMI(Y; X). \qquad (29)

A detailed explanation of the mutual information normalization can be found in [57].

In order to calculate the joint entropy between two images, we construct a 2D histogram with 256 bins per dimension which takes into account the relative positions of intensities, so that similarity occurs between two images when the same intensities are located in the same spatial locations. More formally, the 2D joint histogram is calculated as follows: let A and B be the first and the second image, respectively, of size N1 × N2; then:

Hist(i, j) = |{(k, l) ∈ N1 × N2 | A(k, l) = i and B(k, l) = j}|, \quad i, j ∈ [0, 255], \qquad (30)

where | · | denotes the cardinality of a set.

By defining the joint histogram that way, we have to admit that, in order to calculate it, the images have to be of the same size. This means that one has to resize one image to the other's dimensions. In our approach, and in order to equalize big scaling interpolation issues, we define a mean bounding box, which is calculated from all bounding boxes that the face detector provides to us. This approach shows better results than scaling every pair of images towards the bigger or the smaller of them. So every image is scaled towards this mean bounding box before the mutual information calculation.

Another issue is anisotropic scaling. Once we have put the detector's outputs at the same scale, we calculate the NMI for differently scaled versions of the target face image. We start varying the bounding box's width and height from 80 to 120% of the initial mean bounding box, with a step of 5%. This way, we eliminate scaling problems due to detector errors. In Fig. 11 one can see two images which show the aforementioned case. Finally, we take the maximum of the calculated NMIs between the two images for the different scales.
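A sketch of the NMI computation and the scale sweep; resizing the rescaled crop back to the common mean-bounding-box grid is an assumption made here so that (30) stays defined on equally sized images:

```python
import cv2
import numpy as np

def nmi(a, b, bins=256):
    """Normalized mutual information (25) of two equally sized grayscale
    images, using the point-to-point 2D joint histogram of (30)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    h = lambda q: -np.sum(q[q > 0] * np.log2(q[q > 0]))   # Shannon entropy
    return (h(px) + h(py)) / h(p)

def nmi_multiscale(a, b, box_w, box_h):
    """Maximum NMI over anisotropic rescalings of the second face, 80-120%
    of the mean bounding box in 5% steps, as described above. A sketch."""
    ra = cv2.resize(a, (box_w, box_h))
    best = 0.0
    for fw in np.arange(0.80, 1.21, 0.05):
        for fh in np.arange(0.80, 1.21, 0.05):
            rb = cv2.resize(b, (int(box_w * fw), int(box_h * fh)))
            best = max(best, nmi(ra, cv2.resize(rb, (box_w, box_h))))
    return best
```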

As mentioned before, the movie context presents several difficulties for content information extraction. Mutual information overcomes most of these problems. By using the scale variation within the detector results and the point-to-point approach of the joint entropy, more robust results can be achieved in a very complicated task. In [14] the problem is tackled based on a preprocessing of the image.

Fig. 11. In this image one can see that the images are of different scales, but the faces are practically of the same size


6.2 Mutual Information Vectors

We create a vector of MIs for every image. The dimension of that vector is equal to the size of the face detection results' data set. For every face image in the results set, we calculate the NMI between this image and every other, and thereby create a vector v. All those vectors result in an M × M matrix (where M is the cardinality of the set of all detections from a video sequence), where every row i of that matrix contains the NMIs of the ith detection with all other images:

$$A(i,j) = NMI(\mathrm{FaceImage}_i, \mathrm{FaceImage}_j). \quad (31)$$

It is obvious that the elements of the diagonal have value one, which is the normalized mutual information of a face image with itself, and also that the matrix is symmetric w.r.t. the main diagonal. The symmetry of the matrix is a direct effect of the MI symmetry shown in (29). These properties are very helpful because they drastically reduce the time complexity of the algorithm: by using them, the number of computations is reduced by a multiplicative factor of 0.5 and an additive factor of −M. In Fig. 12 one can see the image of a similarity matrix A. In this figure, a test of consecutive appearances of two different actors is shown. One has to notice the square regions that appear in that image, from which we can understand that the same persons appear. The thin lines that appear are in most cases false detector results, which are very different from the face pattern.

Fig. 12. Darker regions belong to the first actor and lighter ones to the second actor. The video sequence has four consecutive shots in the order FA-FA-SA-SA, where FA and SA denote the first and second actor, respectively
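Exploiting the unit diagonal and the symmetry of (29), only the upper triangle of A needs to be computed. A sketch, assuming the hypothetical normalized_mutual_information helper above and face images already scaled to the mean bounding box:

```python
import numpy as np

def similarity_matrix(face_images):
    """Build the M x M NMI matrix A of eq. (31).

    Only the upper triangle is evaluated; the diagonal is 1 and
    A(j, i) = A(i, j) by eq. (29), which roughly halves the number
    of NMI computations (the factor 0.5 and the -M term above).
    """
    m = len(face_images)
    A = np.eye(m)                    # NMI of an image with itself is 1
    for i in range(m):
        for j in range(i + 1, m):
            nmi = normalized_mutual_information(face_images[i],
                                                face_images[j])
            A[i, j] = A[j, i] = nmi  # mirror into the lower triangle
    return A
```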

6.3 Clustering Process

In order to cluster the similarity matrix we use the fuzzy c-means (FCM) algorithm. It has been shown that, in situations where we have a light mixture of class elements, this method performs better than the simple k-means algorithm.

In order to use this algorithm, we regard every row of the aforementioned matrix as a different vector in an M-dimensional L2-normed vector space over R. In Figs. 13 and 14 one can see how those vectors are formed for two examples.

Therefore, we use the Euclidean distance to calculate distances between the vectors,

$$dist(\mathbf{v}_i, \mathbf{v}_j) = \sqrt{\sum_{k=1}^{M} (v_{ik} - v_{jk})^2}, \quad (32)$$

and by those means to calculate a predefined number of cluster centers. A detailed implementation of the FCM algorithm can be found in [58].
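A bare-bones FCM iteration on the rows of the similarity matrix could look like the following sketch; the fuzzifier m = 2, the tolerance and the random initialization are our own assumptions (see [58] for the full algorithm):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=None):
    """Cluster the rows of X (here: the rows of the NMI matrix A)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                     # memberships sum to 1 per sample
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Euclidean distances of eq. (32) between every row and center.
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        d = np.fmax(d, 1e-10)              # guard against division by zero
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1)),
                             axis=1)
        if np.linalg.norm(U_new - U) < tol:
            U = U_new
            break
        U = U_new
    return centers, U.argmax(axis=0)       # cluster centers and hard labels
```

Manual selection of the initial centers, discussed below, would simply replace the random initialization of U.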

Initialization plays a significant role in FCM. In order to provide better results, the initial centers can be manually selected in such a way that faces corresponding to different actors constitute different initial centers. A random selection of initial centers varies the results by a factor of 0.5% of false classification.

Fig. 13. Two vectors which belong to different clusters. The peaks at 128 and 622 correspond to the mutual information of the images with themselves

Fig. 14. Two vectors which belong to the same cluster. The peaks at 120 and 660 correspond to the mutual information of the images with themselves

7 Face Recognition

Face recognition has attracted the attention of researchers for more than two decades and is among the most popular research areas in the field of computer vision and pattern recognition. Once we have performed face clustering over the face appearances in a movie, we may wish to assign person ids (e.g. actor names) to each face cluster. If we have a database containing images of actors, it can be used to train a face recognition algorithm. Then, when we present a face cluster (or one or some of its images) to this algorithm, it can return the most probable actor id. There is extensive literature on face recognition [19]. Here we try to summarize some of the most frequently used methods.

The most popular among the techniques used for frontal face recognition are the subspace methods. Subspace methods project the original high-dimensional image space onto a low-dimensional one. The classification is usually performed according to a simple distance measure in the final multi-dimensional space. Two of the most well-studied subspace methods for face recognition are the eigenfaces [59] and the fisherfaces [60]. The main limitation of subspace methods is that they require perfect alignment of the face images in order to function well.


Another popular class of techniques used for frontal face recognition is elastic graph matching. Elastic graph matching is a simplified implementation of the dynamic link architecture (DLA) [61]. DLA is a general object recognition technique that represents an object by projecting its image onto a rectangular elastic grid, where a Gabor wavelet bank response is measured at each node [62]. A variant of elastic graph matching based on multiscale dilation-erosion, the so-called morphological elastic graph matching (MEGM), was proposed and tested for frontal face recognition [63].

7.1 Subspace Methods

Let M be the number of samples in the image database $U = \{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_M\}$, where $\mathbf{u}_i \in \mathbb{R}^n$ is a database image. A linear transformation of the original n-dimensional space onto an m-dimensional subspace ($m \ll n$) is a matrix $\mathbf{W}^T \in \mathbb{R}^{m \times n}$. The new feature vectors $\mathbf{y}_k \in \mathbb{R}^m$ are given by

$$\mathbf{y}_k = \mathbf{W}^T(\mathbf{u}_k - \bar{\mathbf{u}}), \quad k \in \{1, 2, \ldots, M\}, \quad (33)$$

where $\bar{\mathbf{u}} \in \mathbb{R}^n$ is the mean image of all samples.

One of the oldest and most well-studied methods for low-dimensional representation of faces is the eigenface approach [64]. This representation was used in [59] for face recognition. The idea behind the eigenface representation is to choose a dimensionality reduction linear transformation that maximizes the scatter of all projected samples. The matrix that captures the scatter of multi-dimensional data is the total scatter matrix $\mathbf{S}_T \in \mathbb{R}^{n \times n}$, defined as:

$$\mathbf{S}_T = \sum_{k=1}^{M} (\mathbf{u}_k - \bar{\mathbf{u}})(\mathbf{u}_k - \bar{\mathbf{u}})^T. \quad (34)$$

The transformation matrix $\mathbf{W}_e^T$ is chosen to be the one that maximizes the determinant of the total scatter matrix of the projected samples, i.e.

$$\mathbf{W}_e = \arg\max_{\mathbf{W}} |\mathbf{W}^T \mathbf{S}_T \mathbf{W}| = [\mathbf{w}_1 \, \mathbf{w}_2 \, \ldots \, \mathbf{w}_m], \quad (35)$$

where $\mathbf{w}_i \in \mathbb{R}^n$ is the eigenvector that corresponds to the ith largest eigenvalue of $\mathbf{S}_T$. The matrix $\mathbf{S}_T$ is obviously a very high-dimensional matrix, so the straightforward calculation of its eigenvectors is not feasible. Fortunately, due to the fact that its rank is less than or equal to M − 1, there are computationally inexpensive ways to compute them [59].
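This trick, diagonalizing the M × M Gram matrix instead of the n × n scatter matrix, can be sketched as follows (our own illustration, not the code of [59]):

```python
import numpy as np

def eigenfaces(U, m):
    """Compute m eigenfaces from an (M, n) matrix of row-wise face images.

    Instead of the n x n total scatter matrix S_T of eq. (34), the much
    smaller M x M matrix D D^T is diagonalized; if D D^T v = lam v, then
    D^T v is an eigenvector of S_T = D^T D with the same eigenvalue.
    """
    mean = U.mean(axis=0)
    D = U - mean                             # centered data, shape (M, n)
    eigvals, eigvecs = np.linalg.eigh(D @ D.T)
    order = np.argsort(eigvals)[::-1][:m]    # m largest eigenvalues
    W = D.T @ eigvecs[:, order]              # back-project to image space
    W /= np.linalg.norm(W, axis=0)           # unit-norm eigenfaces
    return W, mean

# Projection of eq. (33): Y = (U - mean) @ W, one row y_k per image.
```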

The transformed feature vectors $\mathbf{y}_k$, produced by this dimensionality reduction method, are called most expressive features, because they best express the population [64, 65]. The main drawback of the eigenfaces approach is that it does not deal directly with discrimination between classes. In order to use the information about how the facial data are separated into different classes, Fisher's linear discriminant (FLD) is used to produce the linear transformation. Assume that each image $\mathbf{u}_i$ in the image database U belongs to one of the C person classes $\{U_1, U_2, \ldots, U_C\}$. Let the between-class scatter matrix be defined as:

$$\mathbf{S}_B = \sum_{i=1}^{C} N_i (\bar{\mathbf{u}}_i - \bar{\mathbf{u}})(\bar{\mathbf{u}}_i - \bar{\mathbf{u}})^T \quad (36)$$

and the within-class scatter matrix be defined as:

$$\mathbf{S}_W = \sum_{i=1}^{C} \sum_{\mathbf{u}_k \in U_i} (\mathbf{u}_k - \bar{\mathbf{u}}_i)(\mathbf{u}_k - \bar{\mathbf{u}}_i)^T, \quad (37)$$

where $\bar{\mathbf{u}}_i$ is the mean of class $U_i$ and $N_i$ is the cardinality of class $U_i$. The goal of the linear transformation $\mathbf{W}_f^T$ is to maximize the between-class scatter while minimizing the within-class scatter, i.e.

$$\mathbf{W}_f = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_B \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_W \mathbf{W}|} = [\mathbf{w}_1 \, \mathbf{w}_2 \, \ldots \, \mathbf{w}_m]. \quad (38)$$

The advantage of using ratio (38) is that, if $\mathbf{S}_W$ is not singular, then (38) is maximized when the column vectors of the projection matrix $\mathbf{W}_f$ are the eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$. For a face database with C classes and M total images, the rank of $\mathbf{S}_W$ is at most M − C and the rank of $\mathbf{S}_B$ is at most C − 1. Thus, there are at most C − 1 eigenvectors that correspond to non-zero eigenvalues of $\mathbf{S}_W^{-1}\mathbf{S}_B$. To cope with the fact that $\mathbf{S}_W$ has rank $(M-C) \ll n$, fisherfaces were proposed in [60]. Fisherfaces is a two-step dimensionality reduction method. First, the feature dimensionality is reduced to M − C dimensions using the eigenfaces approach, in order for $\mathbf{S}_W$ to become non-singular. After that, the dimension of the new features is reduced further using criterion (38). The total dimensionality reduction transformation to $l \le C-1$ dimensions is

$$\mathbf{W}_t^T = \mathbf{W}_f^T \mathbf{W}_e^T \in \mathbb{R}^{l \times n}, \quad (39)$$

where $\mathbf{W}_e^T$ and $\mathbf{W}_f^T$ are the first and the second dimensionality reduction transformations, respectively. In [66], it was shown that fisherfaces outperform eigenfaces only when large and representative training data sets are available. The main problem of subspace methods is that they require the facial images to be perfectly aligned [66]. That is, all facial images should be aligned so that all the fiducial points (e.g. eyes, nose, mouth, etc.) are represented at the same position inside the feature vector. For this purpose, the facial images are very often aligned manually and, moreover, anisotropically scaled. Perfect automatic alignment is, in general, a difficult task.

7.2 Morphological Elastic Graph Matching

A technique for face recognition that does not require perfect alignment in order to perform well is the elastic graph matching algorithm [62, 63, 67].


Recently, it was shown that morphological elastic graph matching combined with support vector machines achieves very good performance for frontal face authentication [68]. A more detailed description of elastic graph matching follows.

The facial image region is analyzed and a set of local descriptors, extracted at the nodes of a sparse grid, is created. There are various types of grids proposed in the literature [62, 63, 67]. The simplest is an evenly distributed grid over a rectangular image region; this type of grid was used in the experiments presented in this chapter. In all cases, the first step of the elastic graph matching algorithm is to build an information pyramid in the reference face image. In morphological elastic graph matching this information pyramid is built using multiscale morphological dilation-erosions [69]. Given an image $f(\mathbf{x}): D \subseteq \mathbb{Z}^2 \rightarrow \mathbb{R}$ and a structuring function $g(\mathbf{x}): G \subseteq \mathbb{Z}^2 \rightarrow \mathbb{R}$, the dilation of the image $f(\mathbf{x})$ by $g(\mathbf{x})$ is denoted by $(f \oplus g)(\mathbf{x})$. Its complementary operation, the erosion, is denoted by $(f \ominus g)(\mathbf{x})$ [70]. The multiscale dilation-erosion pyramid of the image $f(\mathbf{x})$ by $g_\sigma(\mathbf{x})$ is defined by [69]:

$$(f \star g_\sigma)(\mathbf{x}) = \begin{cases} (f \oplus g_\sigma)(\mathbf{x}) & \text{if } \sigma > 0, \\ f(\mathbf{x}) & \text{if } \sigma = 0, \\ (f \ominus g_{|\sigma|})(\mathbf{x}) & \text{if } \sigma < 0, \end{cases} \quad (40)$$

where σ denotes the scale parameter of the structuring function. Such morphological operations can highlight and capture important information for key facial features such as eyebrows, eyes, nose tip, nostrils, lips, face contour, etc., but can be affected by different illumination conditions and noise [63]. To compensate for these conditions, the normalized multiscale dilation-erosion is proposed for facial image analysis. It is well known that different illumination conditions affect the facial region in a non-uniform manner. However, it can safely be assumed that the illumination changes are locally uniform inside the area of the structuring element used for multiscale analysis. The proposed analysis is:

$$(f \star g_\sigma)_n(\mathbf{x}) = \begin{cases} (f \oplus g_\sigma)(\mathbf{x}) - \mu_{\mathbf{z} \in G_\sigma}(f(\mathbf{x}-\mathbf{z})) & \text{if } \sigma > 0, \\ f(\mathbf{x}) & \text{if } \sigma = 0, \\ (f \ominus g_{|\sigma|})(\mathbf{x}) - \mu_{\mathbf{z} \in G_{|\sigma|}}(f(\mathbf{x}+\mathbf{z})) & \text{if } \sigma < 0, \end{cases} \quad (41)$$

where $\mu_{\mathbf{z} \in G_\sigma}(f(\mathbf{x}-\mathbf{z}))$ and $\mu_{\mathbf{z} \in G_{|\sigma|}}(f(\mathbf{x}+\mathbf{z}))$ are the mean values of the image $f(\mathbf{x}-\mathbf{z})$, $\mathbf{x}-\mathbf{z} \in D$, and $f(\mathbf{x}+\mathbf{z})$, $\mathbf{x}+\mathbf{z} \in D$, inside the support area of the structuring element $G_\sigma = \{\mathbf{z} \in G : \|\mathbf{z}\| < \sigma\}$, respectively. The structuring element used in all experiments was cylindrical, for computational complexity reasons [63, 70]. The outputs of these morphological operations form the feature vector $\mathbf{j}(\mathbf{x})$ at the grid node located at image coordinates $\mathbf{x}$. Figure 15 depicts the output of the normalized dilation-erosion for various scales: the first nine pictures, starting from the upper left corner, are eroded images and the remaining nine are dilated images. The new dynamic link architecture will be denoted as normalized morphological elastic graph matching (NMEGM) in the rest of the chapter.


Fig. 15. Output of normalized multi-scale dilation-erosion for nine scales

The next step of elastic graph matching is to translate and deform the reference graph on the test image so that a cost function is minimized. Let the superscripts t and r denote a test and a reference person (or grid), respectively. The $L_2$ norm is used as a similarity measure between the feature vectors at the lth grid node of the reference and the test graph, i.e. $C_u(\mathbf{j}(\mathbf{x}_l^t), \mathbf{j}(\mathbf{x}_l^r)) = \|\mathbf{j}(\mathbf{x}_l^t) - \mathbf{j}(\mathbf{x}_l^r)\|$. The objective is to find a set of vertices $\{\mathbf{x}_l^t, l \in V\}$ that minimizes the cost function:

$$D(t,r) = \sum_{l \in V} C_u(\mathbf{j}(\mathbf{x}_l^t), \mathbf{j}(\mathbf{x}_l^r)) \quad \text{subject to} \quad \mathbf{x}_l^t = \mathbf{x}_l^r + \mathbf{s} + \mathbf{q}_l, \; \|\mathbf{q}_l\| \le q_{max}, \quad (42)$$

where $\mathbf{s}$ is a global translation of the graph and $\mathbf{q}_l$ denotes a local perturbation of the grid nodes. The choice of $q_{max}$ controls the rigidity/plasticity of the graph. The cost function given by (42) defines the similarity measure between two persons in morphological elastic graph matching.
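A brute-force minimization of (42), first over a coarse grid of global translations s and then over bounded local perturbations q_l, could be sketched as follows; features stands for any callable returning a local descriptor of the test image (e.g. the vector j(x) above), and the search ranges are our own assumptions:

```python
import numpy as np
from itertools import product

def match_graph(ref_jets, ref_nodes, test_img, features,
                trans_range=20, step=4, q_max=3):
    """Minimize the elastic graph matching cost D(t, r) of eq. (42)."""
    # Rigid stage: exhaustive search for the best global translation s.
    best_s, best_cost = np.zeros(2, int), np.inf
    for s in product(range(-trans_range, trans_range + 1, step), repeat=2):
        cost = sum(np.linalg.norm(features(test_img, n + s) - j)
                   for n, j in zip(ref_nodes, ref_jets))
        if cost < best_cost:
            best_s, best_cost = np.array(s), cost
    # Elastic stage: perturb each node within ||q_l|| <= q_max, which
    # controls the rigidity/plasticity of the graph.
    offsets = [q for q in product(range(-q_max, q_max + 1), repeat=2)
               if np.hypot(*q) <= q_max]
    matched = []
    for n, j in zip(ref_nodes + best_s, ref_jets):
        costs = [np.linalg.norm(features(test_img, n + q) - j)
                 for q in offsets]
        matched.append(n + np.array(offsets[int(np.argmin(costs))]))
    total = sum(np.linalg.norm(features(test_img, m) - j)
                for m, j in zip(matched, ref_jets))
    return np.array(matched), total    # deformed grid and D(t, r)
```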

8 Facial Expression Analysis

Facial expression analysis is very important, since movies employ facial expressions to convey the message of the script and narrative; accordingly, actors are trained to express the emotions of their roles well. A survey of the research on facial expression analysis can be found in [71]. The reported approaches can be divided into two main directions, feature-based and template-based, according to the method they use for facial information extraction. Feature-based methods use texture or geometrical information as features for expression information extraction. Template-based methods use 3D or 2D head and facial models as templates for expression information extraction.

8.1 Feature Based Approaches

In [72], facial feature detection and tracking is based on active infrared illumination, in order to provide visual information under variable lighting and head motion. The classification is performed using a dynamic Bayesian network (DBN). A method for static and dynamic segmentation and classification of facial expressions is proposed in [73]. For the static case, a DBN organized in a tree structure is used. For the dynamic approach, multi-level hidden Markov model (HMM) classifiers are employed. The system proposed in [74] automatically detects frontal faces in the video stream and classifies them in real time into seven classes: neutral, anger, disgust, fear, joy, sadness and surprise. An expression recognizer receives image regions produced by a face detector; a Gabor representation of the facial image region is then formed and processed by a bank of support vector machine (SVM) classifiers. Gabor filters are also used in [75] for facial expression recognition. Facial expression images are coded using a multi-orientation, multi-resolution set of Gabor filters which are topographically ordered and approximately aligned with the face. The similarity space derived from this facial image representation is compared with one derived from semantic ratings of the images by human observers. The classification is performed by comparing the produced similarity spaces. In [76], the images are first transformed using a multiscale, multi-orientation set of Gabor filters. The grid is then registered with the facial image region, either automatically using elastic graph matching [77], or by manually clicking on fiducial face points. The amplitudes of the complex-valued Gabor transform coefficients are sampled on the grid and combined into a single vector, called a labelled graph vector (LGV). The classification is performed using the distance of the LGV from each facial expression cluster center. Gabor features are also used for facial feature extraction, given a set of fiducial points, in [78]. The classification is performed using Bayes, SVM, Adaboost and linear programming classifiers.

A neural network (NN) is employed to perform facial expression recognition in [79]. The features used can be either the geometric positions of a set of fiducial points on a face or a set of multi-scale and multi-orientation Gabor wavelet coefficients extracted from the facial image at the fiducial points. The recognition is performed by a two-layer perceptron NN. A convolutional NN was used in [80]. The system developed is robust to face location changes and scale variations. Feature extraction and facial expression classification were performed using neuron groups, having a feature map as input and properly adjusting the weights of the neurons for correct classification. A method that performs facial expression recognition is presented in [81]; face detection is performed using a convolutional NN, while the classification is performed using a rule-based algorithm. Optical flow is used for facial region tracking and facial feature extraction in [82]. The facial features are fed into a radial basis function (RBF) NN architecture that performs classification. The discrete cosine transform (DCT) is used in [83] over the entire face image as a feature detector. The classification is performed using a one-hidden-layer feedforward NN. A feature selection process based on principal component analysis (PCA) is proposed in [84]; a decision tree-based classifier that uses successive projections onto more precise representation subspaces is employed. The image pixels are used in [85] as input to PCA and linear discriminant analysis (LDA) to reduce the original feature space dimensionality. The resulting features are lexicographically ordered and concatenated into a feature vector, which is used for classification according to the nearest neighbor rule. The approach followed in [86] uses structured and geometrical features of a user-sketched expression model. The classification is performed using linear edge mapping (LEM). Expressive face modelling using an active appearance model (AAM) is employed in [87]. The facial model is constructed based on either three PCAs or one. The classification is performed in the space of the AAM.

Model-Template Based Approaches

Two methods for facial expression recognition are proposed in [88], based on a 3D model enriched with muscles and skin. The first method estimates facial muscle actuations from optical flow data. The classification is performed according to the similarity with the classical patterns of muscle actuation. The second method uses the classical patterns of muscle actuation to generate the classical pattern of motion energy associated with each facial expression, thus resulting in a set of simple facial expression "detectors", each of which looks for the particular space-time pattern of motion energy associated with one facial expression.

A face model, defined as a point-based model composed of two 2D facial views (frontal and profile), is used in [89]. The deformation of facial features is extracted from both the frontal and profile views and its correspondence with the facial action units (FAUs) is established. The facial expression recognition is performed based on a set of decision rules. A 3D facial model is proposed in [90], to which anatomically-based muscles are added. A Kalman filter, together with optical flow computation, is used to extract muscle actions in order to form a new model of facial action, the so-called facial action coding system (FACS). A 3D facial model used for facial expression recognition is also proposed in [91]. First, the head pose is estimated in a facial video sequence. Subsequently, face images are warped onto a face model with canonical face geometry, rotated to frontal ones and projected back onto the image plane. Pixel brightness is linearly rescaled and the resulting images are convolved with a bank of Gabor kernels. The Gabor representations are then channelled to a bank of SVMs to perform facial expression recognition.

8.2 FAU Based Facial Expression Recognition

For FAU detection, the approaches followed were also feature-based. Many techniques for FAU recognition are proposed in [92]; PCA, independent component analysis (ICA), local feature analysis (LFA), LDA, Gabor wavelet representations and local principal components (LPC) are investigated more thoroughly.


A group of FAUs is detected in [93]. The facial feature contours are adjusted and both permanent and transient facial feature changes are automatically detected and tracked in the image sequence. The facial parameters are then fed into two NN classifiers, one for the upper face and one for the lower face. FAU detection is also investigated in [94]. Facial expression information extraction is performed either by using optical flow or by facial feature point tracking. The extracted information is used as input to an HMM system that outputs upper face expressions at the forehead and brow regions.

HMMs are also used in [95]. Dense optical flow extraction is used to track flow across the entire face image, after the input image sequence is aligned. Facial feature tracking of a small set of pre-selected features is performed, and high-gradient component detection uses a combination of horizontal, vertical and diagonal line and edge feature detectors to detect and track changes in standard and transient facial lines and furrows. The results from the above system are fed to an HMM system to perform facial expression recognition.

An NN is employed for FAU detection in [96]. The geometric facial features (including mouth, eyes, brows and cheeks) are extracted using multi-state facial component models. After extraction, these features are represented parametrically. The regional facial appearance patterns are captured using a set of multi-scale and multi-orientation Gabor wavelet filters at specific locations. The classification is performed using a back-propagation NN.

Two novel fast feature-based methods are proposed in [97] that use SVM classifiers for recognizing dynamic facial expressions, either directly or by first detecting the FAUs. SVMs were chosen due to their good performance in various practical pattern recognition applications [98–100] and their solid theoretical foundations. A novel class of SVMs, which incorporates statistical information about the classes under examination, is also proposed in [101]. The classification in both cases (facial expression recognition using multi-class SVMs or based on FAU detection) is performed using only geometrical information, without taking any facial texture information into consideration.

Let us consider an image sequence containing a face whose facial expression evolves from a neutral state (first frame) to a fully expressed state (last frame). The proposed method is based on mapping and tracking the facial model Candide onto the video frames. The proposed facial expression recognition system is semi-automatic, in the sense that the user has to manually place some of the Candide grid nodes [54] on face landmarks depicted in the first frame of the image sequence under examination. The tracking system allows the grid to follow the evolution of the facial expression over time until it reaches its highest intensity, producing at the same time the deformed Candide grid at each video frame. A subset of the Candide grid nodes is chosen that predominantly contribute to the formation of the facial deformations described by FACS. The geometrical displacement of these nodes, defined as the difference of each node's coordinates between the first and the last frame of the facial image sequence, is used as input to an SVM classifier (either the classical or the proposed one). When facial expression recognition using multi-class SVMs is performed, the SVM system consists of a six-class SVM classifier, each class representing one of the six basic facial expressions (anger, disgust, fear, happiness, sadness and surprise). When FAU-based facial expression recognition is performed, 8 or 17 FAUs are chosen that correspond to the newly, empirically derived facial expression rules and to the rules proposed in [89]. Thus, the recognition system used is composed of a bank of two-class SVMs, each one detecting the presence or absence of a particular FAU that corresponds to a specific facial expression. The experiments were performed using the Cohn–Kanade database, and the results show that the proposed novel facial expression recognition system can achieve a recognition accuracy of 99.7% or 95.1% when recognizing the six basic facial expressions on the Cohn–Kanade database, by the multi-class SVM approach or by the FAU detection based approach, respectively.

8.3 Fusion of Geometrical and Texture Information for FEE

We combine geometrical and texture information in order to retrieve the facial expression of a subject, making use of the well-known Candide grid [54]. The block diagram of the proposed method is shown in Fig. 16. Let U be a database of facial videos. The facial expression depicted in each video sequence is dynamic, evolving through time as the video progresses. We take into consideration the frame that depicts the facial expression at its greatest intensity, i.e. the last frame, to create a facial image database Y. Each image $\mathbf{y} \in Y$ belongs to one of the six basic facial expression classes $\{Y_1, Y_2, \ldots, Y_6\}$ with $Y = \bigcup_{r=1}^{6} Y_r$. Each image $\mathbf{y} \in \mathbb{R}^{K \times G}$ of dimension $F = K \times G$ is scanned row-wise to form a vector $\mathbf{x} \in \mathbb{R}^F$ that will be used in our algorithm. The algorithm used for texture extraction was the DNMF algorithm, an extension of the non-negative matrix factorization (NMF) algorithm. The NMF algorithm is a matrix decomposition algorithm that allows only additive combinations of non-negative components; DNMF was the result of an attempt to introduce discriminant information into the NMF decomposition. Both NMF and DNMF algorithms will be presented analytically below.

Fig. 16. System architecture for facial expression recognition in frontal face videos

The aim of NMF is to decompose a facial image $\mathbf{x}_j$ into the form $\mathbf{x}_j \approx \mathbf{Z}\mathbf{h}_j$, i.e. into a set of basis images (the columns of $\mathbf{Z}$) combined by a set of weights $\mathbf{h}_j$. The vector $\mathbf{h}_j$ can also be considered as the projection vector of the original facial vector $\mathbf{x}_j$ onto a lower-dimensional feature space. In order to apply NMF to the database Y, the matrix $\mathbf{X} \in \mathbb{R}^{S \times L} = [x_{i,j}]$ should be constructed, where $x_{i,j}$ is the ith element of the jth image, S is the number of pixels and L is the number of images in the database. In other words, the jth column of $\mathbf{X}$ is the facial image $\mathbf{x}_j$ in vector form (i.e. $\mathbf{x}_j \in \mathbb{R}_+^S$). NMF aims at finding two matrices $\mathbf{Z} \in \mathbb{R}_+^{S \times M} = [z_{i,k}]$ and $\mathbf{H} \in \mathbb{R}_+^{M \times L} = [h_{k,j}]$ such that

$$\mathbf{X} \approx \mathbf{Z}\mathbf{H}, \quad (43)$$

where M is the number of dimensions taken under consideration (usually $M \ll S$). The NMF factorization is the outcome of the following optimization problem:

$$\min_{\mathbf{Z},\mathbf{H}} D_N(\mathbf{X}\|\mathbf{Z}\mathbf{H}) \quad \text{subject to} \quad (44)$$

$$z_{i,k} \ge 0, \quad h_{k,j} \ge 0, \quad \sum_i z_{i,j} = 1 \;\; \forall j. \quad (45)$$

The update rules for the weight matrix H and the basis matrix Z can be found in [102].
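For reference, the classical multiplicative updates for a KL-type divergence $D_N(\mathbf{X}\|\mathbf{Z}\mathbf{H})$, in the spirit of [102], can be sketched as follows; the column renormalization of Z enforces the constraint in (45):

```python
import numpy as np

def nmf(X, M, n_iter=200, eps=1e-9, seed=None):
    """Decompose a non-negative (S, L) matrix X as X ~ Z H, eq. (43)."""
    rng = np.random.default_rng(seed)
    S, L = X.shape
    Z = rng.random((S, M)) + eps
    H = rng.random((M, L)) + eps
    for _ in range(n_iter):
        R = X / (Z @ H + eps)                    # ratio X_ij / (ZH)_ij
        H *= (Z.T @ R) / (Z.sum(axis=0)[:, None] + eps)
        R = X / (Z @ H + eps)
        Z *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
        Z /= Z.sum(axis=0, keepdims=True)        # sum_i z_ik = 1, eq. (45)
    return Z, H
```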

In order to incorporate discriminant constraints into the NMF cost function (44), we should use the information regarding the separation of the vectors $\mathbf{h}_j$ into different classes. Let us assume that the vector $\mathbf{h}_j$ that corresponds to the jth column of the matrix $\mathbf{H}$ is the coefficient vector for the ρth facial image of the rth class, denoted as $\boldsymbol{\eta}_{(\rho)}^{(r)} = [\eta_{(\rho),1}^{(r)}, \ldots, \eta_{(\rho),M}^{(r)}]^T$. The mean vector of the vectors $\boldsymbol{\eta}_{(\rho)}^{(r)}$ for the class r is denoted as $\boldsymbol{\mu}^{(r)} = [\mu_1^{(r)}, \ldots, \mu_M^{(r)}]^T$ and the mean of all classes as $\boldsymbol{\mu} = [\mu_1, \ldots, \mu_M]^T$. The cardinality of a facial class $Y_r$ is denoted by $N_r$. Then, the within-class scatter matrix for the coefficient vectors $\mathbf{h}_j$ is defined as:


$$\mathbf{S}_w = \sum_{r=1}^{6} \sum_{\rho=1}^{N_r} (\boldsymbol{\eta}_{(\rho)}^{(r)} - \boldsymbol{\mu}^{(r)}) \cdot (\boldsymbol{\eta}_{(\rho)}^{(r)} - \boldsymbol{\mu}^{(r)})^T, \quad (46)$$

whereas the between-class scatter matrix is defined as

$$\mathbf{S}_b = \sum_{r=1}^{6} N_r \cdot (\boldsymbol{\mu}^{(r)} - \boldsymbol{\mu}) \cdot (\boldsymbol{\mu}^{(r)} - \boldsymbol{\mu})^T. \quad (47)$$

The discriminant constraints are incorporated by requiring $\mathrm{tr}[\mathbf{S}_w]$ to be as small as possible while $\mathrm{tr}[\mathbf{S}_b]$ is as large as possible. Thus, the cost function to be minimized in this case is:

$$D_d(\mathbf{X}\|\mathbf{Z}_D\mathbf{H}) = D_N(\mathbf{X}\|\mathbf{Z}_D\mathbf{H}) + \gamma\,\mathrm{tr}[\mathbf{S}_w] - \delta\,\mathrm{tr}[\mathbf{S}_b], \quad (48)$$

where γ and δ are constants. Following the same expectation-maximization (EM) approach used by NMF techniques [102], the following update rule is derived for the weight coefficients $h_{k,j}$ that belong to the rth facial class:

$$h_{k,j}^{(t)} = \frac{T_1 + \sqrt{T_1^2 + 4\left(2\gamma - (2\gamma + 2\delta)\frac{1}{N_r}\right) h_{k,j}^{(t-1)}}}{2(2\gamma + 2\delta)\frac{1}{N_r}}, \quad (49)$$

where $T_1$ is given by

$$T_1 = (2\gamma + 2\delta)\left(\frac{1}{N_r}\sum_{\lambda, \lambda \ne l} h_{k,\lambda}\right) - 2\delta\mu_k - 1. \quad (50)$$

The update rules for the bases $\mathbf{Z}_D$ are given by

$$z_{i,k}^{\prime(t)} = z_{i,k}^{(t-1)} \frac{\sum_j h_{k,j}^{(t)} \dfrac{x_{i,j}}{\sum_l z_{i,l}^{(t-1)} h_{l,j}^{(t)}}}{\sum_j h_{k,j}^{(t)}}, \quad (51)$$

and

$$z_{i,k}^{(t)} = \frac{z_{i,k}^{\prime(t)}}{\sum_l z_{l,k}^{\prime}}. \quad (52)$$

The above decomposition is a supervised non-negative matrix factorization method that decomposes the facial images into parts while enhancing class separability. The matrix $\mathbf{Z}_D^\dagger = (\mathbf{Z}_D^T \mathbf{Z}_D)^{-1} \mathbf{Z}_D^T$, the pseudo-inverse of $\mathbf{Z}_D$, is then used for extracting the discriminant features as $\mathbf{x}' = \mathbf{Z}_D^\dagger \mathbf{x}$. The most interesting property of the DNMF algorithm is that it decomposes the image into facial areas, i.e. mouth, eyebrows, eyes, and focuses on extracting the information hidden in them. For testing, the facial image $\mathbf{x}_j$ is projected onto the low-dimensional feature space produced by the application of the DNMF algorithm:

$$\mathbf{x}_j' = \mathbf{Z}_D^\dagger \mathbf{x}_j. \quad (53)$$


For the projection $\mathbf{x}_j'$ of the facial image $\mathbf{x}_j$, the distance from each class center is calculated. The smallest distance, defined as

$$r_j = \arg\min_{k \in \{1,\ldots,6\}} \|\mathbf{x}_j' - \boldsymbol{\mu}^{(k)}\|, \quad (54)$$

is the one that is taken as the output of the DNMF system.
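At test time this amounts to a projection followed by a nearest-class-center rule, eqs. (53)-(54); Z_D and the class means are assumed to come from DNMF training:

```python
import numpy as np

def classify_dnmf(x, Z_D, class_means):
    """Assign a row-scanned face vector x to one of the six classes.

    class_means holds the centers mu^(k) of the six facial expression
    classes in the DNMF feature space, one per row.
    """
    Z_pinv = np.linalg.pinv(Z_D)          # Z_D^+ = (Z_D^T Z_D)^-1 Z_D^T
    x_proj = Z_pinv @ x                   # eq. (53)
    dists = np.linalg.norm(class_means - x_proj, axis=1)
    return int(np.argmin(dists))          # eq. (54), 0-based class index
```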

Geometrical Information Extraction

The geometrical information extraction is done by a grid tracking system based on deformable models [103]. The tracking is performed using a pyramidal implementation of the well-known Kanade–Lucas–Tomasi (KLT) algorithm. The user has to manually place a number of Candide grid nodes on the corresponding positions of the face depicted in the first frame of the image sequence. The algorithm automatically adjusts the grid to the face and then tracks it through the image sequence as it evolves through time. At the end, the grid tracking algorithm produces the deformed Candide grid that corresponds to the last frame, i.e. the one that depicts the greatest intensity of the facial expression. The geometrical information used from the jth video sequence is the set of displacements $\mathbf{d}_j^i$ of the nodes of the Candide grid, defined as the difference between the coordinates of each node in the first and last frame [103]:

$$\mathbf{d}_j^i = [\Delta x_j^i \; \Delta y_j^i]^T, \quad i \in \{1,\ldots,K\} \text{ and } j \in \{1,\ldots,N\}, \quad (55)$$

where i is an index that refers to the node under consideration. In our case, K = 104 nodes were used. For every facial video in the training set, a feature vector $\mathbf{g}_j$ of Q = 2 × 104 = 208 dimensions, containing the geometrical displacements of all grid nodes, is created: $\mathbf{g}_j = [\mathbf{d}_j^1, \mathbf{d}_j^2, \ldots, \mathbf{d}_j^K]^T$. Let U be the video database that contains the facial videos, clustered into six different classes $\mathcal{U}_k$, k = 1, ..., 6, each one representing one of the six basic facial expressions. The feature vectors $\mathbf{g}_j \in \mathbb{R}^Q$, properly labelled with the true corresponding facial expression, are used as input to a multiclass SVM that will be described in the following section.
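The construction of g_j is then a trivial flattening of the per-node displacements between the tracked grids of the first and last frames (a minimal sketch; the grids are assumed to be (K, 2) arrays of node coordinates):

```python
import numpy as np

def displacement_features(grid_first, grid_last):
    """Build g_j of eq. (55) from two tracked Candide grids.

    grid_first, grid_last : (K, 2) node coordinates in the neutral
    (first) and apex (last) frame; the result has Q = 2K dimensions
    (208 for the K = 104 nodes used here).
    """
    d = grid_last - grid_first      # per-node [dx, dy] of eq. (55)
    return d.reshape(-1)            # feature vector g_j
```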

Support Vector Machines

Consider the training data $(\mathbf{g}_1, l_1), \ldots, (\mathbf{g}_N, l_N)$, where $\mathbf{g}_j \in \mathbb{R}^Q$, j = 1, ..., N, are the deformation feature vectors and $l_j \in \{1,\ldots,6\}$, j = 1, ..., N, are the corresponding facial expression labels. The approach implemented for the multiclass problem of facial expression recognition is the one described in [104], which solves a single optimization problem covering all classes (facial expressions). This approach constructs six two-class rules, where the kth function $\mathbf{w}_k^T \phi(\mathbf{g}_j) + b_k$ separates training vectors of class k from the rest of the vectors. Here, φ is the function that maps the deformation vectors to a higher-dimensional space (where the data are supposed to be linearly or near-linearly separable), $\mathbf{w}_k$ are the elements of the vector of the optimal separating hyperplane created by the decision function, and $b_k$ are the elements of the bias vector $\mathbf{b} = [b_1, \ldots, b_6]^T$. Hence, there are six decision functions, all obtained by solving the single optimization problem:

$$\min_{\mathbf{w}, \mathbf{b}, \boldsymbol{\xi}} \; \frac{1}{2} \sum_{k=1}^{6} \mathbf{w}_k^T \mathbf{w}_k + C \sum_{j=1}^{N} \sum_{k \ne l_j} \xi_j^k \quad (56)$$

subject to the constraints:

$$\mathbf{w}_{l_j}^T \phi(\mathbf{g}_j) + b_{l_j} \ge \mathbf{w}_k^T \phi(\mathbf{g}_j) + b_k + 2 - \xi_j^k, \quad \xi_j^k \ge 0, \quad j = 1, \ldots, N, \; k \in \{1,\ldots,6\}, \quad (57)$$

where C is the penalty parameter for non-linear separability and $\boldsymbol{\xi} = [\ldots, \xi_i^m, \ldots]^T$ is the slack variable vector. Then, the function used to calculate the distance of a sample from each class center is defined as:

$$s(\mathbf{g}) = \arg\max_{k=1,\ldots,6} \left(\mathbf{w}_k^T \phi(\mathbf{g}) + b_k\right). \quad (58)$$

That distance is considered the output of the SVM-based geometrical information extraction procedure. A linear kernel is used for the SVM system.
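For illustration, a close relative of this classifier is available in scikit-learn; the Crammer-Singer objective of LinearSVC is a joint multiclass formulation similar to (56)-(57), though not identical to the scheme of [104], and the training data below are random placeholders:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data: N displacement vectors g_j (eq. 55) with labels 1..6.
rng = np.random.default_rng(0)
G = rng.standard_normal((120, 208))           # 208 = 2 x 104 grid nodes
y = rng.integers(1, 7, size=120)              # six basic facial expressions

clf = LinearSVC(multi_class="crammer_singer", C=1.0)
clf.fit(G, y)

# decision_function returns w_k^T g + b_k for every class k; eq. (58)
# then simply picks the class with the maximal value.
scores = clf.decision_function(G)
s = clf.classes_[np.argmax(scores, axis=1)]   # equivalent to clf.predict(G)
```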

Fusion of Texture and Geometrical Information

The image $\mathbf{x}_j$ and the corresponding vector of geometrical displacements $\mathbf{g}_j$ are taken into consideration. The DNMF algorithm, applied to the image $\mathbf{x}_j$, produces the distance $r_j$ as a result, while the SVM, applied to the vector of geometrical displacements $\mathbf{g}_j$, produces the distance $s_j$ as the equivalent result. The distances $r_j$ and $s_j$ are normalized to [0, 1] using Gaussian normalization. Thus, a new feature vector $\mathbf{c}_j$, defined as

$$\mathbf{c}_j = [r_j \; s_j]^T, \quad (59)$$

containing information from both sources, is created. This feature vector is used as input to a second SVM system, similar to the one described in the previous section. The output of that system is the label $l_j$ that classifies the sample under examination into one of the six classes (facial expressions).
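The fusion step thus stacks the two normalized scores into c_j (59) and trains one more SVM on them. A sketch, with a z-score based squashing as our reading of "Gaussian normalization" and random placeholder scores:

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_normalize(v):
    """Map raw scores into [0, 1] through a clipped z-score."""
    z = (v - v.mean()) / (3 * v.std() + 1e-9)
    return np.clip((z + 1) / 2, 0.0, 1.0)

# r, s: per-sample outputs of the DNMF stage and of the geometry SVM.
rng = np.random.default_rng(1)
r, s = rng.random(120), rng.random(120)
labels = rng.integers(1, 7, size=120)

C_feat = np.column_stack([gaussian_normalize(r),
                          gaussian_normalize(s)])      # c_j of eq. (59)
fusion = SVC(kernel="linear").fit(C_feat, labels)
final_labels = fusion.predict(C_feat)
```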

9 An Anthropocentric Video Content Description Structure Based on MPEG-7

Nowadays, progress in image and video analysis spans the space between semantic description and low-level processing and evolves towards a more sophisticated interpretation of the outputs of low-level feature extraction algorithms. Many algorithms [105–109] have been developed for content-based image retrieval (CBIR) applications in the past 20 years, which proves that this is an expanding area of research.

MPEG-7 is the most prominent scheme for multimedia content description. However, its great breadth in terms of descriptors and description schemes makes it hard to use in specific fields. MPEG-7 profiles have been introduced to solve this problem [110]. Here we present a new MPEG-7 profile, which can be used in video content description and retrieval applications. Profiles, as defined in [110], are sets of tools which provide functionalities for a certain application class. Actor identities, status, activities and behavior are the most important semantics in audio-visual content description, notably when narrative is involved [111], e.g. in movies and documentaries. Anthropocentric video content descriptors (AVCDs) constitute a framework for profiling the MPEG-7 video content format. The proposed anthropocentric (human-centered) MPEG-7 profile provides supplementary functionalities for video content description, based on human feature extraction, such as those coming from face detection, body motion estimation, face/body/body-parts trajectory estimation, facial expression recognition, etc.

The proposed profile provides a structure which corresponds to a human-centered (to be called, from now on, anthropocentric) perspective of the information that can be ingested from a movie. We rely on past and ongoing research efforts that attempt to tackle the problems of face detection, face/body segmentation and tracking, facial expression recognition and video shot transition detection [108, 109, 112, 113] to produce the necessary features for such a description. The outputs of these algorithms are described in a more normative way and are organized in several extended MPEG-7 types, which will be explained in more detail in the subsequent paragraphs. In this way, the high complexity of the MPEG-7 scheme is simplified in order to provide a user-friendly video content description profile. This is achieved by organizing the derived descriptors within description schemes. Moreover, the profile is structured in a way that resembles how humans organize low-level visual information in order to extract semantic information. Such a profile can easily be used by the audio-visual production communities (e.g. film directors, editors) if a friendly user interface is employed that hides the complexity of the MPEG-7 schemes.

The basic idea is to observe humans and their environment in video shots and organize the video content description according to our perception of humans (and their context/background). Therefore, this profile introduces a structure in which one can fill in basic information that will subsequently be used in order to extract semantic information. Figure 17 illustrates the differences between a typical MPEG-7 file and an MPEG-7 file that is profiled with the proposed structure; both attempt to describe an actor appearance in a shot. As can be seen, we use the intuitive notions of actor, actor appearance and actor instance (actor picture in a video frame) in the description. Such notions are absent in the pure MPEG-7 description, which uses still regions and moving regions as elementary descriptors (Fig. 18). A more detailed description of the differences between the two representations is given in the next paragraph.

Fig. 17. ActorAppearance description scheme (DS) vs. MovingRegion DS

Fig. 18. Actor appearances (Frames 1–4) corresponding to Fig. 17

This framework is object-oriented in the sense of object-oriented programming (OOP). Objects exist in a multimedia environment (a container object), in which every object is constructed and which instantiates its member variables, which can also be objects. In contrast with OOP, there are no interactions between objects (no messages are sent between objects). Thus, we can safely say that in this object-oriented framework (OOF) one can see relations between objects (inheritance) and interconnections of objects (encapsulation). The inheritance relation, which is implemented within the classes, hides important information. This is the most essential difference between this framework and the MPEG-7 one. In the proposed scheme, an object-based description of the movie can be realized. Thus, a video is described from the perspective of its actors, objects and background (scene), and not as a mere sequential flow of frames. The advantage of using this approach in video analysis applications is faster access to useful information. This perspective provides several interesting aspects, which will be discussed later on. The motivation for using main objects and actors is based on the simple fact that they constitute the essential entities within movie narrative [111].

The anthropocentric notion is introduced in order to fulfill the need of several applications to extract results which match the human interpretation of video (primarily movie) content. Let us suppose a movie shot shows persons entering/leaving a building. Person or face detection can be employed, and every detected face/person (called an actor in this case) can subsequently be tracked to determine its trajectory, and hence the direction of its motion, to be stored in an actor appearance structure (to be defined in the subsequent section). This structure proposes a novel and normative way of storing the results of a tracked actor in this application. In movies, it is often the case that actors have some predefined attributes, such as the role or roles they play within the movie, their real name, etc. These attributes are related to the actor appearance within the framework. In Fig. 19, the MPEG-7 file which can be exported from an application that uses this profile is shown. Similar video content descriptions are useful in other applications as well, where humans are the most important entity in a video shot, e.g. in visual surveillance. Although many descriptors and description schemes are already defined within MPEG-7, there are still applications that have to tackle the problem of implementing semantics within an MPEG-7 file. Description schemes (DSs) such as the SemanticBaseDS defined in MPEG-7 are a good solution for generic applications. However, by narrowing down these DSs, we can support several video analysis applications in a better way (in terms of retrieval speed and storage capacity), as well as resolve complexity issues that arise (less space, more compact description, etc.). In the AVCDs profile, the defined types extend the MPEG-7 ISO standard and provide new functionalities for video analysis applications. Two main categories of classes exist in the profile: containers and objects. At this point, it is important to make the distinction between objects as a category of classes and objects as classes. More specifically, the class of an object is simply the formal representation of an object of interest in a movie, such as a car, a train, a ball, etc. On the other hand, the object as a category of classes incorporates all the different objects that play a fundamental role in a movie, e.g. actors. Once this distinction has been made clear, the profile can be discussed in detail.

9.1 The Objects

Objects are the structural elements of the profile. Within a shot or a take, an object can be detected at the video frame level and tracked over a number of video frames. In this context, two description types can be defined, namely instances (containing static information at the level of video frames) and appearances (containing dynamic information at the level of lists of video frames). For example, an ActorAppearanceTypeDS contains information about the appearance and specific status (expressions, gestures, etc.), as well as several other attributes, of a particular actor over a number of video frames. One shot may contain more than one actor appearance. One actor appearance may contain several actor instances. Four object descriptors (to be defined subsequently) are implemented in this category. All the above-mentioned classes then act on the containers, and all the interactions are logged within the classes' attributes.

Fig. 19. The ActorAppearanceDS has all the information related to an actor and his appearance in a video shot, e.g. facial expressions, facial bounding box position, pose


Fig. 20. The ActorAppearanceTypeDS

The ActorAppearanceType DS

The ActorAppearanceType DS portrays information about the activity of an actor in a particular time interval within a movie. The attributes of this class are depicted in Fig. 20. The stored values of the class attributes reveal low-level information about the appearance of an actor in a certain time period. Time codes and duration information of the appearance are logged, as well as motion activity. The most essential part is the list of ActorInstanceType entries, which will be detailed later on. One actor may be associated with more than one ActorAppearance type. One can also define groups of ActorAppearanceTypeDS, so that a better interpretation of the results can be achieved. For generality, this grouping is not normative and is left to the judgment of each implementer vis-a-vis the application needs. The proposed framework supports CollectionTypeDS in order to create groups of ActorAppearanceTypeDS, as defined in MPEG-7 [114].
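To make the structure concrete, a hypothetical ActorAppearance entry could be serialized as in the following sketch; the element names mirror Figs. 19-21, but the exact attributes and time-code formats are illustrative, not the normative profile syntax:

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization of one actor appearance (names after
# Figs. 19-21; not the normative MPEG-7 profile schema).
appearance = ET.Element("ActorAppearanceType", id="actor01-appearance03")
ET.SubElement(appearance, "MediaTimePoint").text = "T00:12:37:0F25"
ET.SubElement(appearance, "MediaDuration").text = "PT4S"
ET.SubElement(appearance, "MotionActivity").text = "5"

instance = ET.SubElement(appearance, "ActorInstanceType", frame="18925")
part = ET.SubElement(instance, "BodyPartsType", annotation="head")
roi = ET.SubElement(part, "ROIDescriptionModelType")
ET.SubElement(roi, "BoundingBox", x="312", y="96", width="88", height="104")
ET.SubElement(instance, "Status", expression="surprise", activity="5")

print(ET.tostring(appearance, encoding="unicode"))
```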

The ActorInstanceType DS

The ActorInstanceType contains low-level information about an actor within a single video frame. In the proposed anthropocentric framework, one can define characteristic actor instances and key actor instances, in the same way that key frames are defined in shot detection applications. As seen in Fig. 20, the ActorInstance object is contained in an ActorAppearance one. The attributes assigned to the ActorInstanceType are visualized in Fig. 21. An actor instance is characterized by its body parts, whose description is contained in a list of BodyPartsTypeDS entries, and by its status, describing, e.g., actor expressions and activities.

The BodyPartsTypeDS is a description scheme which contains information for a specific region of an actor (e.g. the head, an arm, the whole body). It carries an annotation for the part under description, as well as a reference to its father body part, if one exists. For example, one might want to follow the movement of the hands: the description scheme can create one list of body parts for each arm and, subsequently, one for each hand. The reason why this is not implemented in a recursive way is compatibility with relational databases and ease of implementation. The BodyPartsTypeDS is shown in Fig. 22.

Fig. 21. The ActorInstance class

Fig. 22. The BodyParts class

In the BodyPartsTypeDS, apart from the information about the parent part and the annotation, there is the body part border description, provided by the ROIDescriptionModelTypeDS. Region/person/face detection and tracking algorithms use different types to describe their outputs. The most popular ROI descriptions are bounding boxes, convex hulls, feature point lists (used by object tracking algorithms) and feature point grids (used by elastic graph based object tracking algorithms) [115].

In the geometry tag of the ROIDescriptionModel, different attributes exist to implement these ROI border description types. Since actor behavior and status are important for video content characterization, the Status tag is used to characterize facial expressions and gestures. Finally, the Activity tag is used in the same way as the MPEG-7 MotionDS description scheme. It is an activity intensity integer and indicates, on a normative scale from 0 to 7, the amount of activity for the specific ActorInstanceTypeDS instantiation. As mentioned before, actor instances are static (non-temporal) parts of an actor appearance. The Activity tag, though, is used in its narrative sense: e.g. we can say that an actor instance is part of a fast walk movement. It cannot be implemented at a higher level, e.g. in the actor appearance, because the activity may vary over time. The Activity tag can also be used to extract key actor instances, where one has high or low activity. Therefore, we can extract semantic audio-visual information.

Fig. 23. The ObjectAppearance and ObjectInstance classes

The ObjectAppearanceType and ObjectInstanceType DSs

Using the same logic, the ObjectAppearance and ObjectInstance types, shown in Fig. 23, describe the features of an object of interest. Notice that here the Status tag contains only the activity tag, because no pose or expression information can be extracted for objects (in contrast with actors).

10 Conclusions

Anthropocentric analysis of movie content is a new approach which enables interesting features in many areas of video processing. The main idea is to find robust algorithms, such as the ones described in this chapter, to extract information, and then to use a cognitive structure to store this information, thus revealing the knowledge in the way the low-level features of an image (frame) are connected. The MPEG-7 profile discussed herein attempts, and to a certain degree succeeds, to fill this need, which can be interpreted as bridging the semantic gap.

All the algorithms presented in this chapter are focused on humans and thus provide interesting features for an anthropocentric analysis of a movie. We have chosen some of the basic analysis tasks, such as face detection and face tracking, as well as some more advanced ones, e.g. face clustering, facial expression analysis and face verification. Our aim was to show that, by focusing on humans within a movie, and because of their important role within it, the analysis achieves a certain semantic level which is very close to how humans interpret what they are seeing and/or feeling in a movie. Anthropocentric analysis is a framework which integrates humans in a twofold manner: first, in the sense of what humans (actors) are doing within a movie and, second, in how humans (spectators) interpret what they are seeing.


11 Acknowledgment

The work presented was partially supported by NM2 (New Media for a New Millennium), a European Integrated Project (http://www.ist-nm2.org), funded under the European Commission IST FP6 program.

References

1. M.-H. Yang, D. J. Kriegman, and N. Ahuja, Detecting faces in images: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.

2. E. Hjelmas and B. K. Low, Face detection: A survey, Computer Vision and Image Understanding, vol. 83, pp. 236–274, 2001.

3. G. Welch and E. Foxlin, Motion tracking: No silver bullet, but a respectable arsenal, IEEE Computer Graphics and Applications, special issue on "Tracking", vol. 22, no. 6, pp. 24–38, November/December 2002.

4. T. B. Moeslund and E. Granum, A survey of computer vision-based human motion capture, Computer Vision and Image Understanding, vol. 81, pp. 231–268, 2001.

5. D. M. Gavrila, The visual analysis of human movement: A survey, Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82–98, 1999.

6. G. Chow and L. Xiaobo, Towards a system for automatic facial feature detection, Pattern Recognition, vol. 26, no. 12, pp. 1739–1755, 1993.

7. G. Feng and P. Yuen, Multi-cues eye detection on gray intensity image, Pattern Recognition, vol. 34, no. 5, pp. 1033–1046, 2001.

8. K. Lam and H. Yan, Locating and extracting the eye in human face images, Pattern Recognition, vol. 29, no. 5, pp. 771–779, 1996.

9. T. Chen and R. Rao, Audio-visual integration in multimodal communication, Proceedings of the IEEE, vol. 86, no. 5, pp. 837–852, 1998.

10. M. J. R., Visual Speech Recognition with Stochastic Networks, Proceedings of the IEEE, vol. 86, no. 5, pp. 837–852, 1998.

11. E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998.

12. M. Pollefeys, Tutorial on 3D modelling from figures, http://www.esat.kuleuven.ac.be/ pollefey/tutorial/, June 2000.

13. N. Vretos, V. Solachidis, and I. Pitas, A mutual information based face clustering algorithm for movies, in: IEEE International Conference on Multimedia and Expo, pp. 1013–1016, 2006.

14. O. Arandjelovic and A. Zisserman, Automatic face recognition for film character retrieval in feature-length films, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, 2005, pp. 860–867.

15. A. Fitzgibbon and A. Zisserman, On affine invariant clustering and automatic cast listing in movies, in: ECCV, 2002.

16. T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and D. A. Forsyth, Names and faces in the news, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'04), vol. 2, IEEE, 2004, pp. 848–854.


17. J. Matas, M. Hamou, K. Jonsson, J. Kittler, Y. Li, C. Kotropoulos, A. Tefas, I. Pitas, T. Tan, H. Yan, F. Smeraldi, J. Bigun, N. Capdevielle, W. Gerstner, S. Ben-Yacouba, Y. Abdelaoued, and E. Mayoraz, Comparison of face verification results on the XM2VTS database, in: Proc. of 2000 Int. Conf. on Pattern Recognition (ICPR'00), 2000, pp. 858–863.

18. K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. Tek, G. Akar, F. Deravi, and N. Mavity, Face verification competition on the XM2VTS database, in: AVBPA03, 2003, pp. 964–974.

19. L. Juwei, K. Plataniotis, and A. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 195–200, 2003.

20. ——, Face recognition using kernel direct discriminant analysis algorithms, IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 117–126, 2003.

21. P. Ekman and W. V. Friesen, Emotion in the Human Face. Prentice Hall, New Jersey, 1975.

22. T. Kanade, J. Cohn, and Y. Tian, Comprehensive database for facial expression analysis, in: Proceedings of IEEE International Conference on Face and Gesture Recognition, March 2000, pp. 46–53.

23. M. Pantic and L. Rothkrantz, Expert system for automatic analysis of facial expressions, Image and Vision Computing, vol. 18, no. 11, pp. 881–905, 2000.

24. K. Sobottka and I. Pitas, Looking for faces and facial features in color images, Pattern Recognition and Image Analysis: Advances in Mathematical Theory and Applications, Russian Academy of Sciences, vol. 7, no. 1, pp. 124–137, 1997.

25. R. Lienhart and J. Maydt, An extended set of Haar-like features for rapid object detection, in: Proceedings of the 2002 International Conference on Image Processing, vol. 1, 2002.

26. J. Shi and C. Tomasi, Good features to track, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR94), Seattle, United States, June 1994, pp. 593–600.

27. B. D. Zarit, B. J. Super, and F. K. H. Quek, Comparison of five color models in skin pixel classification, in: ICCV99 International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems (RATFG-RTS99), Corfu, Greece, September 1999, pp. 58–63.

28. B. Martinkauppi, M. Soriano, and M. Laaksonen, Behavior of skin color under varying illumination seen by different cameras in different color spaces, in: Machine Vision Applications in Industrial Inspection IX, Martin Hunt, Editor, Proceedings of SPIE, vol. 4301, Coimbra, Portugal, July 1999, pp. 102–112.

29. V. Vezhnevets, V. Sazonov, and A. Andreeva, A survey on pixel-based skin color detection techniques, in: International Conference on Computer Graphics Between Europe and Asia (GRAPHICON-2003), Moscow, Russia, September 2003.

30. A. Fitzgibbon and R. Fisher, A buyer's guide to conic fitting, in: Fifth British Machine Vision Conference (BMVC99), Birmingham, UK, 1995, pp. 513–522.

31. E. Loutas, K. Diamantaras, and I. Pitas, Occlusion resistant object tracking, in: IEEE International Conference on Image Processing (ICIP01), vol. 2, Thessaloniki, Greece, October 2001, pp. 65–68.

Page 492: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

488 N. Vretos et al.

32. Z. Zhou and X. Geng, Projection functions for eye detection, Pattern Recogni-tion, vol. 37, no. 5, pp. 1049–1056, 2004.

33. J. Wu and Z. Zhou, Efficient face candidates selector for face detection, PatternRecognition, vol. 36, no. 5, pp. 1175–1186, 2003.

34. O. Jesorsky, K. Kirchberg, R. Frischholz, et al., Robust face detection usingthe hausdorff distance, Proceedings of Audio and Video Based Person Authen-tication, pp. 90–95, 2001.

35. W. Rucklidge, Efficient Visual Recognition Using the Hausdorff Distance.Springer, 1996.

36. D. Cristinacce, T. Cootes, and I. Scott, A multi-stage approach to facial featuredetection, 15th British Machine Vision Conference, London, England, pp. 277–286, 2004.

37. K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, XM2VTSDB: TheExtended M2VTS Database, Second International Conference on Audio andVideo-based Biometric Person Authentication, vol. 626, 1999.

38. The bioid face database.39. M. Turk and A. Pentland, Face recognition using eigenfaces, Computer Vi-

sion and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE ComputerSociety Conference on, pp. 586–591, 1991.

40. J. Canny, A computational approach to edge detection, IEEE Transactions onPattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.

41. A. MacLeod and Q. Summerfield, A procedure for measuring auditory andaudio-visual speech-reception thresholds for sentences in noise: rationale, eval-uation, and recommendations for use. British Journal of Audiology, vol. 24,no. 1, pp. 29–43, 1990.

42. J. Luettin, N. Thacker, and S. Beet, Speechreading using shape and inten-sity information, Proceedings of the Fourth IEEE International Conference onSpoken Language Processing, vol. 1, pp. 58–61, 1996.

43. P. de Cuetos, C. Neti, and A. Senior, Audio-visual intent-to-speak detection forhuman–computer interaction, ICASSP IEEE INT CONF ACOUST SPEECHSIGNAL PROCESS PROC, vol. 4, pp. 2373–2376, 2000.

44. M. Siracusa, L. Morency, K. Wilson, J. Fisher, and T. Darrell, A multi-modalapproach for determining speaker location and focus, Proceedings of the FifthInternational Conference on Multimodal interfaces, pp. 77–80, 2003.

45. S. Siatras, N. Nikolaidis, and I. Pitas, Visual speech detection using mouthregion intensities, in Proceedings of European Signal Processing Conference(EUSIPCO 2006), September 2006.

46. P. Viola and M. Jones, Robust Real-Time Face Detection, International Jour-nal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

47. P. Viola and M. Jones, Rapid object detection using a boosted cascade ofsimple features, Proceedings of IEEE CVPR, vol. 1, pp. 511–518, 2001.

48. S. Asteriadis, N. Nikolaidis, and I. Pitas, An Eye Detection Algorithm UsingPixel to Edge Information, in: Proceedings of ISCCSP 2006, vol. 1, 2006.

49. S. Kay, Fundamentals of Statistical Signal Processing, Volume 2: DetectionTheory. Prentice Hall PTR, 1998.

50. R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision.Cambridge University Press, 2003.

51. O. Faugeras, What can be seen in three dimensions with an uncalibrated stereorig, Proceedings of the Second European Conference on Computer Vision, pp.563–578, 1992.

Page 493: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Anthropocentric Semantic Information Extraction from Movies 489

52. P. Beardsley, A. Zisserman, and D. Murray, Sequential Updating of Projectiveand Affine Structure from Motion, International Journal of Computer Vision,vol. 23, no. 3, pp. 235–259, 1997.

53. R. Hartley, Euclidean reconstruction from uncalibrated views, Applications ofInvariance in Computer Vision, vol. 825, pp. 237–256, 1994.

54. M. Rydfalk, CANDIDE: A parameterized face, Linkoping University, Tech.Rep., 1978.

55. B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, Bundle Adjustment –A modern synthesis, Vision Algorithms: Theory and Practice, vol. 1883, pp.298–372, 2000.

56. M. Everingham and A. Zisserman, Automated person identification in video.in CIVR, 2004, pp. 289–298.

57. Z. He, X. Xu, and S. Deng, K-anmi: A mutual information based clusteringalgorithm for categorical data, 2005. [Online]. Available: http://www.citebase.org/cgi-bin/citations?id=oai:arXiv.org:cs/0511013

58. R. L. Cannon, J. V. Dave, and J. C. Bezdek, Efficient implementation of thefuzzy c-means clustering algorithms, IEEE Trans. Pattern Anal. Mach. Intell.,vol. 8, no. 2, pp. 248–255, 1986.

59. M. Turk and A. P. Pentland, Eigenfaces for recognition. Journal of CognitiveNeuroscience, vol. 3, no. 1, pp. 71–86, 1991.

60. P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. fisher-faces: Recognition using class specific linear projection. IEEE Transactions onPattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, July1997.

61. M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg,R. P. Wurtz, and W. Konen, Distortion invariant object recognition in thedynamic link architecture. IEEE Transactions on Computers, vol. 42, no. 3,pp. 300–311, Mar. 1993.

62. B. Duc, S. Fischer, and J. Bigun, Face authentication with Gabor informationon deformable graphs. IEEE Transactions on Image Processing, vol. 8, no. 4,pp. 504–516, 1999.

63. C. Kotropoulos, A. Tefas, and I. Pitas, Frontal face authentication using dis-criminating grids with morphological feature vectors. IEEE Transactions onMultimedia, vol. 2, no. 1, pp. 14–26, Mar. 2000.

64. M. Kirby and L. Sirovich, Application of the Karhunen-Loeve procedure forthe characterization of human faces. IEEE Transactions Pattern Analysis andMachine Intelligence, vol. 12, no. 1, pp. 103–108, Jan. 1990.

65. D. L. Swets and J. Weng, Using discriminant eigenfeatures for image retrieval,IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8,pp. 831–836, 1996. [Online]. Available: citeseer.ist.psu.edu/swets96using.html

66. A. Martinez and A. Kak, Pca versus lda,IEEE Trans. Pattern Analysis andMachine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.

67. L. Wiskott, J. Fellous, N. Kruger, and C. von der Malsburg, Face recognitionby elastic bunch graph matching. IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 19, no. 7, pp. 775–779, 1997.

68. A. Tefas, C. Kotropoulos, and I. Pitas, Using support vector machines to en-hance the performance of elastic graph matching for frontal face authentica-tion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23,no. 7, pp. 735–746, 2001.

Page 494: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

490 N. Vretos et al.

69. P. T. Jackway and M. Deriche, Scale-space properties of the multiscalemorphological dilation-erosion, IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 18, no. 1, pp. 38–51, 1996. [Online]. Available:citeseer.ist.psu.edu/jackway92scale.html

70. I. Pitas and A. Venetsanopoulos, Nonlinear Digital Filters: Principles and Ap-plications. Norwell, MA: Kluwer, Academic Publishers, 1990.

71. B. Fasel and J. Luettin, Automatic facial expression analysis: A survey, PatternRecognition, vol. 36, no. 1, pp. 259–275, 2003.

72. I. Cohen, N. Sebe, S. Garg, L. S. Chen, and T. S. Huanga, Facial expressionrecognition from video sequences: temporal and static modelling, ComputerVision and Image Understanding, vol. 91, pp. 160–187, 2003.

73. Y. Zhang and Q. Ji, Active and dynamic information fusion for facial expressionunderstanding from image sequences, IEEE Transactions on Pattern Analysisand Machine Intelligence, vol. 27, no. 5, pp. 699–714, May 2005.

74. M. S. Bartlett, G. Littlewort, I. Fasel, and J. R. Movellan, Real time face detec-tion and facial expression recognition: Development and applications to humancomputer interaction, in: Proceedings of Conference on Computer Vision andPattern Recognition Workshop, vol. 5, Madison, Wisconsin, 16–22 June 2003,pp. 53–58.

75. M. J. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, Coding facial expres-sions with Gabor wavelets, in: Proceedings of the Third IEEE InternationalConference on Automatic Face and Gesture Recognition, 1998, pp. 200–205.

76. M. J. Lyons, J. Budynek, and S. Akamatsu, Automatic classification of singlefacial images, IEEE Transactions on Pattern Analysis and Machine Intelli-gence, vol. 21, no. 12, pp. 1357–1362, 1999.

77. L. Wiskott, J. Fellous, N. Kruger, and C. v. d. Malsburg, Face recognitionby elastic bunch graph matching, IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 19, no. 7, pp. 775–779, July 1997.

78. G. Guo and C. R. Dyer, Learning from examples in the small sample case: Faceexpression recognition, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 35, no. 3, pp. 477–488, June 2005.

79. Z. Zhang, M. Lyons, M. Schuster, and S. Akamatsu, Comparison betweengeometry-based and Gabor-wavelets-based facial expression recognition usingmulti-layer perceptron, in: Proceedings of the Third IEEE International Con-ference on Automatic Face and Gesture Recognition, Nara Japan, 14–16 April1998, pp. 454–459.

80. B. Fasel, Multiscale facial expression recognition using convolutional neuralnetworks, IDIAP, Tech. Rep., 2002.

81. M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, Subject independent facialexpression recognition with robust face detection using a convolutional neuralnetwork, Neural Networks, vol. 16, no. 5–6, pp. 555–559, June–July 2003.

82. M. Rosenblum, Y. Yacoob, and L. S. Davis, Human expression recognition frommotion using a radial basis function network architecture, IEEE Transactionson Neural Networks, vol. 7, no. 5, pp. 1121–1138, September 1996.

83. L. Ma and K. Khorasani, Facial expression recognition using constructivefeedforward neural networks, IEEE Transactions on Systems, Man, AndCybernetics-Part B: Cybernetics, vol. 34, no. 3, pp. 1588–1595, June 2004.

84. S. Dubuisson, F. Davoine, and M. Masson, A solution for facial expression rep-resentation and recognition, Signal Processing: Image Communication, vol. 17,no. 9, pp. 657–673, October 2002.

Page 495: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Anthropocentric Semantic Information Extraction from Movies 491

85. X.-W. Chen and T. Huang, Facial expression recognition: A clustering-basedapproach, Pattern Recognition Letters, vol. 24, no. 9–10, pp. 1295–1302, June2003.

86. Y. Gao, M. Leung, S. Hui, and M. Tananda, Facial expression recognition fromline-based caricatures, IEEE Transactions on Systems, Man and Cybernetics-Part A: Systems and Humans, vol. 33, no. 3, pp. 407–412, May 2003.

87. B. Abboud, F. Davoine, and M. Dang, Facial expression recognition and syn-thesis based on an appearance model, Signal Processing: Image Communica-tion, vol. 19, no. 8, pp. 723–740, 2004.

88. I. A. Essa and A. P. Pentland, Facial expression recognition using a dynamicmodel and motion energy, in: Proceedings of the International Conference onComputer Vision (ICCV 95), Cambridge, MA, 20–23 June 1995.

89. M. Pantic and L. J. M. Rothkrantz, Automatic analysis of facial expressions:The state of the art, IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 22, no. 12, pp. 1424–1445, December 2000.

90. I. A. Essa and A. P. Pentland, Coding, analysis, interpretation, and recogni-tion of facial expressions, IEEE Transactions Pattern Analysis and MachineIntelligence, vol. 19, no. 7, pp. 757–763, July 1997.

91. M. S. Bartlett, G. Littlewort, B. Braathen, T. J. Sejnowski, and J. R. Movellan,An approach to automatic analysis of spontaneous facial expressions, in: Pro-ceedings of Fifth IEEE International Conference on Automatic Face and Ges-ture Recognition (FGR’02), Washington, D.C., 2002.

92. G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman, and T. J. Sejnowski, Clas-sifying Facial Actions, IEEE Trans. on Pattern Analysis and Machine Intelli-gence, vol. 21, no. 10, pp. 974–989, 1999.

93. Y. L. Tian, T. Kanade, and J. Cohn, Recognizing Facial Actions by combin-ing geometric features and regional appearance patterns, Robotics Institute,Carnegie Mellon University, Tech. Rep. CMU-RI-TR-01-01, 2001.

94. J. J. Lien, T. Kanade, J. Cohn, and C. C. Li, Automated facial expressionrecognition based on FACS Action Units, in: Proceedings of Third IEEE Inter-national Conference on Automatic Face and Gesture Recognition, April 1998,pp. 390–395.

95. J. J. Lien, T. Kanade, J. F. Cohn, and C. Li, Detection, tracking, and classifica-tion of Action Units in facial expression, Journal of Robotics and AutonomousSystems, July 1999.

96. Y. L. Tian, T. Kanade, and J. Cohn, Evaluation of Gabor wavelet-based FacialAction Unit recognition in image sequences of increasing complexity, in: Pro-ceedings of the Fifth IEEE International Conference on Automatic Face andGesture Recognition, 2002, pp. 229–234.

97. A. Tefas, C. Kotropoulos, and I. Pitas, Using Support Vector Machines for faceauthentication based on elastic graph matching, in: Proceedings of the IEEEInternational Conference Image Processing (ICIP’2000), 2000, pp. 29–32.

98. H. Drucker, W. Donghui, and V. Vapnik, Support vector machines for spamcategorization, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp.1048–1054, September 1999.

99. A. Ganapathiraju, J. Hamaker, and J. Picone, Applications of support vec-tor machines to speech recognition, IEEE Transactions on Signal Processing,vol. 52, no. 8, pp. 2348–2355, August 2004.

Page 496: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

492 N. Vretos et al.

100. M. Pontil and A. Verri, Support vector machines for 3D object recognition,IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20,no. 6, pp. 637–646, 1998.

101. I. Kotsia and I. Pitas, Facial expression recognition in image sequences usinggeometric deformation features and support vector machines, IEEE Transac-tions on Image Processing, vol. 16, no. 1, pp. 172–187, January 2007.

102. S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, Exploiting discriminant infor-mation in non-negative matrix factorization with application to frontal faceverification, IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 683–695, May 2006.

103. I. Kotsia and I. Pitas, Real time facial expression recognition from image se-quences using support vector machines, in: IEEE International Conference onImage Processing (ICIP), 11–14 September 2005, pp. 966–969.

104. V. Vapnik, Statistical learning theory. Wiley, New York, 1998.105. R. Chellappa, C. L. Wilson, and S. Sirohey, Human and machine recognition

of faces: A survey. Proceedings of the IEEE, vol. 83, no. 5, pp. 705–740, May1995.

106. J. P. Eakins, Retrieval of still images by content, Lectures on informationretrieval, pp. 111–138, 2001.

107. J. K. Aggarwal and Q. Cai, Human motion analysis: A review, Computer Vi-sion and Image Understanding, vol. 73, no. 3, pp. 428–440, 1999.

108. E. Sikudova, M. A. Gavrielides, and I. Pitas, Extracting semantic informationfrom art images, in: Proceedings of International Conference on Computer Vi-sion and Graphics 2004 (ICCVG 2004), Warsaw, Poland, 22–24 September2004.

109. M. Krinidis, G. Stamou, H. Teutsch, S. Spors, N. Nikolaidis, R. Rabenstein, andI. Pitas, An audio-visual database for evaluating person tracking algorithms,in: Proceedings of IEEE International Conference on Acoustics, Speech andSignal Processing, Philadelphia, USA, 18–23 March 2005, pp. 452–455.

110. ISO (International Organization for Standardization), Overview of theMPEG-7 standard, International Organization for Standardization, Geneva,Switzerland, ISO Standard ISO/IEC JTC1/SC29 N4509, Dec. 2001.

111. G. Ahanger and T. D. C. Little, Data semantics for improving retrieval per-formance of digital news video systems, IEEE Transactions on Knowledge andData Engineering, vol. 13, no. 3, pp. 352–360, 2001.

112. M. Kyperountas, Z. Cernekova, C. Kotropoulos, M. Gavrielides, and I. Pitas,Scene change detection using audiovisual clues, in: Proceedings of Norwe-gian Conference on Image Processing and Pattern Recognition (NOBIM 2004),Stavanger, Norway, 27–28 May 2004.

113. Z. Cernekova, I. Pitas, and C. Nikou, Information theory-based shot cut/fadedetection and video summarization, IEEE Transactions on Circuits and Sys-tems for Video Technology, vol. 16, no. 1, pp. 82–91, January 2006.

114. ISO (International Organization for Standardization), Information technology–multimedia content description interface - part 5: Multimedia descriptionschemes, International Organization for Standardization, Geneva, Switzerland,ISO Standard ISO/IEC JTC 1/SC 29 N 4161, Dec. 2001.

115. N. N. G. Stamou and I. Pitas, Object tracking based on morphological elasticgraph matching, in Proceedings of the IEEE International Conference on ImageProcessing (ICIP 2005), Genova, Italy, September 2005.

Page 497: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Organizing Multimedia Information with Maps

Thomas Barecke1, Ewa Kijak2, Marcin Detyniecki1, and Andreas Nurnberger3

1 LIP6, Universite Pierre et Marie Curie – CNRS, Paris, France, [email protected], [email protected]

2 IRISA, Universite de Rennes 1, Rennes, France, [email protected]

3 IWS, Otto-von-Guericke Universitat, Magdeburg, Germany, [email protected]

Summary. Semantic multimedia organization is an open challenge. In this chapter, we present an innovative way of automatically organizing multimedia information to facilitate content-based browsing. It is based on self-organizing maps. The visualization capabilities of the self-organizing map provide an intuitive way of representing the distribution of data as well as the object similarities. The main idea is to visualize similar documents spatially close to each other, while dissimilar documents are placed farther apart. We demonstrate this on the particular case of video information. One key concept is the disregard of the temporal aspect during the clustering. We introduce a novel time bar visualization that reprojects the temporal information. The combination of innovative visualization and interaction methods allows efficient exploration of relevant information in multimedia content.

1 Introduction

A huge and ever increasing amount of digital information is created each day. The capacity of the existing manifold storage devices (for instance hard drives, optical disks, flash memories) increases continuously. Multimedia information in digital formats is, on the one hand, found everywhere in our everyday life, in devices such as portable media players, mobile phones, and digital cameras. Thus, we already rely on the assistance of desktop search engines like Google Desktop, Beagle, or Spotlight for finding locally stored data.

On the other hand, the amount of publicly available information, and the rate at which it grows, is even more impressive. Apart from classical media, the recent web 2.0 trend [1] of sharing user-created content is a major contributor. The blog scene as well as community websites like Flickr [2], MySpace [3], or YouTube [4] continue to grow constantly, both in terms of users and in the sheer amount of data. Facing this amazing amount of information, it has become extremely difficult and time consuming to filter and retrieve the relevant pieces.



A big challenge when dealing with multimedia is usually referred to as the Semantic Gap. It arises from the fact that there is a difference between the technical representation and the actual meaning of a given multimedia document. In other words, we cannot index multimedia information like numerical data, since there is no unique, well-defined semantic for a given document. Ideally, multimedia retrieval should be based on the meaning, but unfortunately, a computer is not able to identify it.

Multimedia retrieval systems [5] that provide satisfying interaction possibilities for all types of multimedia information are not yet available. A particular problem is the ambiguity of visual, audio, and audio-visual information. One question of crucial importance is: How can we efficiently organize personal and public multimedia collections in order to facilitate the user's access? From the user's perspective, two tasks are of special interest in a multimedia retrieval system: the search for specific information and the exploration of a collection. In this chapter we focus on the latter. We are concerned with presenting the information in a convenient form to the user, organizing the data into a structured view. The main target is to present a comprehensive summary of a given collection to the user and to provide her with efficient browsing tools.

A major problem is the curse of dimensionality. For instance, the dimensionality of a simple text document, using a TF/IDF representation, equals the number of words in the dictionary. The RGB description (like that of most other color spaces) for digital images uses three dimensions per pixel. Video information is even richer.

Organizing the data for convenient exploration has two requirements. On the one hand, the dimensionality has to be reduced in order to obtain a visualization in a human-interpretable space. On the other hand, similar data should be grouped together, reducing the total amount of data represented at once. We show that self-organizing maps can fulfill both. We illustrate this on the particular case of video browsing. We also introduce an innovative user interaction tool, an enhanced time bar. In this chapter, we do not focus on the feature extraction process, but rather on the content organization and visualization once the features have been extracted.

The remainder of this chapter is organized as follows. In Sect. 2 we give an overview of related work. Then, we introduce growing self-organizing maps and how they can organize multimedia data. Finally, we illustrate this by focusing on the particular case of video information.

2 Related Work

This chapter faces the challenge of content organization for efficient browsing. A still very common form of content visualization, used by all popular search engines like Google [6], is a simple ranking (by relevance, date, or document name). The origin of this representation lies in the retrieval of textual information using keywords as a query and computing relevance measures of a document for a given query. In the example of Google, the relevance is based both on text similarity, e.g. measured by TF/IDF, and source link-reputation, e.g. measured by PageRank.

However, particularly for large collections, it is more convenient to have similar documents grouped together. The user first browses the group index and then accesses only the documents classified in the group of interest. The main question is: How can we measure the similarity of multimedia documents?

A simple approach that tries to extend text retrieval to other types of data is to index documents manually with keywords. For instance, the Yahoo! Directory and Flickr [2] are based on manually classified documents in hierarchically organized categories. A more recent approach, where this indexing task is performed by the users, are tag clouds with the underlying folksonomy concept. There are several problems with this approach: First of all, a lot of manual work is needed, even if it is distributed. Secondly, the granularity of the keywords is crucial for the performance (e.g. do we assign the keyword "car", the more general keyword "vehicle", or the more specific keyword "sports car" to a given object?). Finally, not everybody would associate the same keywords with a given document. Nevertheless, this approach has become very popular. Researchers try to bypass the granularity problem by creating ontologies, and some scientific work has been dedicated to automatically associating labels with images. The great advantage of keyword-based search is that users are already familiar with it.

In the early days, content-based image retrieval systems were solely based on global low-level features, i.e. color, texture, and shape descriptors. Some well-known examples are Virage [7], Photobook [8] from MIT's Media Laboratory, and IBM's QBIC [9]. Later, region-based systems were introduced [10–12]. These capture local image properties and hence refine the retrieval process. Current state-of-the-art systems, like SIMPLIcity [13], also try to capture semantic concepts through high-level features. An automatic way to obtain high-level descriptors is to apply machine learning techniques to learn their associations with low-level features. Another popular approach is the use of relevance feedback [14, 15]. Here, the user is required to evaluate the results of a query, and the system refines the search based on these preferences.

Current content-based video retrieval systems, like the IBM Video Retrieval System [16] and the MediaMill system [17], are also principally based on low-level features. Machine learning methods are then applied in order to learn to associate high-level semantics with a set of low-level features. This high-level feature extraction is still a major problem, addressed for instance in the TRECVID challenge [18]. Recently, it has been argued that for news video retrieval we need only a few thousand semantic concepts [19]. Thus, it is obvious that even if we are able to describe multimedia content with high-level descriptors, the feature space will always remain very high-dimensional.


Manifold dimensionality reduction methods are available for projecting high-dimensional data into a lower-dimensional space. For a survey we refer the reader to [20, 21]. Probably the most frequently used technique is principal component analysis (PCA), often computed via singular value decomposition (SVD). PCA is a linear method aiming at identifying the directions with highest variance in the feature space. It usually starts with a normalization of each variable to mean zero and standard deviation one. Then, one applies a spectral decomposition to the covariance matrix. The principal components are given by the eigenvectors with the highest associated eigenvalues. These form an orthogonal basis of the low-dimensional space. PCA is optimal in the sense that, when re-projecting the data into the original space, the mean squared error is minimal amongst all possible linear transformations. However, the main disadvantage of PCA and other spectral methods, e.g. multi-dimensional scaling (MDS), which tries to preserve pairwise distances instead of maximizing variance, is the computational complexity arising from the spectral decomposition of a large matrix.
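To make the procedure concrete, here is a minimal PCA sketch in Python/NumPy following exactly the steps above (standardize, decompose the covariance matrix, keep the top eigenvectors); the function name and parameters are our own illustrative choices, not part of the chapter:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components.

    X: (n_samples, n_features) data matrix with non-constant features.
    Returns the (n_samples, k) low-dimensional representation.
    """
    # Normalize each variable to mean zero and standard deviation one.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Spectral decomposition of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    # Principal components: eigenvectors with the highest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    return Xs @ eigvecs[:, order]
```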

Apart from projecting the data into a feasible space, clustering methods group similar items together and thus refine the structured view. In general, there are two main classes of clustering algorithms: hierarchical and partitional methods. In hierarchical clustering, larger clusters are either successively split into smaller clusters, or smaller clusters are successively merged. This results in a cluster hierarchy, the dendrogram. In order to obtain a given number of clusters, the dendrogram is cut off at the appropriate height. Partitional clustering directly tries to obtain k clusters, where k usually is a parameter. The k-means algorithm falls into this category.
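As a brief illustration of the hierarchical variant, a sketch (ours, using SciPy, which the chapter does not mention) that builds a dendrogram by successive merging and then cuts it to obtain exactly five clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((100, 16))          # 100 illustrative feature vectors

# Successively merge the closest clusters into a dendrogram.
Z = linkage(X, method="average")
# Cut the dendrogram at the height that yields exactly 5 clusters.
labels = fcluster(Z, t=5, criterion="maxclust")
```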

Self-organizing maps (SOMs) [22] simultaneously provide both a non-linear projection from a high-dimensional space and a clustering of the data, including prototype vectors for each cluster. Therefore, they are very well suited to the data organization task. In contrast to PCA, MDS, and independent component analysis (ICA), which are globally tuned and attach more importance to large distances than to small details, self-organizing maps better preserve local neighborhood sets [23]. In fact, global relations may be visualized using coloring schemes, as we will demonstrate later.

It has been shown that SOMs can be effectively used for the organization of text [24–27], image [28–30], and music collections [31, 32]. In the following, we illustrate that they are also able to cover video information.

3 Organizing Information with Semantic Maps

3.1 The Self-Organizing Maps

Self-organizing maps (SOMs) [22] are artificial neural networks, well suited for clustering and visualization of high-dimensional information. In fact, they map high-dimensional data into a low-dimensional space (a two-dimensional map). The map is organized as a grid of symmetrically connected cells. During learning, similar high-dimensional objects are progressively grouped together into the cells. After training, objects that are assigned to cells close to each other in the low-dimensional space are also close to each other in the high-dimensional space. Like most clustering algorithms, SOMs operate on numerical feature vectors. Their advantage is that they are not limited to any special kind of data, since well-studied numerical descriptors can be computed for all kinds of multimedia information.

Fig. 1. Structure of a hexagonally organized self-organizing map: the basic structure is an artificial neural network with two layers. Each element of the input layer is connected to every element of the map.

The neural network structure of SOMs is organized in two layers (Fig. 1). The neurons in the input layer correspond to the input dimensions, here the components of the feature vector. The output layer (map) contains as many neurons as clusters needed. All neurons in the input layer are connected with all neurons in the output layer. The connection weights between the input and output layers of the neural network encode positions in the high-dimensional feature space. They are trained in an unsupervised manner. Every unit in the output layer represents a prototype, i.e. here the center of a cluster of similar documents.

In the traditional rectangular topology, the distance between two cells depends on whether they are adjacent vertically or horizontally, or diagonally. Therefore, our maps are based on cells organized in hexagonal form, because the distances between any two adjacent cells are then always constant on the map (see Fig. 1).

Before the learning phase of the network, the two-dimensional structure of the output units is fixed and the weights are initialized randomly. During learning, the sample vectors are repeatedly propagated through the network. The weights of the most similar prototype ws (winner neuron) are modified such that the prototype moves toward the input vector wi. The Euclidean distance or the scalar product is usually used as the similarity measure. To preserve the neighborhood relations, prototypes that are close to the winner neuron in the two-dimensional structure are also moved in the same direction. The strength of the modification decreases with the distance from the winner


neuron. Therefore, the weights ws of the winner neuron are modified according to the following equation:

∀i : w′s = ws + v(c, i) × δ × (wi − ws), (1)

where δ is a learning rate and v(c, i) is the neighborhood function. By this learning procedure, the structure in the high-dimensional sample data is non-linearly projected onto the lower-dimensional topology.
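A minimal sketch of this training loop in Python/NumPy, under our own simplifying assumptions (a Gaussian neighborhood function v(c, i), a fixed learning rate δ, and precomputed two-dimensional grid coordinates; the hexagonal layout of Fig. 1 and decay schedules are omitted):

```python
import numpy as np

def train_som(data, grid, epochs=20, delta=0.1, sigma=1.0):
    """data: (n, d) sample vectors; grid: (m, 2) 2-D cell coordinates.

    Returns the (m, d) prototype vectors (the connection weights).
    """
    rng = np.random.default_rng(0)
    weights = rng.random((len(grid), data.shape[1]))   # random initialization
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Winner neuron: the prototype most similar to the input.
            c = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Neighborhood function v(c, i) on the two-dimensional map.
            v = np.exp(-np.sum((grid - grid[c]) ** 2, axis=1) / (2 * sigma**2))
            # Move the prototypes toward the input vector, cf. (1).
            weights += delta * v[:, None] * (x - weights)
    return weights
```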

Although the application of SOMs is straightforward, a main difficulty is defining an appropriate size for the map. Indeed, the number of clusters has to be defined before starting to train the map with data. Therefore, the size of the map is usually too small or too large to map the underlying data appropriately, and the complete learning process has to be repeated several times until an appropriate size is found. Since the objective is to organize multimedia information, the desired size depends highly on the content. An extension of self-organizing maps that overcomes this problem is the growing self-organizing map [27].

3.2 The Growing Self-Organizing Map

The main idea is to start with a small map initially and then to add new units iteratively during training, until the overall error – measured, e.g., by the inhomogeneity of objects assigned to a unit – is sufficiently small. Thus the map adapts itself to the structure of the underlying data collection. The applied method restricts the algorithm to adding new units next to the external units whenever the accumulated error of a unit exceeds a specified threshold value. This approach simplifies the growing problem (reassignment and internal-topology difficulties), and it was shown in [27] that it copes well with data in both low- and high-dimensional spaces. The way a new unit is inserted is illustrated in Fig. 2. After a new unit has been added to the map, the map is re-trained. Thus, all cluster centers are adjusted and the objects are reassigned to the clusters.

xi, yi: weight vectors
xk: weight vector of the unit with the highest error
m: new unit
α, β: smoothness weights

The new weight vector xm for m is computed as:

xm = [ xk + α · (xk − yk) + Σ i=0..n, i≠k ( xi + β · (xi − yi) ) ] · 1/(n + 1)

Fig. 2. Insertion of a new unit: when the cumulated error of a cell exceeds a threshold, a new unit xm is added to the map. It is placed next to the unit with the highest error, at the border of the map.
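A sketch of this insertion formula in Python/NumPy; note that the interpretation of the yi as the weight vectors of reference (neighboring) units is our assumption, since the figure does not spell it out:

```python
import numpy as np

def new_unit_weights(x, y, k, alpha=0.5, beta=0.5):
    """Weight vector of a newly inserted unit, following Fig. 2.

    x: (n+1, d) weight vectors x_0..x_n of the existing units,
    y: (n+1, d) corresponding reference vectors y_0..y_n
       (assumed here: the neighboring units' weights),
    k: index of the unit with the highest accumulated error.
    """
    n = len(x) - 1
    others = [i for i in range(n + 1) if i != k]
    total = x[k] + alpha * (x[k] - y[k])               # error-unit term
    total += np.sum(x[others] + beta * (x[others] - y[others]), axis=0)
    return total / (n + 1)
```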


This implies that objects may change clusters, which can cause the emergence of empty clusters, i.e. clusters which "lost" their former objects to their neighbors. This happens especially in areas where the object density was already small.

3.3 Visualization

Most of the problems in visualizing multimedia content come from the vast amount of information available. Users need a lot of time to search for specific information with conventional browsing methods. Providing several connected views at different abstraction levels allows a significant time reduction. The basic idea of using self-organizing maps is to provide the user with as much information as possible on a single screen, without overwhelming him. The SOM itself serves as an overview of the entire content. It is a very powerful tool for presenting a structured data summarization to the user.

Indeed, if we deal with visual information, the most typical element of each cluster can be displayed on its cell. The user then needs methods to refine his search on a lower level, which is established by the visualization of the content of a cell on demand.

The background colors of the SOM's grid cells are used to visualize different information about the clusters. After learning, shades of green indicate the distribution of elements: the brightness of a cell depends on the number of documents assigned to it. Later, the background color indicates the similarity of the clusters to a selected object. For a thorough discussion of coloring methods for self-organizing maps we refer to [33].

When the user selects a specific object, the color of the map changes to shades of red. Here, the intensity of the color depends on the distance between the cluster centers and the currently selected document, and thus is an indicator of similarity. For instance, if we select a document that has the characteristics a and b, all the nodes with these characteristics will be colored in dark red, and the color progressively changes toward a brighter shade with increasing distance. This implies in particular that the current node will automatically be colored in dark red, since by construction all of its elements are most similar. In fact, objects that are assigned to cells close to each other in the low-dimensional space are also close to each other in the high-dimensional space. However, this does not mean that objects with a small distance in the high-dimensional space are necessarily assigned to cells separated by a small distance on the map. For instance, we can have on one side of the map a node with documents of characteristic a, and on another side those of characteristic b. Then, in one of both, let's say the a-type node, there may be a document with characteristic a, but also b. According to the visualization scheme presented above, when choosing this document with characteristics a and b, located in a node A, we will easily identify the nodes in which all the documents are rather of type b. This improves significantly the navigation possibilities provided by other clustering schemes.
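A minimal sketch of this distance-based coloring (our own formulation; the linear rescaling of distances to red intensities is an illustrative assumption, not the authors' exact scheme):

```python
import numpy as np

def cell_colors(prototypes, selected):
    """Shade each cell by its distance to the selected document.

    prototypes: (m, d) cluster centers; selected: (d,) feature vector.
    Returns (m, 3) RGB triples where dark red marks the most similar cells.
    """
    dist = np.linalg.norm(prototypes - selected, axis=1)
    # Rescale to [0, 1]: 0 for the closest cell, 1 for the farthest.
    t = (dist - dist.min()) / (np.ptp(dist) or 1.0)
    # Close cells stay dark red; distant cells fade toward a bright shade.
    return np.stack([0.4 + 0.6 * t, t, t], axis=1)
```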


4 Example: Organizing Video Data

We present a prototype that implements methods to structure and visualize video content in order to support a user in navigating within a single video. It focuses on the way video information is summarized in order to improve the browsing of its content. Currently, a common approach is to use clustering algorithms in order to automatically group similar shots and then to visualize the discovered groups in order to provide an overview of the considered video stream [34, 35]. The summarization and representation of video sequences is usually based on key frames. They are arranged in the form of a temporal list, and hierarchical browsing is then based on the clustered groups. Self-organizing maps [22] are an innovative way of representing the clusters.

Since SOMs necessitate numerical vectors, video content has to be described by numerical feature vectors that characterize it. A variety of significant characteristics has been defined for all types of multimedia information. For video documents, a plethora of visual, audio, and motion features is available [36, 37]. We rely on basic color histograms and ignore more sophisticated descriptors, since our goal is to investigate the visualization and interaction capabilities of SOMs for video structuring and navigation.

Our system is composed of feature extraction, structuring, visualization, and user interaction components (see Fig. 3). The structuring and visualization parts are based on growing SOMs that were developed in previous work and applied to other forms of interactive retrieval [27, 38]. We believe that growing SOMs are particularly well adapted to video data. The user interface was designed with the intention to provide intuitive content-based video browsing functionalities to the user. In the following, we describe every system component and the required processing steps.

4.1 Video Preprocessing/Feature Extraction

The video feature extraction component supplies the self-organizing map with numerical vectors, and therefore it forms the basis of the system. This process is shown in Fig. 4. The module consists of two parts, temporal segmentation and feature extraction.

Fig. 3. The components of our prototype. This figure illustrates the data flow from raw multimedia information to visualization and user interaction.

Fig. 4. Video feature extraction.

Temporal Segmentation

The video stream is automatically segmented into shots by detecting their boundaries. A shot is a continuous video sequence taken from one single camera. We identify shot boundaries by searching for rapid changes of the difference between color histograms of successive frames, using a single threshold. In fact, transitions from one shot to another are usually associated with significant changes between consecutive frames, while consecutive frames within a shot are very similar. Other properties that allow distance estimation between images include texture and shape features. It was shown in [39] that this approach performs rather well for detecting cuts. We use the IHS (intensity, hue, saturation) color space because of its suitable perceptual properties and the independence between the three color space components.

Falsely detected shot boundaries can be caused, for example, by more sophisticated editing effects, such as fades or dissolves, or by noisy data. A simple filtering process allows the reduction of the number of false positives, i.e. sets of two successive frames which belong to the same shot although the difference of their color histograms exceeds the given threshold. Our filter deletes shots with an insufficient number of frames (usually fewer than 5) and adds these sequences to the next actual shot. However, the number of false positives does not have a great influence on our approach, since similar shots will be assigned to the same cluster, as described in the following.
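The following sketch (ours, not the authors' implementation) illustrates both steps with OpenCV: successive frames are compared via the L1 distance of their normalized color histograms against a single threshold, and shots shorter than five frames are merged into the following shot. HSV stands in for the IHS color space here, and the threshold and bin counts are illustrative assumptions:

```python
import cv2
import numpy as np

def detect_shots(path, threshold=0.4, min_len=5):
    """Return the start-frame indices of the detected shots."""
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, frame_no = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 8, 8],
                            [0, 180, 0, 256, 0, 256]).flatten()
        hist /= hist.sum() + 1e-9
        # A rapid change between successive frame histograms marks a cut.
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(frame_no)
        prev_hist, frame_no = hist, frame_no + 1
    cap.release()
    # Filter false positives: dropping the boundary that ends a too-short
    # shot merges that shot with the next one.
    filtered = [boundaries[0]]
    for b in boundaries[1:]:
        if b - filtered[-1] >= min_len:
            filtered.append(b)
    return filtered
```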

Feature Extraction

In order to obtain a good clustering, a reasonable representation of the video segments is necessary. For each shot, one key frame is extracted (we choose the median frame of the shot) along with its color histograms. Apart from a global color histogram, histograms for the top, bottom, left, and right regions of the image are also computed. The self-organizing map is trained with a vector merging all partial histogram vectors, which is then used to define each shot.
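As an illustration, a sketch of building the merged feature vector from a key frame with NumPy; the region layout follows the text, while the bin count is our illustrative assumption:

```python
import numpy as np

def shot_feature(key_frame, bins=8):
    """Concatenate global, top, bottom, left, and right color histograms.

    key_frame: (h, w, 3) color image (the shot's median frame).
    Returns a 1-D vector used to train the self-organizing map.
    """
    h, w, _ = key_frame.shape
    regions = [key_frame,                                  # global
               key_frame[: h // 2], key_frame[h // 2:],    # top, bottom
               key_frame[:, : w // 2], key_frame[:, w // 2:]]  # left, right
    parts = []
    for region in regions:
        hist, _ = np.histogramdd(region.reshape(-1, 3).astype(float),
                                 bins=(bins,) * 3, range=((0, 256),) * 3)
        parts.append(hist.ravel() / hist.sum())            # normalize
    return np.concatenate(parts)
```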

Similarity Between Shots

As in any clustering algorithm, the main problem is how to model the similarity between the objects that are going to be grouped into one cluster. We model the difference between two video sequences by the Euclidean distance of the two vectors that were extracted from them. However, this distance does not necessarily correspond to the dissimilarity perceived by a human. In addition, these features represent only a small part of the video content, so there remains a semantic gap between the video content and what we see on the map.

We are mainly interested in organizing the video data. For this purpose, SOMs assist the user by structuring the content based on visual similarity. However, we cannot guarantee that the shots are grouped semantically.

4.2 Visualization

In addition to the general problem of the vast amount of information available, video information includes a temporal aspect that makes traditional search and browsing even less effective. Our system represents a video shot by a single key frame and constructs higher-level aggregates of shots. The user has the possibility to browse the content in several ways. We combine elements providing information on three abstraction levels in a single interface, as shown in Fig. 5. First, there is an overview of the whole content provided by the self-organizing map window. On each cell, the most typical key frame of a cluster is displayed. The second level consists of a combined content-based and time-based visualization. Furthermore, a list of shots is provided for each grid cell, and a control derived from the time bar control helps to identify content that is similar to the currently selected shot.

Self-Organizing Map Window

The self-organizing map window (see Fig. 6) contains the visual representation of the SOM. The clusters are represented by hexagonal nodes. The most typical key frame of the cluster, i.e. the key frame which is closest to the cluster center, is displayed on each node. If no shots are assigned to a specific node, no picture is displayed. These empty clusters emerge during the learning phase, as described earlier.

After this first display, a click on a cell opens a list of shots assigned to the specific cell (see Sect. 4.2). The user can then select a specific shot from the list. In other words, from the user interaction perspective the map is limited to the following actions: selecting nodes and communicating cluster assignment and color information to the time bar. Nevertheless, it is a very powerful tool, especially useful for presenting a structured summarization of the video to the user.

Fig. 5. Screenshot of the interface: the player in the top left corner provides video access on the lowest interaction level. The time bar and shot list provide an intermediate level of summarized information, while the growing self-organizing map on the right represents the highest abstraction level. The selected shot is played and its temporal position is indicated on the time bar, whose black extensions correspond to the content of the selected cell (marked with black arrows).

Player and Shot List

The player is an essential part of every video browsing application. Since the video is segmented into shots, functionalities were added especially for the purpose of playing the previous and next shots.

A shot list window showing all key frames assigned to a cell (Fig. 5) is added to the interface every time a user selects a node from the map. Multiple shot lists for different nodes can be open at the same time, each shot being represented by its key frame. These key frames correspond to the actually selected node in the self-organizing map, as described in Sect. 4.2. When clicking on one of the key frames, the system plays the corresponding shot in the video. The button for playing the current node is a special control, which results in a consecutive play operation of all shots corresponding to the selected node, starting with the first shot. This adds another temporal visualization method of the segmented video.

Fig. 6. Growing self-organizing map. (a) After training: the brightness of a cell indicates the number of shots assigned to each node. On each node the key frame of the shot with the smallest difference to the cluster center is displayed. (b) After a shot has been selected: the brightness of a cell indicates the distance between each cluster center and the key frame of the chosen shot. Notice that sequences in adjacent cells are similar, as intended.

Time Bar

The time bar of our prototype (Fig. 7) reintroduces the temporal aspect into the interface, which is ignored by the SOM. The colors of the self-organizing map are projected onto the temporal axis. With this approach, it is possible to see within the same view the information about the similarity of key frames and the corresponding temporal information. A green double arrow displays the current temporal position within the video. Additionally, there are black extensions on the time bar at the places where the shots of the selected node can be found. This cell can differ from the cluster of the currently selected shot, in which case the black bars correspond to the selected cluster while the color scheme is based on the selected shot from another cluster. This enables the comparison of a family of similar shots with a cluster.

There are two interaction possibilities with our time bar. By clicking once on any position, the system plays the corresponding shot. Clicking twice forces the self-organizing map to change the currently selected node to the one corresponding to the chosen frame; the background color scheme of the map is then recomputed.


Fig. 7. The time bar control provides additional information. The brightness of the color indicates the distribution of similar sequences on the time scale. Around the time bar, black blocks visualize the temporal positions of the shots assigned to the currently selected node. Finally, the two arrows point out the actual player position.

Fig. 8. User interactions. All listed elements are visible to the user on one single screen and always accessible, thus providing a summarization on all layers at the same time.

4.3 User Interaction

The four components presented above are integrated into one single screen (Fig. 5), providing a structured view of the video content. The methods for user interaction are hierarchically organized (Fig. 8). The first layer is represented by the video viewer. The shot lists and time bar visualize the data on the second layer. The self-organizing map provides the highest abstraction level.

The self-organizing map is situated in the third layer. The user can select nodes and retrieve their content, i.e. the list of corresponding key frames. The time bar is automatically updated by visualizing the temporal distribution of the corresponding shots when the current node is changed. Thus, a direct link from the third to the second layer is established. Furthermore, after a certain shot has been selected, the user also views the temporal distribution of similar shots inside the whole video on the time bar. In the other direction, selecting shots using either the time bar or the list of key frames causes the map to recompute the similarity values for its nodes and to change the selected node. The color of the grid cells is computed based on the distance of their prototypes to the selected shot. The same colors are used inside the time bar. Once the user has found a shot of interest, he can easily browse through similar shots using the color indication on the time bar or the map.

Notice that the first layer cannot be accessed directly from the third layer. Different play operations are activated by the time bar and the shot lists. The player itself gives feedback about its current position to the time bar, which is usually updated when the current shot changes.

All visualization components are highly interconnected. In contrast to other multi-layer interfaces, the user can always use all provided layers simultaneously within the same view. He can select nodes from the map, key frames from the list or from the time bar, or even nodes from the time bar by double-clicking.

5 Conclusions

The organization of multimedia information is a complex and challenging task. In this chapter, we proposed the use of growing self-organizing maps to assist the user in his browsing and information retrieval task. On the one hand, self-organizing maps efficiently structure the content based on any given similarity measure. On the other hand, although no perfect (semantic) similarity measure for multimedia documents exists, and although this uncertainty remains under any form of visualization, coloring schemes for self-organizing maps make it easy to localize documents that are similar to a given query example.

We illustrated the efficiency of SOMs with a prototypical content-based video navigation system. Our interface allows the user to interact with the video content from two perspectives: the temporal as well as the content-based representation. In fact, ignoring the temporal aspect during clustering enhances the quality of the organization by similarity. The temporal aspects are visually re-linked using similar colors. Three hierarchically connected abstraction levels facilitate the user's navigation.

The combination of innovative visualization and interaction methodsallows efficient exploration of relevant information in multimedia content.

References

1. O'Reilly, T.: What Is Web 2.0? Design Patterns and Business Models for the Next Generation of Software. http://www.oreillynet.com/ (last visited April 5, 2007)


2. Flickr. http://www.flickr.com/ (last visited April 5, 2007)

3. MySpace. http://www.myspace.com/ (last visited April 5, 2007)

4. YouTube. http://www.youtube.com/ (last visited April 5, 2007)

5. Bade, K., De Luca, E.W., Nurnberger, A.: Multimedia retrieval: Fundamental techniques and principles of adaptivity. KI: German Journal on Artificial Intelligence 18 (2004) 5–10

6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Computer Networks 30 (1998) 107–117

7. Bach, J.R., Fuller, C., Gupta, A., Hampapur, A., Horowitz, B., Humphrey, R., Jain, R., Shu, C.F.: Virage image search engine: an open framework for image management. In Sethi, I.K., Jain, R.C., eds.: Proc. SPIE. Volume 2670 (1996) 76–87

8. Pentland, A., Picard, R., Sclaroff, S.: Photobook: content-based manipulation of image databases. International Journal of Computer Vision 18 (1996) 233–254

9. Flickner, M., Sawhney, H.S., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., Yanker, P.: Query by image and video content: The QBIC system. IEEE Computer 28 (1995) 23–32

10. Carson, C., Thomas, M., Belongie, S., Hellerstein, J., Malik, J.: Blobworld: A system for region-based image indexing and retrieval. In: Third International Conference on Visual Information Systems. Springer, Berlin Heidelberg New York (1999) 509–516

11. Omhover, J.F., Detyniecki, M., Bouchon-Meunier, B.: A region-similarity-based image retrieval system. In Bouchon-Meunier, B., Coletti, G., Yager, R., eds.: Modern Information Processing: From Theory to Applications. Elsevier, Amsterdam (2005)

12. Natsev, A., Rastogi, R., Shim, K.: WALRUS: A similarity retrieval algorithm for image databases. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 310–316

13. Wang, J., Li, J., Wiederhold, G.: SIMPLIcity: semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 947–963

14. Rui, Y., Huang, T., Mehrotra, S.: Content-based image retrieval with relevance feedback in MARS. In: Proceedings of the International Conference on Image Processing (1997)

15. Kim, D., Chung, C.: QCluster: relevance feedback using adaptive clustering for content-based image retrieval. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA, ACM Press (2003) 599–610

16. Campbell, M., Haubold, A., Ebadollahi, S., Joshi, D., Naphade, M.R., Natsev, A., Seidl, J., Smith, J.R., Scheinberg, K., Tesic, J., Xie, L.: IBM Research TRECVID-2006 video retrieval system. In: NIST TRECVID-2006 Workshop (2006)

17. Worring, M., Snoek, C., de Rooij, O., Nguyen, G., Smeulders, A.: The MediaMill semantic video search engine. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2007)

18. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and TRECVid. In: MIR '06: Proceedings of the Eighth ACM International Workshop on Multimedia Information Retrieval, New York, NY, USA, ACM Press (2006) 321–330


19. Hauptmann, A., Yan, R., Lin, W.H.: How many high-level concepts will fill the semantic gap in news video retrieval? In: Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR (2007)

20. Fodor, I.K.: A survey of dimension reduction techniques. Technical Report, Lawrence Livermore National Laboratory (2002)

21. Burges, C.J.: Geometric methods for feature extraction and dimensional reduction: A guided tour. Technical Report, Microsoft Research (2004)

22. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin Heidelberg New York (1995)

23. Kaski, S.: Data Exploration Using Self-Organizing Maps. PhD thesis, Helsinki University of Technology (1997)

24. Lin, X., Marchionini, G., Soergel, D.: A self-organizing semantic map for information retrieval. In: Proceedings of the 14th International ACM/SIGIR Conference on Research and Development in Information Retrieval, New York, ACM Press (1991) 262–269

25. Kohonen, T., Kaski, S., Lagus, K., Salojarvi, J., Honkela, J., Paattero, V., Saarela, A.: Self organization of a massive document collection. IEEE Transactions on Neural Networks 11 (2000) 574–585

26. Roussinov, D.G., Chen, H.: Information navigation on the web by clustering and summarizing query results. Information Processing & Management 37 (2001) 789–816

27. Nurnberger, A., Detyniecki, M.: Visualizing changes in data collections using growing self-organizing maps. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2002), IEEE (2002) 1912–1917

28. Laaksonen, J., Koskela, M., Oja, E.: PicSOM – self-organizing image retrieval with MPEG-7 content descriptors. IEEE Transactions on Neural Networks 13 (2002) 841–853

29. Koskela, M., Laaksonen, J.: Semantic annotation of image groups with self-organizing maps. In: Leow, W.K., Lew, M.S., Chua, T.S., Ma, W.Y., Chaisorn, L., Bakker, E.M., eds.: Proceedings of the Fourth International Conference on Image and Video Retrieval (CIVR 2005). Volume 3568 of Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg New York (2005) 518–527

30. Nurnberger, A., Klose, A.: Improving clustering and visualization of multimedia data using interactive user feedback. In: Proceedings of the Ninth International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (2002) 993–999

31. Pampalk, E., Rauber, A., Merkl, D.: Content-based organization and visualization of music archives. In: MULTIMEDIA '02: Proceedings of the Tenth ACM International Conference on Multimedia, New York, NY, USA, ACM Press (2002) 570–579

32. Knees, P., Schedl, M., Pohle, T., Widmer, G.: An innovative three-dimensional user interface for exploring music collections enriched with meta-information from the web. In: ACM Multimedia, Santa Barbara, CA, USA (2006)

33. Vesanto, J.: SOM-based data visualization methods. Intelligent Data Analysis 3 (1999) 111–126

34. Lee, H., Smeaton, A.F., Berrut, C., Murphy, N., Marlow, S., O'Connor, N.E.: Implementation and analysis of several keyframe-based browsing interfaces to digital video. In: Borbinha, J., Baker, T., eds.: LNCS. Volume 1923 (2000) 206–218

Page 513: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Organizing Multimedia Information with Maps 509

35. Girgensohn, A., Boreczky, J., Wilcox, L.: Keyframe-based user interfaces fordigital video. Computer 34 (2001) 61–67

36. Marques, O., Furht, B.: Content-Based Image and Video Retrieval. Kluwer,Norwell, MA (2002)

37. Veltkamp, R.C., Burkhardt, H., Kriegel, H.P.: State-of-the-Art in Content-Based Image and Video Retrieval. Kluwer, Norwell, MA (2001)

38. Nurnberger, A., Detyniecki, M.: Adaptive multimedia retrieval: From data touser interaction. In: Strackeljan, J., Leiviska, K., Gabrys, B., eds.: Do SmartAdaptive Systems Exist – Best Practice for Selection and Combination of Intel-ligent Methods. Springer-Verlag, Berlin Heildelberg New York (2005)

39. Browne, P., Smeaton, A.F., Murphy, N., O’Connor, N., Marlow, S., Berrut, C.:Evaluating and combining digital video shot boundary detection algorithms. In:Proceedings of Irish Machine Vision and Image Processing Conference, Dublin(2000)

Page 514: [Studies in Computational Intelligence] Computational Intelligence in Multimedia Processing: Recent Advances Volume 96 ||

Video Authentication Using Relative Correlation Information and SVM

Mayank Vatsa1, Richa Singh1, Sanjay K. Singh2, and Saurabh Upadhyay2

1 West Virginia University, [email protected], [email protected]

2 Purvanchal University, [email protected]

Summary. Video is often presented as evidence in criminal cases; therefore, the authenticity of video data is of paramount interest. This chapter presents an intelligent video authentication algorithm using a support vector machine. The proposed algorithm does not require the computation and storage of a secret key or the embedding of a watermark. It computes local relative correlation information and classifies the video as tampered or non-tampered. The performance of the proposed algorithm is not affected by acceptable video processing operations such as compression and scaling, and it effectively classifies tampered videos. On a database of 795 videos, the proposed algorithm outperforms the existing algorithm by 18.5%.

1 Introduction

In today's digital era, communication and compression techniques facilitate the sharing of multimedia data such as images and video. However, multimedia editing tools can be used to efficiently and seamlessly alter the content of digital data, thus compromising its reliability. In some applications, the reliability of video data is of paramount interest, such as in video surveillance, forensics, law enforcement, and content ownership. For example, in a court of law, it is important to establish the trustworthiness of any video that is used as evidence. Video authentication is thus a process which ascertains that the content of a given video is authentic and exactly the same as when it was captured. It also detects the type and location of malicious tampering. To accomplish this task automatically, several algorithms have been proposed which extract unique and resilient features from video and generate authentication data. This authentication data is then used to establish the authenticity of the video content.

There are several possible attacks that can be applied to alter the contents of a video. These attacks can be classified into five classes.

Fig. 1. Example of the frame addition attack. Top row shows the original frame sequence with frames 10 and 18. Bottom row shows the frame sequence after the attack, in which a new frame is inserted between 10 and 18 and frame 18 becomes frame 19

1. Frame addition attack. In the frame addition attack, additional frames are deliberately inserted at some position in a given video. This attack is intended to camouflage the actual content and provide incorrect information. A simple example of the frame addition attack is shown in Fig. 1.

2. Frame removal attack. In the frame removal attack, frames are intentionally removed from the video. This attack is common in criminal investigations where an intruder wants to remove his/her presence from a surveillance video. Figure 2 shows an example of the frame removal attack.

3. Frame shuffling attack. In the frame shuffling attack, the frames of a video are shuffled so that the correct frame sequence is intermingled. Figure 3 shows an example in which two frames are shuffled.

4. Frame alteration attack. In the frame alteration attack, objects in a frame are modified, for example by object addition or alteration. Figure 4 shows an object alteration attack where a human figure is inserted. Figure 5 shows another example of the object alteration attack, in which an object on a wall is removed.

5. Other attacks. Image and video processing operations such as noise addition, blurring, and specular reflection addition can also be used to tamper with the content of the video. Further, a combination of any two or more attacks can be used to alter the content of the video data.

Other than these attacks, there are some image and video processing operations, such as compression and scaling, which may affect the content and properties of the video data.

Fig. 2. Example of the frame removal attack. Top row shows the original frame sequence with frames 10, 18, and 26. Bottom row shows the frame sequence after the removal attack, in which frame 18 is removed from the video and hence frame 26 becomes frame 25

Fig. 3. Example of the frame shuffling attack. Top row shows the original frame sequence with frames 10, 18, and 26. Bottom row shows the frame sequence after the shuffling attack, in which the positions of frame 10 and frame 26 have been interchanged

Fig. 4. Example of the frame alteration attack. The first frame is the original frame and the second frame has been altered by inserting a human figure in the frame

Fig. 5. Example of the frame alteration attack. The first frame is the original frame and the second frame has been altered by removing the object hanging on the wall

However, these operations are acceptable and are not considered tampering. Video authentication algorithms should therefore be able to differentiate between intentional tampering and acceptable operations. We next present a brief literature review of existing video authentication algorithms.

1.1 Literature Review

Existing video authentication algorithms can be broadly classified into three categories: digital signature based authentication methods, watermarking based authentication methods, and other video authentication methods. In digital signature based authentication schemes, the authentication data is stored separately, either in a user defined field such as the header of an MPEG sequence or in a separate file, whereas watermarking embeds the authentication data into the primary multimedia source. Researchers have also proposed several other authentication techniques apart from digital signature and watermarking based techniques.

For authenticating multimedia data, a digital signature based algorithm was first introduced by Diffie and Hellman [4]. In [22], Wohlmacher proposed a digital signature based authentication algorithm which depends on the content of the video and secret information that is known only to the signer. This method is used to verify the integrity of multimedia data which is endorsed by the signer. Lin and Chang [10] proposed two robust digital signature based algorithms for authenticating video in different kinds of situations. The first algorithm is used in situations where the group of pictures (GOP) structure of the video is not modified. The second algorithm operates when the GOP structure is modified but the pixel values are preserved.

Celik et al. [2] proposed an authentication algorithm in which a digital signature is generated from image blocks and this digital signature is used as the watermark. Ditmann [5] and Queluz [14] used edge features of the image to generate the digital signature. These algorithms are robust to high quality compression and scaling, but their performance depends on the edge detection algorithm and they are computationally expensive. Lu and Liao [11] proposed a structural digital signature based authentication algorithm which can resist incidental manipulations. Further, Bhattacharjee and Kutter [1] proposed an algorithm to generate a digital signature by encrypting the feature point positions in an image/video. In this approach, videos are authenticated by comparing the positions of the feature points extracted from the targeted image with those decrypted from the previously encrypted digital signature. Other video authentication algorithms based on digital signatures can be found in [13] and [20].

Another widely used video authentication approach is watermarking, in which a watermark is embedded in the multimedia data imperceptibly, without changing the video content. In watermarking, any manipulation of the watermarked data also changes the content of the embedded watermark. Watermarking based authentication algorithms examine the variations in the extracted watermark to verify the integrity of multimedia data. Mobasseri and Evans [12] proposed a frame-pair concept based watermarking algorithm in which information from one video frame is watermarked in another frame using a specific sequence and a key. Cross and Mobasseri [3] further proposed a watermarking based authentication algorithm for compressed videos. Yin and Yu [24] proposed an authentication algorithm for MPEG videos in which the authentication data is embedded at the GOP level. An object based watermarking scheme for video authentication was proposed by He et al. [6], in which background features are used as the watermark and foreground objects are used as cover data. In [17] and [18], an error correcting code based watermarking algorithm is used to perform end-to-end video authentication. Other watermarking algorithms for video authentication can be found in [7] and [19].

Apart from digital signature and watermarking based algorithms, another algorithm for digital video authentication is proposed in [23], in which motion trajectory and cryptographic secret sharing techniques are used. In this algorithm, different shots are segmented from a given video and the key frames in a shot are selected based on the motion trajectory. A secret frame is constructed and used as the secret key of a particular shot. A master key for the entire video is then generated using the different secret keys computed for all the shots. The authenticity of a video is determined using the computed master key. Similar approaches have been proposed in [9] and [25]. Quisquater [15] proposed a video authentication algorithm in which special hash functions are used to authenticate edited video.

Table 1. Challenges with the existing video authentication algorithms

Category            References                  Challenges
--------------------------------------------------------------------------------
Digital signature   [1, 2, 4, 5, 10, 11,        If the digital signature is compromised,
                    13, 14, 20, 22]             then it is easy to deceive the
                                                authentication system
Watermarking        [3, 6, 7, 12, 17–19, 24]    Embedding may alter the content of the
                                                video, which is not permissible in the
                                                court of law
Other               [9, 15, 23, 25]             These algorithms are tailored for
                                                specific attacks only

There are different challenges with the existing video authentication approaches. Table 1 illustrates the main issues with the existing algorithms. With digital signature based algorithms, if the location where the digital signature is stored is compromised, then anyone can deceive the authentication system. With watermarking based approaches, inconsequential information may be altered because these algorithms embed a watermark in the video data; however, in a court of law this alteration leads to disqualification of the video as evidence. Other authentication techniques are adapted to detect specific attacks only. For example, the motion trajectory based algorithm [23] only detects frame addition or deletion attacks. Moreover, existing algorithms are also affected by compression and scaling operations. To address these challenges we propose an effective video authentication algorithm which computes salient local information in digital video frames and establishes a relationship among the frames. This relationship is termed the relative correlation information and is used to authenticate the video data. A support vector machine (SVM) [21] based learning algorithm is then used to classify the video as tampered or non-tampered. The proposed algorithm does not require the computation and storage of any key or the embedding of secret information in the video data. The algorithm uses inherent video information for authentication, thus making it useful for real world applications. The algorithm is validated using a database of 795 tampered and non-tampered videos, and the results show a classification accuracy of 99.2%. Section 2 presents a brief overview of SVM, and the proposed algorithm is described in Sect. 3. Experimental results and discussion are summarized in Sect. 4.

2 Overview of Support Vector Machine

The support vector machine, proposed by Vapnik [21], is a powerful methodology for solving problems in non-linear classification, function estimation, and density estimation [16]. SVM starts from the goal of separating the data with a hyperplane and extends this to non-linear decision boundaries. SVM is thus a classifier that performs classification tasks by constructing hyperplanes in a multidimensional space that separate the data points into different classes. To construct an optimal hyperplane, SVM uses an iterative training algorithm to maximize the margin between the two classes [16]. The remainder of this section describes the mathematical formulation of SVM.

Let $\{x_i, y_i\}$ be a set of $N$ data vectors with $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$, and $i = 1, \ldots, N$, where $x_i$ is the $i$th data vector belonging to the binary class $y_i$. The generalized decision function can be written as

$$f(x) = \sum_{i=1}^{N} w_i \varphi_i(x) + b = W\varphi(x) + b, \qquad (1)$$

where $\varphi_i(x)$ is a non-linear function representing hidden nodes and $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_N(x)]^T$. To obtain a non-linear decision boundary which enhances the discrimination power, we can rewrite the above equation as

$$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b. \qquad (2)$$

Here $K(x, x_i)$ is the non-linear kernel that enhances the discrimination power and $\alpha_i$ is the Lagrangian multiplier. The basic idea behind the non-linear SVM is to use a kernel function $K(x, x_i)$ to map the input space to a feature space in which the mapped data becomes linearly separable. One example of such a kernel is the RBF kernel

$$K(x, x_i) = \exp\left(-\gamma \|x - x_i\|^2\right), \quad \gamma > 0, \qquad (3)$$

where $x$ and $x_i$ represent the input vectors and $\gamma$ is the RBF parameter. The Lagrange multipliers $\alpha_i$ are determined by maximizing $L(\alpha)$ subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, \ldots, N$, where

$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (4)$$

and $C$ is the factor used to control the violation of the safety margin rule. Additional details of SVM can be found in [21].
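
To make this formulation concrete, the following is a minimal sketch of the classification stage, assuming scikit-learn's SVC as a stand-in for the chapter's SVM implementation [21]; the feature matrix here is hypothetical random data, and the gamma and C values simply mirror their roles in Eqs. (3) and (4).

```python
# Minimal sketch of RBF-kernel SVM classification with scikit-learn
# (an assumed substitute for the chapter's own SVM implementation).
import numpy as np
from sklearn.svm import SVC

# X: one relative-correlation feature vector per video (hypothetical data);
# y: +1 for non-tampered, -1 for tampered, as labeled in Sect. 3.1.
X = np.random.rand(20, 299)          # e.g., 300-frame videos -> 299 RC values
y = np.array([1] * 10 + [-1] * 10)

# RBF kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2), cf. Eq. (3);
# C bounds the multipliers alpha_i in the dual problem of Eq. (4).
clf = SVC(kernel="rbf", gamma=4.0, C=1.0)
clf.fit(X, y)

# decision_function returns f(x) of Eq. (2); its sign gives the class.
print(clf.decision_function(X[:3]))
print(clf.predict(X[:3]))
```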

3 Proposed Video Authentication Algorithm

As discussed earlier, the common attacks used to tamper with a video are frame removal, frame addition, frame shuffling, and frame alteration. In this chapter, we focus on three of these: the frame addition, removal, and shuffling attacks. However, the proposed algorithm can handle all types of malicious attacks. Since we use an SVM based learning and classification technique, it can also differentiate between attacks and acceptable operations. Figure 6 illustrates the concept of the proposed algorithm. The proposed video authentication algorithm computes the correlation information between two video frames. This information is computed locally using a corner detection algorithm [8], and classification is then performed using a support vector machine [21]. The algorithm is divided into two stages: (1) SVM training and (2) tamper detection and classification using SVM.

3.1 SVM Training

The first step in the proposed algorithm is to train the SVM so that it can classify tampered and non-tampered video data. Training is performed using a manually labeled training video database. If a video in the training data is tampered, then it is assigned the label −1; otherwise (if it is not tampered) the label is +1. From the training videos, the relative correlation information is extracted. This labeled information is then used as input to the SVM, which performs learning and generates a non-linear hyperplane that can classify a video as tampered or non-tampered. The steps involved are explained in the Training Algorithm below, followed by an illustrative code sketch.

(Figure: video frames → local relative correlation → SVM classification → tampered/non-tampered)
Fig. 6. Block diagram of the proposed video authentication algorithm

Training Algorithm

Input: Labeled training video data.
Output: Trained SVM with a non-linear hyperplane to classify tampered and non-tampered videos.
Algorithm:

1. Individual frames are obtained from the video data.
2. Corner points are computed from the first and second frames of the video using the corner detection algorithm of [8].

3. The numbers of corner points in these two frames may differ, so an optimal set of corresponding corner points is computed using a local correlation technique. In the first and second frames, windows of size 11×11 and 15×15 pixels, respectively, are chosen around the corner points. The window size for the second frame is larger in order to provide tolerance to errors which may occur during corner point computation.

4. One-to-many local correlation is performed on the two frames: every window of the first frame is correlated with every window of the second frame. The window pairs that provide the maximum correlation are then selected.

5. To handle incorrect corner pairs, we select only those pairs that have similar coordinate positions and a correlation value greater than 0.6.

6. Let the local correlations between the two frames be Li, where i = 1, 2, ..., m and m is the number of corresponding corner points in the two frames. We define the relative correlation information RCjk between two video frames j and k as

$$RC_{jk} = \frac{1}{m} \sum_{i=1}^{m} L_i. \qquad (5)$$

7. Similarly to Steps 2–6, the relative correlation information RCjk is computed for all adjacent frame pairs of the video, i.e., RC12, RC23, RC34, and so on. This relative correlation information is combined to form a column vector of size (n − 1) × 1, where n is the number of frames in the video.

8. Steps 1–7 are performed on all the labeled training videos and the relative correlation information RC is computed for each video.

9. The relative correlation information and labels of all the training videos are provided as input to the support vector machine.

10. The SVM [21] is trained to classify the tampered and non-tampered data. The output of SVM training is a trained hyperplane separating the tampered and non-tampered classes.
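
The following is a hedged sketch of Steps 2–7 in Python. OpenCV's goodFeaturesToTrack stands in for the phase-congruency corner detector of [8]; the window sizes and the 0.6 threshold follow the steps above, while max_shift (the "similar coordinate position" tolerance of Step 5) is a hypothetical parameter not fixed by the text.

```python
# Sketch of relative correlation information between adjacent frames.
import cv2
import numpy as np

def _patch(img, y, x, half):
    """Extract a (2*half+1)-square patch, or None near the border."""
    h, w = img.shape
    if y - half < 0 or x - half < 0 or y + half >= h or x + half >= w:
        return None
    return img[y - half:y + half + 1, x - half:x + half + 1]

def _ncc(small, large):
    """Peak normalized cross-correlation of patch `small` inside `large`."""
    res = cv2.matchTemplate(large.astype(np.float32),
                            small.astype(np.float32), cv2.TM_CCOEFF_NORMED)
    return float(res.max())

def relative_correlation(frame_j, frame_k, max_corners=100, max_shift=10):
    """RC_jk of Eq. (5) for two adjacent BGR frames (Steps 2-6)."""
    gj = cv2.cvtColor(frame_j, cv2.COLOR_BGR2GRAY)
    gk = cv2.cvtColor(frame_k, cv2.COLOR_BGR2GRAY)
    cj = cv2.goodFeaturesToTrack(gj, max_corners, 0.01, 5)  # stand-in for [8]
    ck = cv2.goodFeaturesToTrack(gk, max_corners, 0.01, 5)
    if cj is None or ck is None:
        return 0.0
    corrs = []
    for xj, yj in cj.reshape(-1, 2):
        pj = _patch(gj, int(yj), int(xj), 5)            # 11x11 window (Step 3)
        if pj is None:
            continue
        best = 0.0
        for xk, yk in ck.reshape(-1, 2):
            # Step 5: keep only corner pairs at similar coordinate positions.
            if abs(xk - xj) > max_shift or abs(yk - yj) > max_shift:
                continue
            pk = _patch(gk, int(yk), int(xk), 7)        # 15x15 window (Step 3)
            if pk is None:
                continue
            best = max(best, _ncc(pj, pk))              # Step 4: keep maximum
        if best > 0.6:                                  # Step 5 threshold
            corrs.append(best)
    # Eq. (5): RC_jk is the mean local correlation over the m retained pairs.
    return float(np.mean(corrs)) if corrs else 0.0

def rc_vector(frames):
    """Step 7: the (n-1) x 1 vector [RC_12, RC_23, ...] for a frame list."""
    return np.array([relative_correlation(frames[i], frames[i + 1])
                     for i in range(len(frames) - 1)])
```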

3.2 Tamper Detection and Classification Using SVM

We now describe the proposed tamper detection and classification algorithm. The input to the tamper detection algorithm is a video whose authenticity needs to be established. As in the training algorithm, the relative correlation information between frames is extracted, and the trained SVM is used to classify the video. If the SVM classifies the input video as tampered, then the location of the tampering is computed. The steps of the tamper detection algorithm are described below, followed by a sketch of the localization step.

Tamper Detection

Input: Unlabeled video data.
Output: Classification of the video as tampered or non-tampered.
Algorithm:

1. Compute the relative correlation information RC for the input video using Steps 1–7 of the training algorithm.

2. The relative correlation information of the input video is projected onto the SVM hyperplane to classify the video as tampered or non-tampered. Following the label convention of Sect. 3.1, if the output of the SVM is less than zero the input video is classified as tampered; otherwise it is non-tampered.

3. If the video is classified as tampered, then we determine the particular frames of the video that have been tampered with.

4. Plot the relative correlation information RCjk of all the adjacent frames of the video, where j = 1, 2, ..., n − 1 and k = 2, 3, ..., n.

5. The correlation values showing the maximum deviation in the plot are the values corresponding to the tampered frames.
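
To make Steps 4 and 5 concrete, here is an illustrative Python sketch of the localization logic. The median-absolute-deviation threshold is our assumption; the chapter only states that the maximally deviating RC values mark the tampered frames.

```python
# Sketch of tamper localization from the RC plot (Steps 4-5).
import numpy as np

def locate_tampered_frames(rc, n_mads=3.0):
    """rc[i] is RC between frames i+1 and i+2 (1-based frame numbering)."""
    rc = np.asarray(rc, dtype=float)
    med = np.median(rc)
    mad = np.median(np.abs(rc - med)) + 1e-9
    low = np.where(med - rc > n_mads * mad)[0]      # strongly deviating dips
    suspects = set()
    for i in low:
        # A dip between frames i+1 and i+2 implicates both; a frame shared
        # by two adjacent dips (as with frame 11 in Fig. 7) is the culprit.
        suspects.update((i + 1, i + 2))
    return sorted(suspects)

# Example: a clean video has RC around 0.8; an inserted frame at position 11
# pulls RC_{10,11} and RC_{11,12} down toward 0.2 (cf. Fig. 7).
rc = [0.8] * 49
rc[9] = rc[10] = 0.2
print(locate_tampered_frames(rc))   # -> frames around position 11
```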

Figure 1 showed frames from a tampered video that was subjected to a frame addition attack, with a new frame inserted at position 11. Figure 7 shows the plot of the relative correlation values for the first 50 frames of this tampered video. The plot shows that the relative correlation values between the 10th and 11th frames and between the 11th and 12th frames are significantly lower than the relative correlation values between the other frames. Since frame 11 is the common frame in both low relative correlation values, frame 11 is detected as the tampered frame.

(Figure: relative correlation information vs. frame number)
Fig. 7. Plot of the relative correlation information of a tampered video in which the 11th frame has been tampered with

4 Experimental Results and Discussion

The proposed tamper detection algorithm is validated using a video database containing 25 videos. The experimental protocol for validation is as follows:

1. The video database contains 25 original non-tampered videos with 300 frames each, captured at 15 fps. This video data is used as the ground truth. For each of the 25 videos, different copies are created by subjecting them to different video tampering attacks. Details of the database are provided below:
   • For each video, 20 copies are created with the frame dropping attack, in which 1–20 frames have been dropped at random positions.
   • Twenty copies of each video are created for the frame shuffling attack, in which the positions of two or more frames in the video are shuffled.
   • For the frame addition attack, we first chose a video other than the 25 videos in the database. Frames of this additional video are inserted at random positions in the database videos to generate 20 tampered copies of each ground truth video.
   • We thus have 25 ground truth videos, 500 videos with the frame dropping attack, 500 videos with the frame shuffling attack, and 500 videos with the frame addition attack.

2. Ten videos from the ground truth and 750 tampered videos are used to train the support vector machine. These 750 tampered videos contain 250 videos from each of the three attacks.

3. The remaining 15 ground truth videos and 750 tampered videos are used as the probe database to determine the performance of the proposed algorithm.

4. From the 15 ground truth videos of the probe database, 30 copies are generated by applying MPEG compression and a scaling operation. MPEG compression reduces the size of the video by 75%, whereas the scaling operation reduces the dimensions of the video by 50%. This dataset is used to evaluate the performance for acceptable video processing operations. These videos are also treated as non-tampered because the content of the video is intact. Thus, there are 45 non-tampered videos for performance evaluation.

With this experimental protocol, we evaluated the performance of the proposed video authentication algorithm. All computations were performed on a P-IV 3.2 GHz computer with 1 GB RAM in the MATLAB programming environment. The RBF parameter used in the proposed algorithm was computed empirically using the 760 training videos. During SVM training, we first set γ = 1 and computed the classification accuracy. We then increased the value of γ to 2, 3, 4, 5, and 6 and computed the classification accuracy for each value. The value γ = 4 yields the maximum classification accuracy, so we used γ = 4 for classification on the probe data.
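
The following is a small sketch of this empirical γ selection as a grid search, assuming scikit-learn and hypothetical training/validation splits (X_train, y_train, X_val, y_val) of the labeled videos; the candidate values mirror those tried above.

```python
# Sketch of the empirical gamma selection: train one RBF-SVM per candidate
# gamma and keep the value with the highest validation accuracy.
from sklearn.svm import SVC

def pick_gamma(X_train, y_train, X_val, y_val, candidates=(1, 2, 3, 4, 5, 6)):
    best_gamma, best_acc = None, -1.0
    for gamma in candidates:
        clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
        acc = (clf.predict(X_val) == y_val).mean()  # fraction correct
        if acc > best_acc:
            best_gamma, best_acc = gamma, acc
    return best_gamma, best_acc     # the chapter reports gamma = 4 as best
```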

Table 2. Classification results of the proposed video authentication algorithm for tampered and non-tampered videos

Attack             Total number   Number correctly   Classification
                   of videos      classified         accuracy (%)
-------------------------------------------------------------------
Non-tampered            45              45               100
Frame addition         250             250               100
Frame removal          250             248                99.2
Frame shuffling        250             246                98.4
Total                  795             789                99.2

Table 2 summarizes the results of the proposed video authentication algorithm. For non-tampered videos and videos subjected to the frame addition attack, the proposed algorithm does not make any errors and yields 100% correct classification. In particular, the 30 non-tampered videos which were subjected to acceptable MPEG compression and scaling operations are all correctly classified as non-tampered. This result shows that the proposed algorithm can handle video processing operations which do not change the integrity of the video data. For the frame removal and shuffling attacks, we obtained classification accuracies of 99.2 and 98.4%, respectively. Thus, the overall classification accuracy of the proposed algorithm is 99.2%. For the frame removal and shuffling attacks, the proposed algorithm misclassified six tampered videos because the difference between the frames was very small, and even after tampering the relative correlation values remained high, without any significant deviation.

We further analyzed the values of the relative correlation information RC for non-tampered and tampered video streams (Figs. 8–11). Figure 8 shows the values for two non-tampered videos; the relative correlation values for such videos lie in the range 0.65–0.95. For the frame addition attack, Fig. 9 shows the relative correlation values of the tampered videos. The relative correlation values involving tampered frames lie between 0.1 and 0.3 and are much lower than the RC values of the non-tampered frames. The frame removal and shuffling attacks also yield lower relative correlation values, as shown in Figs. 10 and 11, respectively. As described in Steps 3–5 of the proposed tamper detection algorithm, analyzing the RC values obtained from a video identifies the specific frames that have been altered. The algorithm successfully determined the altered frames in all the videos except the six misclassified ones. These results show the efficacy of the proposed video authentication algorithm for the three video tampering attacks, namely frame addition, frame removal, and frame shuffling. We next evaluated the performance of the proposed algorithm for the frame alteration and other attacks. For this experiment, we used commercial software to remove and add objects in the frames. We prepared two such video files: one with object removal and one with object addition. For the other attacks (noise addition and blurring), we created two tampered videos affected by these attacks.

(Figure: relative correlation information vs. frame number, two panels)
Fig. 8. Plots showing the relative correlation information of non-tampered videos. Relative correlation values for such videos lie in the range of 0.65–0.95

(Figure: relative correlation information vs. frame number, two panels)
Fig. 9. Examples of relative correlation values from videos subjected to the frame addition attack. In the first example, one frame has been added to the video, whereas in the second example, eight new frames have been added

(Figure: relative correlation information vs. frame number, two panels)
Fig. 10. Examples of relative correlation values from videos subjected to the frame removal attack. In the first example, one frame has been deleted from the video, whereas in the second example, four frames have been deleted

(Figure: relative correlation information vs. frame number, two panels)
Fig. 11. Examples of relative correlation values from videos subjected to the frame shuffling attack. In the first example, the positions of two frames are shuffled, whereas in the second example, the positions of four frames are shuffled

Table 3. Theoretical and experimental comparison of the proposed video authentication algorithm with the motion trajectory based video authentication algorithm [23]

                   Motion trajectory based         Proposed relative correlation
                   video authentication [23]       information algorithm
---------------------------------------------------------------------------------
Basic concept      Motion trajectory based         Local relative correlations are
                   master key computation          computed from the frames
Classification     Using empirical thresholds      Using a non-linear support
                   and a cosine correlation        vector machine
                   measure
Advantage          Simple algorithm; handles       Handles all five attacks
                   the frame addition and          mentioned in Sect. 1
                   removal attacks
Disadvantage       Cannot handle the shuffling,    Computationally expensive
                   alteration, and other           due to the use of SVM
                   attacks; acceptable
                   operations are classified
                   as tampered
Accuracy (%)       80.7                            99.2
Average time (s)   16.59                           20.67

The proposed algorithm is able to correctly classify these videos as tampered and also to detect the location of the tampered frames.

We also compared the performance of the proposed video authentication algorithm with the motion trajectory based video authentication algorithm [23]. Table 3 shows the theoretical and experimental comparison. The motion trajectory algorithm is fast and simple but is not able to detect the frame shuffling, alteration, and other attacks. On the other hand, the proposed algorithm uses an intelligent SVM classification algorithm which is able to detect all the attacks. Experimentally, we found that the proposed algorithm outperforms the motion trajectory based algorithm by 18.5%. However, the proposed algorithm is around 4 s slower than the motion trajectory algorithm. The computational time includes the time taken to read the video frames (300 frames), extract the features, and perform classification. Thus, the proposed algorithm achieves a large gain in classification accuracy with only a minor increase in computational time.

5 Conclusion

Video authentication is a very challenging problem and is of high importance in several applications, such as presenting video evidence in a court of law and video surveillance. Existing video authentication algorithms use watermarking or digital signature based approaches. Digital signature based algorithms can be deceived if the digital signature is compromised, and watermarking based algorithms are not acceptable in a court of law because the video has been altered during watermark embedding and extraction. To address these issues, we have proposed an efficient video authentication algorithm which can detect multiple video tampering attacks. The proposed algorithm computes the relative correlation information between all adjacent frames of a video and projects it onto a non-linear SVM hyperplane to determine whether the video is tampered. The algorithm is validated on an extensive video database containing 795 tampered and non-tampered videos, and the results show that it yields a classification accuracy of 99.2%. In the future, we would like to extend the proposed algorithm to handle rapid camera movement and multi-shot video tampering.

References

1. Bhattacharjee S, Kutter M (1998) Compression tolerant image authentication. In Proceedings of IEEE International Conference on Image Processing, 1:435–439

2. Celik MU, Sharma G, Tekalp AM, Saber E (2002) Video authentication with self recovery. In Proceedings of Security and Watermarking of Multimedia Contents IV, 4314:531–541

3. Cross D, Mobasseri BG (2002) Watermarking for self authentication of compressed video. In Proceedings of IEEE International Conference on Image Processing, 2:913–916

4. Diffie W, Hellman ME (1976) New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644–654

5. Ditmann J, Steinmetz A, Steinmetz R (1999) Content based digital signature for motion pictures authentication and content fragile watermarking. In Proceedings of IEEE International Conference on Multimedia Computing and Systems, 2:209–213

6. He D, Sun Q, Tian Q (2003) A semi fragile object based video authentication system. In Proceedings of International Symposium on Circuits and Systems, 3:814–817

7. He D, Sun Q, Tian Q (2004) A secure and robust object-based video authentication system. EURASIP Journal on Applied Signal Processing, 14:2185–2200

8. Kovesi PD (1999) Image features from phase congruency. Videre: Journal of Computer Vision Research, 1(3)

9. Latecki L, de Wildt D, Hu J (2001) Extraction of key frames from videos by optimal color composition matching and polygon simplification. In Proceedings of Multimedia Signal Processing, 245–250

10. Lin CY, Chang SF (1999) Issues and solutions for authenticating MPEG video. In SPIE Electronic Imaging Security and Watermarking of Multimedia Contents, 3657:54–65

11. Lu CS, Liao HYM (2003) Structural digital signature for image authentication: An incidental distortion resistant scheme. IEEE Transactions on Multimedia, 5(2):161–173

12. Mobasseri BG, Evans AE (2001) Content dependent video authentication by self watermarking in color space. In Proceedings of Security and Watermarking of Multimedia Contents III, 4314:35–46

13. Pramateftakis A, Oelbaum T, Diepold K (2004) Authentication of MPEG-4-based surveillance video. In Proceedings of IEEE International Conference on Image Processing, 1:33–37

14. Queluz MP (1998) Toward robust, content based techniques for image authentication. In Proceedings of IEEE Second Workshop on Multimedia Signal Processing, 297–302

15. Quisquater J (1997) Authentication of sequences with the SL2 hash function: application to video sequences. Journal of Computer Security, 5(3):213–223

16. Singh R, Vatsa M, Noore A (2006) Intelligent biometric information fusion using support vector machine. In Soft Computing in Image Processing: Recent Advances, Springer Verlag, 327–350

17. Sun Q, Chang SF, Maeno K (2002) A new semi fragile image authentication framework combining ECC and PKI infrastructure. In Proceedings of IEEE International Symposium on Circuits and Systems, 2:440–443

18. Sun Q, He D, Zhang Z, Tian Q (2003) A secure and robust approach to scalable video authentication. In Proceedings of International Conference on Multimedia and Expo, 2:209–212

19. Thiemert S, Sahbi H, Steinebach M (2006) Using entropy for image and video authentication watermarks. In Proceedings of SPIE Security, Steganography, and Watermarking of Multimedia Contents VIII, 6072:470–479

20. Uehara T, Safavi-Naini R, Ogunbona P (2004) An MPEG tolerant authentication system for video data. In Proceedings of IEEE International Conference on Multimedia and Expo, 2:891–894

21. Vapnik VN (1995) The Nature of Statistical Learning Theory. Springer Verlag, Berlin

22. Wohlmacher P (1998) Requirements and mechanisms of IT-security including aspects of multimedia security. In Proceedings of Multimedia and Security Workshop at ACM Multimedia, 11

23. Yan WQ, Kankanhalli MS (2003) Motion trajectory based video authentication. In Proceedings of International Symposium on Circuits and Systems, 3:810–813

24. Yin P, Yu HH (2001) Classification of video tampering methods and countermeasures using digital watermarking. In Proceedings of SPIE Multimedia Systems and Applications IV, 4518:239–246

25. Zhao L, Qi W, Li S, Yang S, Zhang H (2002) Key frame extraction and shot retrieval using Nearest Feature Line (NFL). In Proceedings of International Workshop on Multimedia Information Retrieval, in conjunction with ACM Multimedia Conference, 217–220
