
Preface

This volume contains revised versions of selected papers presented during the 28th Annual Conference of the Gesellschaft für Klassifikation (GfKl), the German Classification Society. The conference was held at the Universität Dortmund in Dortmund, Germany, in March 2004. Wolfgang Gaul chaired the program committee; Claus Weihs and Ernst-Erich Doberkat were the local organizers. Patrick Groenen, Iven van Mechelen, and their colleagues of the Vereniging voor Ordinatie en Classificatie (VOC), the Dutch-Flemish Classification Society, organized special VOC sessions.

The program committee recruited 17 notable and internationally renowned invited speakers for plenary and semi-plenary talks on their current research work regarding classification and data analysis methods as well as applications. In addition, 172 invited and contributed papers by authors from 18 countries were presented at the conference in 52 parallel sessions representing the whole field addressed by the title of the conference “Classification: The Ubiquitous Challenge”. Among these 52 sessions the VOC organized sessions on Mixture Modelling, Optimal Scaling, Multiway Methods, and Psychometrics with 18 papers. Overall, the conference, which is traditionally designed as an interdisciplinary event, again provided an attractive forum for discussions and mutual exchange of knowledge.

Besides the results obtained in the fundamental subjects Classification and Data Analysis, the talks in the applied areas focused on various application topics. Moreover, along with the conference a competition on “Social Milieus in Dortmund”, co-organized by the city of Dortmund, took place. Hence the presentation of the papers in this volume is arranged in the following parts:

I. (Semi-)Plenary Presentations
II. Classification and Data Analysis
III. Applications, and
IV. Contest: Social Milieus in Dortmund.

The part on applications has sub-chapters according to the different application fields Archaeology, Astronomy, Bio-Sciences, Electronic Data and Web, Finance and Insurance, Library Science and Linguistics, Macro-Economics, Marketing, Music Science, and Quality Assurance. Within (sub-)parts papers are mainly arranged in alphabetical order with respect to (first) authors' names.

I.

Plenary and semi-plenary lectures include both conceptual and applied papers. Among the conceptual papers Erosheva and Fienberg present a fully Bayesian approach to soft clustering and classification within a general framework of mixed membership, Friendly introduces the Milestones Project on documentation and illustration of historical developments in statistical graphics, Hornik discusses consensus partitions particularly when applied to analyze the structure of cluster ensembles, Kiers gives an overview of procedures for constructing bootstrap confidence intervals for the solutions of three-way component analysis techniques, Pahl argues that a classification framework can organize knowledge about software components' characteristics, and Uter and Gefeller define partial attributable risk as a unique solution for allocating shares of attributable risk to risk factors. Within the applied papers Beran presents preprocessing of musical data utilizing prior knowledge from musicology, Fischer et al. introduce a method for the prediction of spatial properties of molecules from the sequence of amino acids incorporating biological background knowledge, Grzybek et al. discuss how far word length may contribute to quantitative typology of texts, and Snoek and Worring present the Time Interval Multimedia Event framework as a robust approach for classification of semantic events in multimodal soccer video.

II.

The second part of this volume is concerned with methodological progress in classification and data analysis, and the methods presented cover a variety of different aspects.

In the Classification part, more precise confidence intervals for the parameters of latent class models using the bootstrap method are proposed (Dias), as well as a method of feature selection for ensembles that significantly reduces the dimensionality of subspaces (Gatnar), and a sensitive two-stage classification system for the detection of events in spite of a noisy background in the processing of thousands of images in a few seconds (Hader and Hamprecht). Variants of bagging and boosting that make use of an ordinal response structure are discussed (Hechenbichler and Tutz), a methodology for exploring two quality aspects of cluster analyses, namely separation and homogeneity of clusters, is presented (Hennig), and a comparison of Adaboost to Arc-x(h) for different values of h in the subsampling of binary classification data is carried out (Khanchel and Limam). The method of distance-based discriminant analysis (DDA) is introduced, which finds a linear transformation that optimizes an asymmetric data separability criterion via iterative majorization and the necessary number of discriminative dimensions (Kosinov et al.), an efficient hybrid methodology to obtain CHAID tree segments based on multiple dependent variables of possibly different scale types is proposed (Magidson and Vermunt), and possibilities of defining the expectation of p-dimensional intervals (Nordhoff) are described. Design of experiments is introduced into variable selection in classification (Pumplun et al.), as well as the KMC/EDAM method for classification and visualization as an alternative to Kohonen Self-Organizing Maps (Raabe et al.). A clustering of variables approach extended to situations with missing data based on different imputation methods (Sahmer et al.), a method for binary online-classification incorporating temporally distributed information (Schafer et al.), and a concept of characteristic regions and a new method, called DiSCo, to simultaneously classify and visualize data (Szepannek and Luebke) are described. The part concludes with two papers discussing multivariate Pareto Density Estimation (PDE), based on information optimality, for data sets containing clusters (Ultsch) and an extension of standard latent class or mixture models that can be used for the analysis of multilevel and repeated measures data (Vermunt and Magidson).

The part on Data Analysis starts with papers proposing a robust procedure for estimating a covariance matrix under conditional independence restrictions in graphical modelling (Becker) and a new approach to find principal curves through a multidimensional, possibly branched, data cloud (Einbeck et al.). A three-way multidimensional scaling approach developed to account for individual differences in the judgments about objects, persons or brands (Krolak-Schwerdt), and the Time Series Knowledge Mining (TSKM) framework to discover temporal structures in multivariate time series based on the Unification-based Temporal Grammar (UTG) (Morchen and Ultsch) are introduced. A framework for the comparison of the information in continuous and categorical data (Nishisato) and an external analysis of two-mode three-way asymmetric multidimensional scaling for the disclosure of asymmetry (Okada and Imaizumi) are presented. Finally, nonparametric regression with the Relevance Vector Machine under inclusion of covariate measurement error (Rummel) is described.

III.

In the third part of this volume all contributions are also related to applications of classification and data analysis methods but structured by their application field.

Two papers deal with applications in Archaeology. The first is a historical overview (Ihm) of early publications about formal methods on seriation of archaeological finds; in the second article some cluster analysis models including different data transformations are investigated in order to differentiate between brickyards of different areas on the basis of chemical analysis (Mucha et al.).

Another two papers (both by Bailer-Jones) discuss applications in Astronomy. A brief overview of the upcoming Gaia astronomical survey mission, a major European project to map and classify over a billion stars in our Galaxy, and an outline of the challenges are given in the first paper while in the second a novel method based on evolutionary algorithms for designing filter systems for astronomical surveys in order to provide optimal data on stars and to determine their physical parameters is introduced.

The articles with applications in the Bio-Sciences all deal with enzyme, DNA, microarray, or protein data, except the presentation of results of a systematic and quantitative comparison of pattern recognition methods in the analysis of clinical magnetic resonance spectra applied to the detection of brain tumor (Menze et al.). The Generative Topographic Mapping approach as an alternative to SOM for the analysis of microarray data (Grimmenstein et al.) and a finite conservative test for detecting a change point in a binary sequence with Markov dependence and applications in DNA analysis (Krauth) are proposed as well as a new algorithm for finding similar substructures in enzyme active sites with the use of emergent self-organizing neural networks (Kupas and Ultsch). How the feature selection procedure “Significance Analysis of Microarrays” (SAM) and the classification method “Prediction Analysis of Microarrays” (PAM) can be applied to “Single Nucleotide Polymorphism” (SNP) data is explained (Schwender) as well as that using relative differences (RelDiff) instead of LogRatios for cDNA microarray analysis solves several problems like unlimited ranges, numerical instability and rounding errors (Ultsch). Finally, a novel method, PhyNav, to reconstruct the evolutionary relationship from really large DNA and protein datasets is introduced applying the maximum likelihood principle (Vinh et al.).

Among the contributions on applications to Electronic Data and Web one paper discusses the application of clustering with restricted random walks on library usage histories in large document sets containing millions of objects (Franke and Thede). In the other four papers different aspects of web-mining are tackled. A tool is described assisting users of online news web-sites in order to reduce information overload (Bomhardt and Gaul), benchmarks are offered with respect to competition and visibility indices as predictors for traffic in web-sites (Schmidt-Manz and Gaul), an algorithm is introduced for fuzzy two-mode clustering that outperforms collaborative filtering (Schlecht and Gaul), and visualizations of online search queries are compared to improve understanding of searching, viewing, and buying behavior of online shoppers and to further improve the generation of recommendations (Thoma and Gaul).

Two of the articles on Finance and Insurance deal with insurance problems: A strategy based on a combination of support vector regression and kernel logistic regression to detect and to model high-dimensional dependency structures in car insurance data sets is proposed (Christmann) and support vector machines are compared to traditional statistical classification procedures in a life insurance environment (Steel and Hechter). Applications in Finance deal with evaluation of global and local statistical models for complex data sets of credit risks with respect to practical constraints and asymmetric cost functions (Schwarz and Arminger), show how linear support vector machines select informative patterns from a credit scoring data pool serving as inputs for traditional methods more familiar to practitioners (Stecking and Schebesch), analyze the question of risk budgeting in continuous time (Straßberger), and formulate a one-factor model for the correlation between probabilities of default across industry branches, comparing it to more traditional methods on the basis of insolvency rates for Germany (Weißbach and Rosenow).

Besides one contribution on Library Science where it is argued that the history of classification is intensively linked to the history of library science (Lorenz), the volume includes five papers on applications in Linguistics. It is shown that one meta-linguistic relation suffices to model the concept structure of the lexicon making use of intensional logic (Bagheri), that improvements of the morphological segmentation of words using classical distributional methods are possible (Benden), and that in Russian texts (letters and poems by three different authors) word length is a characteristic of genre, rather than of authorship (Kelih et al.). A validation method of cluster analysis methods concerning the number and stability of clusters is described with the help of an application in linguistics (Mucha and Haimerl), clustering of word contexts is used in a large collection of texts for word sense induction, i.e. automatic discovery of the possible senses for a given ambiguous word (Rapp), and formal graphs that structure a document-related information space by using a natural language processing chain and a wrapping procedure are proposed (Rist).

There are three papers with applications in Macro-Economics, two of them dealing with the comparison of economic structures of different countries. The sensitivity of economic rankings of countries based on indicator variables is discussed (Berrer et al.), structural variables of the 25 member European Union are analyzed and patterns are found to be quite different between the 15 current and the 10 new members (Sell), while the question whether methods measuring (relative) importance of variables in the context of classification allow interpretation of individual effects of highly correlated economic predictors for the German business cycle (Enache and Weihs) is tackled in a more methods-based contribution.

Within the Marketing applications one article shows by means of an intercultural survey (Bauer et al.) that the cyber community is not a homogeneous group since online consumers can be classified into the three clusters: “risk avers doubters”, “open minded online-shoppers” and “reserved information seekers”. Two papers deal with reservation prices. A novel estimation procedure of reservation prices combining adaptive conjoint analysis with a choice task using individually adapted price scales is proposed (Breidert et al.), and an explicit evaluation of variants of conjoint analysis together with two types of data collection is described for the detection of reservation prices of product bundles applied to a seat system offered by a German car manufacturer (Stauß and Gaul).

Music Science is an application field that is present at GfKl conferences for the first time. In this volume one paper deals with time series analysis, the other five papers apply classification methods. A new algorithm structure is introduced for feature extraction from time series, its efficiency is proved, and it is illustrated by different classification tasks for audio data (Mierswa). Classification methods are used to show that the more unstable the musical sound is in the time domain, the more pitch bending is admitted to the musician expressing emotions by music (Fricke). Classification rules for quality classes of “sight reading” (SR) are derived (Kopiez et al.) based on indicators of piano practice, mental speed, working memory, inner hearing etc. as well as the total SR performance of 52 piano students. Classification rules are also found for digitized sounds played by different instruments based on the Hough-transform (Rover et al.). Finally, classifications of possibly overlapping drum sounds by linear support vector machines (Van Steelant et al.) and of singers and instruments into high or low musical registers only by means of timbre, i.e. after elimination of pitch information, are proposed (Weihs et al.).

Applications in Quality Assurance include one methodological paper (Jessenberger and Weihs) which proposes the use of the expected value of the so-called desirability function to assess the capability of a process. The other papers discuss different statistical aspects of a deep hole drilling process in machine building. The Lyapunov exponent is used for the discrimination between well-predictable and not-well-predictable time series with applications in quality control (Busse). Two multivariate control charts to monitor the drilling process in order to prevent chatter vibrations and to secure production with high quality are proposed (Messaoud et al.) as well as a procedure to assess the changing amplitudes of relevant frequencies over time based on the distribution of periodogram ordinates (Theis and Weihs).

IV.

The fourth part of this volume starts with an introduction to the competition on “Social Milieus in Dortmund” (Sommerer and Weihs). Moreover, the best three papers of the competition by Scheid, by Schafer and Lemm, and by Rover and Szepannek appear in this volume. We would like to thank the head of the “dortmund-project”, Udo Mager, and the head of the Fachbereich “Statistik und Wahlen” of the City of Dortmund, Ernst-Otto Sommerer, for their kind support.

The conference owed much to its sponsors (in alphabetical order)

• Deutsche Forschungsgemeinschaft (DFG), Bonn,
• dortmund-project, Dortmund,
• Fachbereich Statistik, Universität Dortmund, Dortmund,
• Landesbeauftragter für die Beziehungen zwischen den Hochschulen in NRW und den Beneluxstaaten,
• Novartis, Basel, Switzerland,
• Roche Diagnostics, Penzberg,
• sas Deutschland, Heidelberg,
• Sonderforschungsbereich 475, Dortmund,
• Springer-Verlag, Heidelberg,
• Universität Dortmund, and
• John Wiley and Sons, Chichester, UK.

who helped in many ways. Their generous support is gratefully acknowledged.

Additionally, we wish to express our gratitude to the authors of the papers in the present volume, not only for their contributions, but also for their diligence and timely production of the final versions of their papers. Furthermore, we thank the reviewers for their careful reviews of the originally submitted papers, and in this way, for their support in selecting the best papers for this publication.

We would like to emphasize the outstanding work of Uwe Ligges and Nils Raabe who did an excellent job in organizing the program of the conference and the refereeing process as well as in preparing the abstract booklet and this volume, respectively. We also wish to thank our colleague Prof. Dr. Ernst-Erich Doberkat, Fachbereich Informatik, University Dortmund, for co-organizing the conference, and the Fachbereich Statistik of the University Dortmund for all the support, in particular Anne Christmann, Dr. Daniel Enache, Isabelle Grimmenstein, Dr. Sonja Kuhnt, Edelgard Kurbis, Karsten Luebke, Dr. Constanze Pumplun, Oliver Sailer, Roland Schultze, Sibylle Sturtz, Dr. Winfried Theis, Magdalena Thone, and Dr. Heike Trautmann as well as other members and students of the Fachbereich for helping to organize the conference and making it a big success, and Alla Stankjawitschene and Dr. Stefan Dißmann from the Fachbereich Informatik for all they did in organizing all financial affairs.

Finally, we want to thank Christiane Beisel and Dr. Martina Bihn of Springer-Verlag, Heidelberg, for their support and dedication to the production of this volume.

Dortmund and Karlsruhe, April 2005
Claus Weihs, Wolfgang Gaul

Contents

Part I. (Semi-) Plenary Presentations

Classification and Data Mining in Musicology
Jan Beran

Bayesian Mixed Membership Models for Soft Clustering and Classification
Elena A. Erosheva, Stephen E. Fienberg

Predicting Protein Secondary Structure with Markov Models
Paul Fischer, Simon Larsen, Claus Thomsen

Milestones in the History of Data Visualization: A Case Study in Statistical Historiography
Michael Friendly

Quantitative Text Typology: The Impact of Word Length
Peter Grzybek, Ernst Stadlober, Emmerich Kelih, Gordana Antic

Cluster Ensembles
Kurt Hornik

Bootstrap Confidence Intervals for Three-way Component Methods
Henk A.L. Kiers

Organising the Knowledge Space for Software Components
Claus Pahl

Multimedia Pattern Recognition in Soccer Video Using Time Intervals
Cees G.M. Snoek, Marcel Worring

Quantitative Assessment of the Responsibility for the Disease Load in a Population
Wolfgang Uter, Olaf Gefeller

Part II. Classification and Data Analysis

Classification

Bootstrapping Latent Class Models
Jose G. Dias

Dimensionality of Random Subspaces
Eugeniusz Gatnar

Two-stage Classification with Automatic Feature Selection for an Industrial Application
Soren Hader, Fred A. Hamprecht

Bagging, Boosting and Ordinal Classification
Klaus Hechenbichler, Gerhard Tutz

A Method for Visual Cluster Validation
Christian Hennig

Empirical Comparison of Boosting Algorithms
Riadh Khanchel, Mohamed Limam

Iterative Majorization Approach to the Distance-based Discriminant Analysis
Serhiy Kosinov, Stephane Marchand-Maillet, Thierry Pun

An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables
Jay Magidson, Jeroen K. Vermunt

Expectation of Random Sets and the ‘Mean Values’ of Interval Data
Ole Nordhoff

Experimental Design for Variable Selection in Data Bases
Constanze Pumplun, Claus Weihs, Andrea Preusser

KMC/EDAM: A New Approach for the Visualization of K-Means Clustering Results
Nils Raabe, Karsten Luebke, Claus Weihs

Clustering of Variables with Missing Data: Application to Preference Studies
Karin Sahmer, Evelyne Vigneau, El Mostafa Qannari, Joachim Kunert

Binary On-line Classification Based on Temporally Integrated Information
Christin Schafer, Steven Lemm, Gabriel Curio

Different Subspace Classification
Gero Szepannek, Karsten Luebke

Density Estimation and Visualization for Data Containing Clusters of Unknown Structure
Alfred Ultsch

Hierarchical Mixture Models for Nested Data Structures
Jeroen K. Vermunt, Jay Magidson

Data Analysis

Iterative Proportional Scaling Based on a Robust Start Estimator
Claudia Becker

Exploring Multivariate Data Structures with Local Principal Curves
Jochen Einbeck, Gerhard Tutz, Ludger Evers

A Three-way Multidimensional Scaling Approach to the Analysis of Judgments About Persons
Sabine Krolak-Schwerdt

Discovering Temporal Knowledge in Multivariate Time Series
Fabian Morchen, Alfred Ultsch

A New Framework for Multidimensional Data Analysis
Shizuhiko Nishisato

External Analysis of Two-mode Three-way Asymmetric Multidimensional Scaling
Akinori Okada, Tadashi Imaizumi

The Relevance Vector Machine Under Covariate Measurement Error
David Rummel

Part III. Applications

Archaeology

A Contribution to the History of Seriation in Archaeology
Peter Ihm

Model-based Cluster Analysis of Roman Bricks and Tiles from Worms and Rheinzabern
Hans-Joachim Mucha, Hans-Georg Bartel, Jens Dolata

Astronomy

Astronomical Object Classification and Parameter Estimation with the Gaia Galactic Survey Satellite
Coryn A.L. Bailer-Jones

Design of Astronomical Filter Systems for Stellar Classification Using Evolutionary Algorithms
Coryn A.L. Bailer-Jones

Bio-Sciences

Analyzing Microarray Data with the Generative Topographic Mapping Approach
Isabelle M. Grimmenstein, Karsten Quast, Wolfgang Urfer

Test for a Change Point in Bernoulli Trials with Dependence
Joachim Krauth

Data Mining in Protein Binding Cavities
Katrin Kupas, Alfred Ultsch

Classification of In Vivo Magnetic Resonance Spectra
Bjorn H. Menze, Michael Wormit, Peter Bachert, Matthias Lichy, Heinz-Peter Schlemmer, Fred A. Hamprecht

Modifying Microarray Analysis Methods for Categorical Data – SAM and PAM for SNPs
Holger Schwender

Improving the Identification of Differentially Expressed Genes in cDNA Microarray Experiments
Alfred Ultsch

PhyNav: A Novel Approach to Reconstruct Large Phylogenies
Le Sy Vinh, Heiko A. Schmidt, Arndt von Haeseler

Electronic Data and Web

NewsRec, a Personal Recommendation System for News Websites
Christian Bomhardt, Wolfgang Gaul

Clustering of Large Document Sets with Restricted Random Walks on Usage Histories
Markus Franke, Anke Thede

Fuzzy Two-mode Clustering vs. Collaborative Filtering
Volker Schlecht, Wolfgang Gaul

Web Mining and Online Visibility
Nadine Schmidt-Manz, Wolfgang Gaul

Analysis of Recommender System Usage by Multidimensional Scaling
Patrick Thoma, Wolfgang Gaul

Finance and Insurance

On a Combination of Convex Risk Minimization Methods
Andreas Christmann

Credit Scoring Using Global and Local Statistical Models
Alexandra Schwarz, Gerhard Arminger

Informative Patterns for Credit Scoring Using Linear SVM
Ralf Stecking, Klaus B. Schebesch

Application of Support Vector Machines in a Life Assurance Environment
Sarel J. Steel, Gertrud K. Hechter

Continuous Market Risk Budgeting in Financial Institutions
Mario Straßberger

Smooth Correlation Estimation with Application to Portfolio Credit Risk
Rafael Weißbach, Bernd Rosenow

Library Science and Linguistics

How Many Lexical-semantic Relations are Necessary?
Dariusch Bagheri

Automated Detection of Morphemes Using Distributional Measurements
Christoph Benden

Classification of Author and/or Genre? The Impact of Word Length
Emmerich Kelih, Gordana Antic, Peter Grzybek, Ernst Stadlober

Some Historical Remarks on Library Classification – a Short Introduction to the Science of Library Classification
Bernd Lorenz

Automatic Validation of Hierarchical Cluster Analysis with Application in Dialectometry
Hans-Joachim Mucha, Edgar Haimerl

Discovering the Senses of an Ambiguous Word by Clustering its Local Contexts
Reinhard Rapp

Document Management and the Development of Information Spaces
Ulfert Rist

Macro-Economics

Stochastic Ranking and the Volatility “Croissant”: A Sensitivity Analysis of Economic Rankings
Helmut Berrer, Christian Helmenstein, Wolfgang Polasek

Importance Assessment of Correlated Predictors in Business Cycles Classification
Daniel Enache, Claus Weihs

Economic Freedom in the 25-Member European Union: Insights Using Classification Tools
Clifford W. Sell

Marketing

Intercultural Consumer Classifications in E-Commerce
Hans H. Bauer, Marcus M. Neumann, Frank Huber

Reservation Price Estimation by Adaptive Conjoint Analysis
Christoph Breidert, Michael Hahsler, Lars Schmidt-Thieme

Estimating Reservation Prices for Product Bundles Based on Paired Comparison Data
Bernd Stauß, Wolfgang Gaul

Music Science

Classification of Perceived Musical Intervals
Jobst P. Fricke

In Search of Variables Distinguishing Low and High Achievers in a Music Sight Reading Task
Reinhard Kopiez, Claus Weihs, Uwe Ligges, Ji In Lee

Automatic Feature Extraction from Large Time Series
Ingo Mierswa

Identification of Musical Instruments by Means of the Hough-Transformation
Christian Rover, Frank Klefenz, Claus Weihs

Support Vector Machines for Bass and Snare Drum Recognition
Dirk Van Steelant, Koen Tanghe, Sven Degroeve, Bernard De Baets, Marc Leman, Jean-Pierre Martens

Register Classification by Timbre
Claus Weihs, Christoph Reuter, Uwe Ligges

Quality Assurance

Classification of Processes by the Lyapunov Exponent
Anja M. Busse

Desirability to Characterize Process Capability
Jutta Jessenberger, Claus Weihs

Application and Use of Multivariate Control Charts in a BTA Deep Hole Drilling Process
Amor Messaoud, Winfried Theis, Claus Weihs, Franz Hering

Determination of Relevant Frequencies and Modeling Varying Amplitudes of Harmonic Processes
Winfried Theis, Claus Weihs


Part IV. Contest: Social Milieus in Dortmund

Introduction to the Contest “Social Milieus in Dortmund”
Ernst-Otto Sommerer, Claus Weihs

Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering
Christian Rover, Gero Szepannek

Annealed k-Means Clustering and Decision Trees
Christin Schafer, Julian Laub

Correspondence Clustering of Dortmund City Districts
Stefanie Scheid

Keywords

Authors

Part I

(Semi-) Plenary Presentations

Classification and Data Mining in Musicology

Jan Beran

Department of Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany

Abstract. Data in music are complex and highly structured. In this talk a number of descriptive and model-based methods are discussed that can be used as pre-processing devices before standard methods of classification, clustering etc. can be applied. The purpose of pre-processing is to incorporate prior knowledge in musicology and hence to filter out information that is relevant from the point of view of music theory. This is illustrated by a number of examples from classical music, including the analysis of scores and of musical performance.

1 Introduction

Mathematical considerations in music have a long tradition. The most obvious connection between mathematics and music is through physics. For instance, in ancient Greece, the Pythagoreans discovered the musical significance of simple frequency ratios such as 2/1 (octave), 3/2 (pure fifth), 4/3 (pure fourth) etc., and their relation to the length of a string. There are, however, deeper connections between mathematical and musical structures that go far beyond acoustics. Many of these can be discovered using techniques from data mining, together with a priori knowledge from music theory. The results can be used, for instance, to solve classification problems. This is illustrated in the following sections by three types of examples.
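As a small illustration of the Pythagorean ratios mentioned above, the following Python sketch converts each ratio to the relative string length that produces it and to its size in cents. It assumes an ideal string of fixed tension and density, for which frequency is inversely proportional to vibrating length. The cent scale (1200 cents per octave) and all function names are conveniences introduced here, not part of the original text.

```python
import math
from fractions import Fraction

# Simple frequency ratios named in the text.
INTERVALS = {
    "octave": Fraction(2, 1),
    "pure fifth": Fraction(3, 2),
    "pure fourth": Fraction(4, 3),
}

def string_length_ratio(freq_ratio):
    """Relative string length for a given frequency ratio.

    Assumes an ideal string (fixed tension and density), so that
    frequency is inversely proportional to vibrating length.
    """
    return 1 / freq_ratio

def cents(freq_ratio):
    """Interval size in cents (1200 cents = one octave)."""
    return 1200 * math.log2(freq_ratio)

for name, ratio in INTERVALS.items():
    print(name, ratio, string_length_ratio(ratio), round(cents(ratio), 1))
```

Stacking a pure fifth on a pure fourth multiplies the ratios (3/2 times 4/3) and gives exactly one octave, which is one way to see why these three intervals belong together.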

2 Music, 1/f-noise, fractals and chaos

In their celebrated, but also controversial, paper, Voss and Clarke (1975) postulated that recorded music is essentially 1/f-noise (in the spectral domain), after high frequencies have been eliminated. (The term 1/f-noise is generally used for random processes whose power spectrum is dominated by low frequencies f such that its value is proportional to 1/f.) Can we verify this statement? First, the following question needs to be asked: Which aspects of a composition does recorded music represent? Sound waves are determined not only by the selection of notes, but also by the instrumental sound itself. It turns out, however, that the sound wave of a musical instrument often resembles 1/f-noise (see e.g. Beran (2003)). Thus, if recorded music looks like 1/f-noise, this may be due to the instrument rather than a particular composition. To separate instrumental sounds from composed music, we therefore consider the score itself, in terms of pitch and onset time. The problem of superposition of notes in polyphonic music is solved by replacing chords by arpeggio chords, that is, replacing a chord by the sequence of notes in the chord starting with the lowest note. In order to eliminate high frequencies and to simplify the spectral density, data are aggregated by taking averages over disjoint blocks of k = 7 notes (see Beran and Ocker (2001) and Tsai and Chan (2004) for a theoretical justification). Subsequently, a semiparametric fractional model with nonparametric trend function, the so-called SEMIFAR-model (Beran and Feng (2002), also see Beran (1994)), is fitted to the aggregated series. In a SEMIFAR-model, the stochastic part has a generalized spectral density behaving at the origin like 1/f^α (where f is the frequency) with α = 2d for some d > −1/2. Thus, 1/f-noise corresponds to d = 1/2. Figure 1 shows smoothed histograms of α (plotted as −α = −2d) for four different time periods. The results are based on 60 compositions ranging from the 13th to the 20th century. Apparently a value around α = 1 is favored in classical music up to the early romantic period (first three distributions, from above). However, this preference is less clear in the late 19th and the 20th century.

Similar investigations can be made for other characteristics of a composition. For instance, we may consider onset time gaps between occurrences of a particular note. Figure 2 displays typical log-log-periodograms and fitted spectra, for gap series referring to the most frequent note (modulo 12). Note that, near zero, each fitted log-log-curve essentially behaves like a straight line with estimated slope −α.
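The pre-processing steps described in this section (arpeggiation, aggregation over blocks of k = 7 notes, and reading α off the low-frequency end of a log-log spectrum) can be sketched in Python as follows. This is only a rough illustration under stated assumptions: pitches are taken to be integers (for instance MIDI numbers), a chord is a list of such pitches, and α is estimated by a plain least-squares regression of the log-periodogram on log-frequency. That regression is a simple stand-in and is not the SEMIFAR fit used in the text; all function names are hypothetical.

```python
import cmath
import math

def arpeggiate(score):
    """Replace each chord by its notes in ascending order, lowest first."""
    notes = []
    for chord in score:            # each chord is a list of pitches
        notes.extend(sorted(chord))
    return notes

def block_means(series, k=7):
    """Average over disjoint blocks of k values; an incomplete last block is dropped."""
    n_blocks = len(series) // k
    return [sum(series[i * k:(i + 1) * k]) / k for i in range(n_blocks)]

def periodogram(series):
    """Periodogram at the Fourier frequencies 2*pi*j/n, j = 1, ..., (n-1)//2."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    freqs, values = [], []
    for j in range(1, (n - 1) // 2 + 1):
        lam = 2 * math.pi * j / n
        dft = sum(x * cmath.exp(-1j * lam * t) for t, x in enumerate(centred))
        freqs.append(lam)
        values.append(abs(dft) ** 2 / (2 * math.pi * n))
    return freqs, values

def loglog_slope_alpha(series, n_low=None):
    """Estimate alpha in 1/f^alpha from the n_low lowest Fourier frequencies.

    Regresses log(periodogram) on log(frequency) by least squares;
    alpha is minus the fitted slope. Illustrative only, not a SEMIFAR fit.
    """
    freqs, values = periodogram(series)
    m = min(n_low or max(2, int(len(series) ** 0.5)), len(freqs))
    pairs = [(math.log(f), math.log(v))
             for f, v in zip(freqs[:m], values[:m]) if v > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive periodogram values")
    x_bar = sum(x for x, _ in pairs) / len(pairs)
    y_bar = sum(y for _, y in pairs) / len(pairs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    return -slope
```

A typical call chain would be `loglog_slope_alpha(block_means(arpeggiate(score), k=7))`. The number of low frequencies used in the regression (`n_low`) is a tuning choice that the text does not specify.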

In summary, we may say that 1/f^α-behaviour with α > 0 appears to be common for many musical parameters. The fractal parameter α = 2d may be interpreted as a summary statistic of the degree of variation and memory. From the examples here it is clear, however, that 1/f-noise is not the only, though perhaps the most frequent, type of variation.

3 Music and entropy

The fractal parameter d (or α = 2d) is a measure of randomness and coherence (memory) in the sense mentioned above. Another, in some sense more direct, measure of randomness is entropy. Consider, for instance, the distribution of notes modulo 12 and its entropy. We calculate the entropy for 148 compositions by the following composers: Anonymus (dates of birth between 1200 and 1500), Halle (1240-1287), Ockeghem (1425-1495), Arcadelt (1505-1568), Palestrina (1525-1594), Byrd (1543-1623), Dowland (1562-1626), Hassler (1564-1612), Schein (1586-1630), Purcell (1659-1695), D. Scarlatti (1660-1725), F. Couperin (1668-1733), Croft (1678-1727), Rameau (1683-1764), J.S. Bach (1685-1750), Campion (1686-1748), Haydn (1732-1809), Clementi (1752-1832), W.A. Mozart (1756-1791), Beethoven (1770-1827), Chopin (1810-1849), Schumann (1810-1856), Wagner (1813-1883), Brahms (1833-1897), Faure (1845-1924), Debussy (1862-1918), Scriabin (1872-1915), Rachmaninoff (1873-1943), Schoenberg (1874-1951), Bartok (1881-1945), Webern (1883-1945), Prokoffieff (1891-1953), Messiaen (1908-1992), Takemitsu (1930-1996) and Beran (*1959). For a detailed description of how the entropy is calculated see Beran (2003).

A plot of entropy against the date of birth of the composer (figure 3) reveals a positive dependence, in particular after 1400. Why that is so can be seen, at least partially, from star plots of the distributions. Figure 4 shows a random selection of star plots ranging from the 15th to the 20th century. In order to reveal more structure, the 12 note categories are ordered according to the ascending circle of fourths. The most striking feature is that for compositions that may be classified as purely tonal in a traditional sense, there is a neighborhood of 7 to 8 adjacent notes where beams are very long, and for the rest of the categories not much can be seen. The plausible reason is that in tonal music, the circle of fourths is a dominating feature that determines a lot of the structure. This is much less the case for classical music of the 20th century. With respect to entropy it means that for newer music, the (marginal) distribution of notes is much less predictable than in earlier music (see figure 3 where composers born after 1881 are marked as “20th century”, namely Prokoffieff, Messiaen, Takemitsu, Webern and Beran). Note, however, that there are also a few outliers in figure 3. Thus, the rule is not universal, and entropy may depend on the individual composer or even the composition.

In the last millennium, music moved gradually from rather strict rules to increasing variety. It is therefore not surprising that variability increases throughout the centuries: composers simply have more choice. On the other hand, a comparison of Schumann’s entropies (which were not included in figure 3) with those by Bach points in the opposite direction (figure 5). As a cautionary remark it should also be noted that this data set is a very small, and partially unbalanced, sample from the huge number of existing compositions. For instance, Prokoffieff is included 15 times whereas many other composers of the 20th century are missing. A more systematic empirical investigation will need to be carried out to obtain more conclusive results.

Fig. 1. Distribution of −α = −2d for four different time periods (panels: up to 1700; 1700-1800; 1800-1860; after 1860).

Fig. 2. Log-log-periodograms and fitted spectra for gap time series (axes: log(frequency) versus log(spectrum); panels: Bach, Prelude and Fugue, WK I, No. 17, d = 0.5; Rameau, Le Tambourin, d = 0.5; Scarlatti, Sonata K49, d = 0.56; Rachmaninoff, op. 3, No. 2, d = 0.5).
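The entropy of a note distribution modulo 12, and the circle-of-fourths ordering used for the star plots, can be sketched in Python as follows. The sketch assumes integer pitches (for instance MIDI numbers, with pitch class 0 standing for C) and Shannon entropy with the natural logarithm. The exact definition used for the figures is given in Beran (2003) and may differ in detail; the function names are hypothetical.

```python
import math
from collections import Counter

# Pitch classes in the ascending circle of fourths: each step adds
# 5 semitones modulo 12 (0 = C is an assumed labelling).
CIRCLE_OF_FOURTHS = [(5 * i) % 12 for i in range(12)]

def pitch_class_distribution(pitches):
    """Relative frequencies of the 12 pitch classes (notes modulo 12)."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]

def entropy(probabilities):
    """Shannon entropy in nats; empty categories contribute zero."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def star_plot_order(distribution):
    """Reorder a Z12 distribution along the circle of fourths."""
    return [distribution[pc] for pc in CIRCLE_OF_FOURTHS]
```

Under this definition the entropy is zero when a single pitch class is used and reaches its maximum, log 12, when all twelve pitch classes are equally frequent. A strongly tonal piece that concentrates on 7 to 8 neighbouring categories of the circle of fourths therefore has lower entropy than a piece that uses all twelve evenly.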

4 Score information and performance

Due to advances in music technology, performance theory is a very active area of research where statistical analysis plays an essential role. In contrast to some other branches of musicology, repeated observations and controlled experiments can be carried out. With respect to music where a score exists, the following question is essential: Which information is there in a score, and how can it be quantified? Beran and Mazzola (1999a) (also see Mazzola (2002) and Beran (2003)) propose to encode structural information of a score by so-called metric, harmonic and melodic weights or indicators. These curves quantify the metric, harmonic and melodic importance of a note respectively. A modified motivic indicator based on a priori knowledge about motifs in the score is defined in Beran (2003). Figure 6 shows some indicator functions corresponding to eight different motifs in Schumann’s Traumerei. These curves can be related to observed performance data by various statistical methods (see e.g. Beran (2003), Beran and Mazzola (1999b, 2000, 2001)). For instance, figure 7 displays tempo curves of different pianists after applying data sharpening with the indicator function of motif 2. Sharpening was done by considering only those onset times where the indicator curve of motif 2 is above its 90th percentile. This leads to simplified tempo curves where differences and commonalities are more visible. Also, sharpened tempo curves can be used as input for other statistical techniques, such as classification. A typical example is given in figure 8, where clustering is based on the motif-2-sharpened tempo curves in figure 7.

Fig. 3. Entropy of notes in Z12 versus date of birth (axes: date of birth versus entropy; points labelled by composer: Arcadelt, Palestrina, Byrd, Hassler, Scarlatti, Bach, Haydn, Chopin and “20th century”).
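The sharpening step described in this section, keeping only those onset times where an indicator curve exceeds its 90th percentile, can be sketched in Python as follows. The sketch assumes that onset times, indicator values and tempo values are stored as parallel lists of equal length, and it uses the "inclusive" percentile rule of the standard library; other percentile conventions would move the threshold slightly. The Euclidean distance matrix at the end is only one possible input for a clustering routine. The text does not state which distance or clustering method was used, and all names below are hypothetical.

```python
import math
import statistics

def sharpen(onsets, indicator, tempo, quantile=0.9):
    """Keep observations whose indicator value exceeds the given quantile.

    Returns the retained onset times and the corresponding tempo values.
    """
    cuts = statistics.quantiles(indicator, n=100, method="inclusive")
    threshold = cuts[round(quantile * 100) - 1]   # cuts[89] is the 90th percentile
    keep = [i for i, w in enumerate(indicator) if w > threshold]
    return [onsets[i] for i in keep], [tempo[i] for i in keep]

def distance_matrix(curves):
    """Pairwise Euclidean distances between equally long sharpened curves."""
    return [[math.dist(a, b) for b in curves] for a in curves]
```

Because the same indicator is applied to every performance, all sharpened tempo curves refer to the same onset times and have the same length, so they can be compared point by point.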

Acknowledgements

I would like to thank B. Repp for providing us with the tempo measurements.

Fig. 4. Star plots of Z12-distribution, ordered according to the circle of fourths (panels, row by row: OCKEGHEM, ARCADELT, ARCADELT, BYRD, RAMEAU, RAMEAU; RAMEAU, BACH, BACH, BACH, SCARLATTI, HAYDN; MOZART, MOZART, SCHUMANN, SCHUMANN, SCHUMANN, CHOPIN; WAGNER, WAGNER, DEBUSSY, DEBUSSY, SCRIABIN, SCRIABIN; BARTOK, BARTOK, BARTOK, BARTOK, MESSIAEN, PROKOFFIEFF; PROKOFFIEFF, MESSIAEN, SCHOENBERG, WEBERN, TAKEMITSU, BERAN).

Fig. 5. Boxplots of entropies for Bach (left) and Schumann (right), based on note distribution in Z12.


Fig. 6. Motivic indicators for Schumann’s Traumerei (panels a to h: indicators x1 to x8 plotted against onset time).

Fig. 7. Schumann’s Traumerei: Tempo curves sharpened by 90th percentile of motif-curve 2 (one panel per performance: ARGERICH, ARRAU, ASKENAZE, BRENDEL, BUNIN, CAPOVA, CORTOT1, CORTOT2, CORTOT3, CURZON, DAVIES, DEMUS, ESCHENBACH, GIANOLI, HOROWITZ1, HOROWITZ2, HOROWITZ3, KATSARIS, KLIEN, KRUST, KUBALEK, MOISEIWITSCH, NEY, NOVAES, ORTIZ, SCHNABEL, SHELLEY, ZAK).

Fig. 8. Schumann’s Traumerei: Tempo clusters based on sharpened tempo (dendrogram titled “Motive-2-indicator: 90%-quantile-clustering”, with the performances of Fig. 7 as leaves).

References

BERAN, J. (2003): Statistics in Musicology. Chapman & Hall, CRC Press, Boca Raton.

BERAN, J. (1994): Statistics for long-memory processes. Chapman & Hall, London.

BERAN, J. and FENG, Y. (2002): SEMIFAR models – a semiparametric framework for modeling trends, long-range dependence and nonstationarity. Computational Statistics and Data Analysis, 40(2), 690–713.

BERAN, J. and MAZZOLA, G. (1999a): Analyzing musical structure and performance - a statistical approach. Statistical Science, 14(1), 47–79.

BERAN, J. and MAZZOLA, G. (1999b): Visualizing the relationship between two time series by hierarchical smoothing. J. Computational and Graphical Statistics, 8(2), 213–238.

BERAN, J. and MAZZOLA, G. (2000): Timing Microstructure in Schumann’s Traumerei as an Expression of Harmony, Rhythm, and Motivic Structure in Music Performance. Computers Mathematics Appl., 39(5-6), 99–130.

BERAN, J. and MAZZOLA, G. (2001): Musical composition and performance - statistical decomposition and interpretation. Student, 4(1), 13–42.

BERAN, J. and OCKER, D. (2000): Temporal Aggregation of Stationary and Nonstationary FARIMA(p,d,0) Models. CoFE Discussion Paper, No. 00/22. University of Konstanz.

MAZZOLA, G. (2002): The topos of music. Birkhauser, Basel.

TSAI, H. and CHAN, K.S. (2004): Temporal Aggregation of Stationary and Nonstationary Discrete-Time Processes. Technical Report, No. 330, University of Iowa, Statistics and Actuarial Science.

VOSS, R.F. and CLARKE, J. (1975): 1/f noise in music and speech. Nature, 258, 317–318.

Bayesian Mixed Membership Models for Soft Clustering and Classification

Elena A. Erosheva¹ and Stephen E. Fienberg²

¹ Department of Statistics, School of Social Work, Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195, U.S.A.

² Department of Statistics, Center for Automated Learning and Discovery, Center for Computer and Communications Security, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Abstract. The paper describes and applies a fully Bayesian approach to soft clustering and classification using mixed membership models. Our model structure has assumptions on four levels: population, subject, latent variable, and sampling scheme. Population level assumptions describe the general structure of the population that is common to all subjects. Subject level assumptions specify the distribution of observable responses given individual membership scores. Membership scores are usually unknown and hence we can also view them as latent variables, treating them as either fixed or random in the model. Finally, the last level of assumptions specifies the number of distinct observed characteristics and the number of replications for each characteristic. We illustrate the flexibility and utility of the general model through two applications using data from: (i) the National Long Term Care Survey where we explore types of disability; (ii) abstracts and bibliographies from articles published in The Proceedings of the National Academy of Sciences. In the first application we use a Monte Carlo Markov chain implementation for sampling from the posterior distribution. In the second application, because of the size and complexity of the data base, we use a variational approximation to the posterior. We also include a guide to other applications of mixed membership modeling.

1 Introduction

The canonical clustering problem has traditionally had the following form: for N units or objects measured on J variables, organize the units into G groups, where the nature, size, and often the number of the groups is unspecified in advance. The classification problem has a similar form except that the nature and the number of groups are either known theoretically or inferred from units in a training data set with known group assignments. In machine learning, methods for clustering and classification are referred to as involving “unsupervised” and “supervised learning” respectively. Most of these methods assume that every unit belongs to exactly one group. In this paper, we will primarily focus on clustering, although methods described can be used for both clustering and classification problems.


Some of the most commonly used clustering methods are based on hierarchical or agglomerative algorithms and do not employ distributional assumptions. Model-based clustering lets x = (x_1, x_2, . . . , x_J) be a sample of J characteristics from some underlying joint distribution, Pr(x|θ). Assuming each sample is coming from one of G groups, we estimate Pr(x|θ) indicating presence of groups or lack thereof. We represent the distribution of the gth group by Pr_g(x|θ) and then model the observed data using the mixture distribution:

$$\Pr(x \mid \theta) = \sum_{g=1}^{G} \pi_g \, \mathrm{Pr}_g(x \mid \theta), \qquad (1)$$

with parameters {θ, π_g}, and G.

The assumption that each object belongs exclusively to one of the G groups or latent classes may not hold, e.g., when characteristics sampled are individual genotypes, individual responses in an attitude survey, or words in a scientific article. In such cases, we say that objects or individuals have mixed membership and the problem involves soft clustering when the nature of groups is unknown or soft classification when the nature of groups is known through distributions Pr_g(x|θ), g = 1, . . . , G, specified in advance.
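To make the mixture distribution (1) concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the J characteristics are binary and independent within each group, so that each component Pr_g is a product of Bernoulli probabilities; the text itself leaves the component distributions general. All function names are hypothetical.

```python
from math import prod

def bernoulli_product(probs):
    """Component distribution for J independent binary characteristics.

    This particular form of Pr_g is an illustrative assumption.
    """
    def density(x):
        return prod(p if xj == 1 else 1 - p for p, xj in zip(probs, x))
    return density

def mixture_probability(x, weights, components):
    """Pr(x|theta) = sum over g of pi_g * Pr_g(x|theta), as in equation (1)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(w * f(x) for w, f in zip(weights, components))

def posterior_group_probabilities(x, weights, components):
    """Pr(group g | x) by Bayes' rule under the mixture model."""
    joint = [w * f(x) for w, f in zip(weights, components)]
    total = sum(joint)
    return [j / total for j in joint]
```

Note that the posterior probabilities returned by the last function express uncertainty about which single group a unit belongs to. This is different from mixed membership, where one unit may genuinely belong to several groups at once.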

Mixed membership models have been proposed for applications in several diverse areas. We describe six of these here:

1. NLTCS Disability Data. The National Long Term Care Survey assesses disability in the U.S. elderly population. We have been working with a 2^16 contingency table on functional disability drawing on combined data from the 1982, 1984, 1989, and 1994 waves of the survey. The dimensions of the table correspond to 6 Activities of Daily Living (ADLs), e.g., getting in/out of bed and using a toilet, and 10 Instrumental Activities of Daily Living (IADLs), e.g., managing money and taking medicine. In Section 3, we describe some of our results for the combined NLTCS data. We note that further model extensions are possible to account for the longitudinal nature of the study, e.g., via employing a powerful conditional independence assumption to accommodate a longitudinal data structure as suggested by Manton et al. (1994).

2. DSM-III-R Psychiatric Classifications. One of the earliest proposals for mixed membership models was by Woodbury et al. (1978), in the context of disease classification. Their model became known as the Grade of Membership or GoM model, and was later used by Nurnberg et al. (1999) to study the DSM-III-R typology for psychiatric patients. Their analysis involved N = 110 outpatients and used the J = 112 DSM-III-R diagnostic criteria for clustering in order to reassess the appropriateness of the “official” 12 personality disorders. One could also approach this problem as a classical classification problem but with J > N.
