Preface
This volume contains revised versions of selected papers presented during the 28th Annual Conference of the Gesellschaft für Klassifikation (GfKl), the German Classification Society. The conference was held at the Universität Dortmund in Dortmund, Germany, in March 2004. Wolfgang Gaul chaired the program committee; Claus Weihs and Ernst-Erich Doberkat were the local organizers. Patrick Groenen, Iven van Mechelen, and their colleagues of the Vereniging voor Ordinatie en Classificatie (VOC), the Dutch-Flemish Classification Society, organized special VOC sessions.
The program committee recruited 17 notable and internationally renowned invited speakers for plenary and semi-plenary talks on their current research regarding classification and data analysis methods as well as applications. In addition, 172 invited and contributed papers by authors from 18 countries were presented at the conference in 52 parallel sessions representing the whole field addressed by the title of the conference, “Classification: The Ubiquitous Challenge”. Among these 52 sessions, the VOC organized sessions on Mixture Modelling, Optimal Scaling, Multiway Methods, and Psychometrics with 18 papers. Overall, the conference, which is traditionally designed as an interdisciplinary event, again provided an attractive forum for discussions and mutual exchange of knowledge.
Besides the results obtained in the fundamental subjects Classification and Data Analysis, the talks in the applied areas focused on various application topics. Moreover, along with the conference a competition on “Social Milieus in Dortmund”, co-organized by the city of Dortmund, took place. Hence the presentation of the papers in this volume is arranged in the following parts:
I. (Semi-)Plenary Presentations
II. Classification and Data Analysis
III. Applications, and
IV. Contest: Social Milieus in Dortmund.
The part on applications has sub-chapters according to the different application fields Archaeology, Astronomy, Bio-Sciences, Electronic Data and Web, Finance and Insurance, Library Science and Linguistics, Macro-Economics, Marketing, Music Science, and Quality Assurance. Within (sub-)parts, papers are mainly arranged in alphabetical order with respect to the (first) authors’ names.
I.
Plenary and semi-plenary lectures include both conceptual and applied papers. Among the conceptual papers, Erosheva and Fienberg present a fully Bayesian approach to soft clustering and classification within a general framework of mixed membership, Friendly introduces the Milestones Project on documentation and illustration of historical developments in statistical graphics, Hornik discusses consensus partitions, particularly when applied to analyze the structure of cluster ensembles, Kiers gives an overview of procedures for constructing bootstrap confidence intervals for the solutions of three-way component analysis techniques, Pahl argues that a classification framework can organize knowledge about software components’ characteristics, and Uter and Gefeller define partial attributable risk as a unique solution for allocating shares of attributable risk to risk factors. Within the applied papers, Beran presents preprocessing of musical data utilizing prior knowledge from musicology, Fischer et al. introduce a method for the prediction of spatial properties of molecules from the sequence of amino acids incorporating biological background knowledge, Grzybek et al. discuss how far word length may contribute to quantitative typology of texts, and Snoek and Worring present the Time Interval Multimedia Event framework as a robust approach for classification of semantic events in multimodal soccer video.
II.
The second part of this volume is concerned with methodological progress in classification and data analysis; the methods presented cover a variety of different aspects.
In the Classification part, more precise confidence intervals for the parameters of latent class models using the bootstrap method are proposed (Dias), as well as a method of feature selection for ensembles that significantly reduces the dimensionality of subspaces (Gatnar), and a sensitive two-stage classification system for the detection of events in spite of a noisy background in the processing of thousands of images in a few seconds (Hader and Hamprecht). Variants of bagging and boosting that make use of an ordinal response structure are discussed (Hechenbichler and Tutz), a methodology for exploring two quality aspects of cluster analyses, namely separation and homogeneity of clusters, is presented (Hennig), and a comparison of Adaboost to Arc-x(h) for different values of h in the subsampling of binary classification data is carried out (Khanchel and Limam). The method of distance-based discriminant analysis (DDA) is introduced, finding a linear transformation that optimizes an asymmetric data separability criterion via iterative majorization and the necessary number of discriminative dimensions (Kosinov et al.), an efficient hybrid methodology to obtain CHAID tree segments based on multiple dependent variables of possibly different scale types is proposed (Magidson and Vermunt), and possibilities of defining the expectation of p-dimensional intervals (Nordhoff) are described. Design of experiments is introduced into variable selection in classification (Pumplun et al.), as well as the KMC/EDAM method for classification and visualization as an alternative to Kohonen Self-Organizing Maps (Raabe et al.). A clustering of variables approach extended to situations with missing data based on different imputation methods (Sahmer et al.), a method for binary online-classification incorporating temporally distributed information (Schafer et al.), and a concept of characteristic regions and a new method, called DiSCo, to simultaneously classify and visualize data (Szepannek and Luebke) are described. The part concludes with two papers discussing multivariate Pareto Density Estimation (PDE), based on information optimality, for data sets containing clusters (Ultsch) and an extension of standard latent class or mixture models that can be used for the analysis of multilevel and repeated measures data (Vermunt and Magidson).
The part on Data Analysis starts with papers proposing a robust procedure for estimating a covariance matrix under conditional independence restrictions in graphical modelling (Becker) and a new approach to find principal curves through a multidimensional, possibly branched, data cloud (Einbeck et al.). A three-way multidimensional scaling approach developed to account for individual differences in the judgments about objects, persons or brands (Krolak-Schwerdt), and the Time Series Knowledge Mining (TSKM) framework to discover temporal structures in multivariate time series based on the Unification-based Temporal Grammar (UTG) (Morchen and Ultsch) are introduced. A framework for the comparison of the information in continuous and categorical data (Nishisato) and an external analysis of two-mode three-way asymmetric multidimensional scaling for the disclosure of asymmetry (Okada and Imaizumi) are presented. Finally, nonparametric regression with the Relevance Vector Machine under inclusion of covariate measurement error (Rummel) is described.
III.
In the third part of this volume, all contributions are also related to applications of classification and data analysis methods but are structured by their application field.
Two papers deal with applications in Archaeology. The first is a historical overview (Ihm) of early publications about formal methods for the seriation of archaeological finds; in the second, some cluster analysis models including different data transformations are investigated in order to differentiate between brickyards of different areas on the basis of chemical analysis (Mucha et al.).
Another two papers (both by Bailer-Jones) discuss applications in Astronomy. The first gives a brief overview of the upcoming Gaia astronomical survey mission, a major European project to map and classify over a billion stars in our Galaxy, and an outline of the challenges, while the second introduces a novel method based on evolutionary algorithms for designing filter systems for astronomical surveys in order to provide optimal data on stars and to determine their physical parameters.
The articles with applications in the Bio-Sciences all deal with enzyme, DNA, microarray, or protein data, except the presentation of results of a systematic and quantitative comparison of pattern recognition methods in the analysis of clinical magnetic resonance spectra applied to the detection of brain tumor (Menze et al.). The Generative Topographic Mapping approach as an alternative to SOM for the analysis of microarray data (Grimmenstein et al.) and a finite conservative test for detecting a change point in a binary sequence with Markov dependence and applications in DNA analysis (Krauth) are proposed, as well as a new algorithm for finding similar substructures in enzyme active sites with the use of emergent self-organizing neural networks (Kupas and Ultsch). It is explained how the feature selection procedure “Significance Analysis of Microarrays” (SAM) and the classification method “Prediction Analysis of Microarrays” (PAM) can be applied to “Single Nucleotide Polymorphism” (SNP) data (Schwender), and that using relative differences (RelDiff) instead of LogRatios for cDNA microarray analysis solves several problems such as unlimited ranges, numerical instability, and rounding errors (Ultsch). Finally, a novel method, PhyNav, to reconstruct the evolutionary relationship from really large DNA and protein datasets is introduced applying the maximum likelihood principle (Vinh et al.).
Among the contributions on applications to Electronic Data and Web, one paper discusses the application of clustering with restricted random walks on library usage histories in large document sets containing millions of objects (Franke and Thede). In the other four papers, different aspects of web-mining are tackled. A tool is described that assists users of online news web-sites in order to reduce information overload (Bomhardt and Gaul), benchmarks are offered with respect to competition and visibility indices as predictors for traffic in web-sites (Schmidt-Manz and Gaul), an algorithm is introduced for fuzzy two-mode clustering that outperforms collaborative filtering (Schlecht and Gaul), and visualizations of online search queries are compared to improve understanding of searching, viewing, and buying behavior of online shoppers and to further improve the generation of recommendations (Thoma and Gaul).
Two of the articles on Finance and Insurance deal with insurance problems: A strategy based on a combination of support vector regression and kernel logistic regression to detect and to model high-dimensional dependency structures in car insurance data sets is proposed (Christmann), and support vector machines are compared to traditional statistical classification procedures in a life insurance environment (Steel and Hechter). Applications in Finance deal with the evaluation of global and local statistical models for complex data sets of credit risks with respect to practical constraints and asymmetric cost functions (Schwarz and Arminger), show how linear support vector machines select informative patterns from a credit scoring data pool serving as inputs for traditional methods more familiar to practitioners (Stecking and Schebesch), analyze the question of risk budgeting in continuous time (Straßberger), and formulate a one-factor model for the correlation between probabilities of default across industry branches, comparing it to more traditional methods on the basis of insolvency rates for Germany (Weißbach and Rosenow).
Besides one contribution on Library Science, in which it is argued that the history of classification is closely linked to the history of library science (Lorenz), the volume includes five papers on applications in Linguistics. It is shown that one meta-linguistic relation suffices to model the concept structure of the lexicon making use of intensional logic (Bagheri), that improvements of the morphological segmentation of words using classical distributional methods are possible (Benden), and that in Russian texts (letters and poems by three different authors) word length is a characteristic of genre, rather than of authorship (Kelih et al.). A validation method of cluster analysis methods concerning the number and stability of clusters is described with the help of an application in linguistics (Mucha and Haimerl), clustering of word contexts is used in a large collection of texts for word sense induction, i.e. automatic discovery of the possible senses for a given ambiguous word (Rapp), and formal graphs that structure a document-related information space by using a natural language processing chain and a wrapping procedure are proposed (Rist).
There are three papers with applications in Macro-Economics, two of them dealing with the comparison of economic structures of different countries. The sensitivity of economic rankings of countries based on indicator variables is discussed (Berrer et al.), structural variables of the 25-member European Union are analyzed and patterns are found to be quite different between the 15 current and the 10 new members (Sell), while the question whether methods measuring (relative) importance of variables in the context of classification allow interpretation of individual effects of highly correlated economic predictors for the German business cycle (Enache and Weihs) is tackled in a more methods-based contribution.
Within the Marketing applications, one article shows by means of an intercultural survey (Bauer et al.) that the cyber community is not a homogeneous group since online consumers can be classified into three clusters: “risk avers doubters”, “open minded online-shoppers” and “reserved information seekers”. Two papers deal with reservation prices. A novel estimation procedure of reservation prices combining adaptive conjoint analysis with a choice task using individually adapted price scales is proposed (Breidert et al.), and an explicit evaluation of variants of conjoint analysis together with two types of data collection is described for the detection of reservation prices of product bundles applied to a seat system offered by a German car manufacturer (Stauß and Gaul).
Music Science is an application field that is present at GfKl conferences for the first time. In this volume one paper deals with time series analysis; the other five papers apply classification methods. A new algorithm structure is introduced for feature extraction from time series, its efficiency is proved, and it is illustrated by different classification tasks for audio data (Mierswa). Classification methods are used to show that the more the musical sound is unstable in time domain the more pitch bending is admitted to the musician expressing emotions by music (Fricke). Classification rules for quality classes of “sight reading” (SR) are derived (Kopiez et al.) based on indicators of piano practice, mental speed, working memory, inner hearing etc. as well as the total SR performance of 52 piano students. Classification rules are also found for digitized sounds played by different instruments based on the Hough-transform (Rover et al.). Finally, classifications of possibly overlapping drum sounds by linear support vector machines (Van Steelant et al.) and of singers and instruments into high or low musical registers only by means of timbre, i.e. after elimination of pitch information, are proposed (Weihs et al.).
Applications in Quality Assurance include one methodological paper (Jessenberger and Weihs) which proposes the use of the expected value of the so-called desirability function to assess the capability of a process. The other papers discuss different statistical aspects of a deep hole drilling process in machine building. The Lyapunov exponent is used for the discrimination between well-predictable and not-well-predictable time series with applications in quality control (Busse). Two multivariate control charts to monitor the drilling process in order to prevent chatter vibrations and to secure production with high quality are proposed (Messaoud et al.) as well as a procedure to assess the changing amplitudes of relevant frequencies over time based on the distribution of periodogram ordinates (Theis and Weihs).
IV.
The fourth part of this volume starts with an introduction to the competition on “Social Milieus in Dortmund” (Sommerer and Weihs). Moreover, the best three papers of the competition by Scheid, by Schafer and Lemm, and by Rover and Szepannek appear in this volume. We would like to thank the head of the “dortmund-project”, Udo Mager, and the head of the Fachbereich “Statistik und Wahlen” of the City of Dortmund, Ernst-Otto Sommerer, for their kind support.
The conference owed much to its sponsors (in alphabetical order)
• Deutsche Forschungsgemeinschaft (DFG), Bonn,
• dortmund-project, Dortmund,
• Fachbereich Statistik, Universität Dortmund, Dortmund,
• Landesbeauftragter für die Beziehungen zwischen den Hochschulen in NRW und den Beneluxstaaten,
• Novartis, Basel, Switzerland,
• Roche Diagnostics, Penzberg,
• SAS Deutschland, Heidelberg,
• Sonderforschungsbereich 475, Dortmund,
• Springer-Verlag, Heidelberg,
• Universität Dortmund, and
• John Wiley and Sons, Chichester, UK,
who helped in many ways. Their generous support is gratefully acknowledged.
Additionally, we wish to express our gratitude to the authors of the papers in the present volume, not only for their contributions, but also for their diligence and timely production of the final versions of their papers. Furthermore, we thank the reviewers for their careful reviews of the originally submitted papers and, in this way, for their support in selecting the best papers for this publication.
We would like to emphasize the outstanding work of Uwe Ligges and Nils Raabe, who did an excellent job in organizing the program of the conference and the refereeing process as well as in preparing the abstract booklet and this volume, respectively. We also wish to thank our colleague Prof. Dr. Ernst-Erich Doberkat, Fachbereich Informatik, University of Dortmund, for co-organizing the conference, and the Fachbereich Statistik of the University of Dortmund for all the support, in particular Anne Christmann, Dr. Daniel Enache, Isabelle Grimmenstein, Dr. Sonja Kuhnt, Edelgard Kurbis, Karsten Luebke, Dr. Constanze Pumplun, Oliver Sailer, Roland Schultze, Sibylle Sturtz, Dr. Winfried Theis, Magdalena Thone, and Dr. Heike Trautmann as well as other members and students of the Fachbereich for helping to organize the conference and making it a big success, and Alla Stankjawitschene and Dr. Stefan Dißmann from the Fachbereich Informatik for all they did in organizing all financial affairs.
Finally, we want to thank Christiane Beisel and Dr. Martina Bihn of Springer-Verlag, Heidelberg, for their support and dedication to the production of this volume.
Dortmund and Karlsruhe, April 2005
Claus Weihs, Wolfgang Gaul
Contents
Part I. (Semi-) Plenary Presentations
Classification and Data Mining in Musicology
Jan Beran
Bayesian Mixed Membership Models for Soft Clustering and Classification
Elena A. Erosheva, Stephen E. Fienberg
Predicting Protein Secondary Structure with Markov Models
Paul Fischer, Simon Larsen, Claus Thomsen
Milestones in the History of Data Visualization: A Case Study in Statistical Historiography
Michael Friendly
Quantitative Text Typology: The Impact of Word Length
Peter Grzybek, Ernst Stadlober, Emmerich Kelih, Gordana Antic
Cluster Ensembles
Kurt Hornik
Bootstrap Confidence Intervals for Three-way Component Methods
Henk A.L. Kiers
Organising the Knowledge Space for Software Components
Claus Pahl
Multimedia Pattern Recognition in Soccer Video Using Time Intervals
Cees G.M. Snoek, Marcel Worring
Quantitative Assessment of the Responsibility for the Disease Load in a Population
Wolfgang Uter, Olaf Gefeller
Part II. Classification and Data Analysis
Classification
Bootstrapping Latent Class Models
Jose G. Dias
Dimensionality of Random Subspaces
Eugeniusz Gatnar
Two-stage Classification with Automatic Feature Selection for an Industrial Application
Soren Hader, Fred A. Hamprecht
Bagging, Boosting and Ordinal Classification
Klaus Hechenbichler, Gerhard Tutz
A Method for Visual Cluster Validation
Christian Hennig
Empirical Comparison of Boosting Algorithms
Riadh Khanchel, Mohamed Limam
Iterative Majorization Approach to the Distance-based Discriminant Analysis
Serhiy Kosinov, Stephane Marchand-Maillet, Thierry Pun
An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables
Jay Magidson, Jeroen K. Vermunt
Expectation of Random Sets and the ‘Mean Values’ of Interval Data
Ole Nordhoff
Experimental Design for Variable Selection in Data Bases
Constanze Pumplun, Claus Weihs, Andrea Preusser
KMC/EDAM: A New Approach for the Visualization of K-Means Clustering Results
Nils Raabe, Karsten Luebke, Claus Weihs
Clustering of Variables with Missing Data: Application to Preference Studies
Karin Sahmer, Evelyne Vigneau, El Mostafa Qannari, Joachim Kunert
Binary On-line Classification Based on Temporally Integrated Information
Christin Schafer, Steven Lemm, Gabriel Curio
Different Subspace Classification
Gero Szepannek, Karsten Luebke
Density Estimation and Visualization for Data Containing Clusters of Unknown Structure
Alfred Ultsch
Hierarchical Mixture Models for Nested Data Structures
Jeroen K. Vermunt, Jay Magidson
Data Analysis
Iterative Proportional Scaling Based on a Robust Start Estimator
Claudia Becker
Exploring Multivariate Data Structures with Local Principal Curves
Jochen Einbeck, Gerhard Tutz, Ludger Evers
A Three-way Multidimensional Scaling Approach to the Analysis of Judgments About Persons
Sabine Krolak-Schwerdt
Discovering Temporal Knowledge in Multivariate Time Series
Fabian Morchen, Alfred Ultsch
A New Framework for Multidimensional Data Analysis
Shizuhiko Nishisato
External Analysis of Two-mode Three-way Asymmetric Multidimensional Scaling
Akinori Okada, Tadashi Imaizumi
The Relevance Vector Machine Under Covariate Measurement Error
David Rummel
Part III. Applications
Archaeology
A Contribution to the History of Seriation in Archaeology
Peter Ihm
Model-based Cluster Analysis of Roman Bricks and Tiles from Worms and Rheinzabern
Hans-Joachim Mucha, Hans-Georg Bartel, Jens Dolata
Astronomy
Astronomical Object Classification and Parameter Estimation with the Gaia Galactic Survey Satellite
Coryn A.L. Bailer-Jones
Design of Astronomical Filter Systems for Stellar Classification Using Evolutionary Algorithms
Coryn A.L. Bailer-Jones
Bio-Sciences
Analyzing Microarray Data with the Generative Topographic Mapping Approach
Isabelle M. Grimmenstein, Karsten Quast, Wolfgang Urfer
Test for a Change Point in Bernoulli Trials with Dependence
Joachim Krauth
Data Mining in Protein Binding Cavities
Katrin Kupas, Alfred Ultsch
Classification of In Vivo Magnetic Resonance Spectra
Bjorn H. Menze, Michael Wormit, Peter Bachert, Matthias Lichy, Heinz-Peter Schlemmer, Fred A. Hamprecht
Modifying Microarray Analysis Methods for Categorical Data – SAM and PAM for SNPs
Holger Schwender
Improving the Identification of Differentially Expressed Genes in cDNA Microarray Experiments
Alfred Ultsch
PhyNav: A Novel Approach to Reconstruct Large Phylogenies
Le Sy Vinh, Heiko A. Schmidt, Arndt von Haeseler
Electronic Data and Web
NewsRec, a Personal Recommendation System for News Websites
Christian Bomhardt, Wolfgang Gaul
Clustering of Large Document Sets with Restricted Random Walks on Usage Histories
Markus Franke, Anke Thede
Fuzzy Two-mode Clustering vs. Collaborative Filtering
Volker Schlecht, Wolfgang Gaul
Web Mining and Online Visibility
Nadine Schmidt-Manz, Wolfgang Gaul
Analysis of Recommender System Usage by Multidimensional Scaling
Patrick Thoma, Wolfgang Gaul
Finance and Insurance
On a Combination of Convex Risk Minimization Methods
Andreas Christmann
Credit Scoring Using Global and Local Statistical Models
Alexandra Schwarz, Gerhard Arminger
Informative Patterns for Credit Scoring Using Linear SVM
Ralf Stecking, Klaus B. Schebesch
Application of Support Vector Machines in a Life Assurance Environment
Sarel J. Steel, Gertrud K. Hechter
Continuous Market Risk Budgeting in Financial Institutions
Mario Straßberger
Smooth Correlation Estimation with Application to Portfolio Credit Risk
Rafael Weißbach, Bernd Rosenow
Library Science and Linguistics
How Many Lexical-semantic Relations are Necessary? . . . . . . . . 482Dariusch Bagheri
Automated Detection of Morphemes Using DistributionalMeasurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
Christoph Benden
Classification of Author and/or Genre? The Impact of WordLength . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Emmerich Kelih, Gordana Antic, Peter Grzybek, Ernst Stadlober
Some Historical Remarks on Library Classification – a ShortIntroduction to the Science of Library Classification . . . . . . . . . . 506
Bernd Lorenz
Automatic Validation of Hierarchical Cluster Analysis withApplication in Dialectometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
Hans-Joachim Mucha, Edgar Haimerl
Discovering the Senses of an Ambiguous Word by Clusteringits Local Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
Reinhard Rapp
Document Management and the Development of InformationSpaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
Ulfert Rist
Macro-Economics
Stochastic Ranking and the Volatility “Croissant”: A Sensitivity Analysis of Economic Rankings
Helmut Berrer, Christian Helmenstein, Wolfgang Polasek
Importance Assessment of Correlated Predictors in Business Cycles Classification
Daniel Enache, Claus Weihs
Economic Freedom in the 25-Member European Union: Insights Using Classification Tools
Clifford W. Sell
Marketing
Intercultural Consumer Classifications in E-Commerce
Hans H. Bauer, Marcus M. Neumann, Frank Huber
Reservation Price Estimation by Adaptive Conjoint Analysis
Christoph Breidert, Michael Hahsler, Lars Schmidt-Thieme
Estimating Reservation Prices for Product Bundles Based on Paired Comparison Data
Bernd Stauß, Wolfgang Gaul
Music Science
Classification of Perceived Musical Intervals
Jobst P. Fricke
In Search of Variables Distinguishing Low and High Achievers in a Music Sight Reading Task
Reinhard Kopiez, Claus Weihs, Uwe Ligges, Ji In Lee
Automatic Feature Extraction from Large Time Series
Ingo Mierswa
Identification of Musical Instruments by Means of the Hough-Transformation
Christian Rover, Frank Klefenz, Claus Weihs
Support Vector Machines for Bass and Snare Drum Recognition
Dirk Van Steelant, Koen Tanghe, Sven Degroeve, Bernard De Baets, Marc Leman, Jean-Pierre Martens
Register Classification by Timbre
Claus Weihs, Christoph Reuter, Uwe Ligges
Quality Assurance
Classification of Processes by the Lyapunov Exponent
Anja M. Busse
Desirability to Characterize Process Capability
Jutta Jessenberger, Claus Weihs
Application and Use of Multivariate Control Charts in a BTA Deep Hole Drilling Process
Amor Messaoud, Winfried Theis, Claus Weihs, Franz Hering
Determination of Relevant Frequencies and Modeling Varying Amplitudes of Harmonic Processes
Winfried Theis, Claus Weihs
Part IV. Contest: Social Milieus in Dortmund
Introduction to the Contest “Social Milieus in Dortmund”
Ernst-Otto Sommerer, Claus Weihs
Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering
Christian Rover, Gero Szepannek
Annealed k-Means Clustering and Decision Trees
Christin Schafer, Julian Laub
Correspondence Clustering of Dortmund City Districts
Stefanie Scheid
Keywords
Authors
Part I
(Semi-) Plenary Presentations
Classification and Data Mining in Musicology
Jan Beran
Department of Mathematics and Statistics, University of Konstanz, 78457 Konstanz, Germany
Abstract. Data in music are complex and highly structured. In this talk a number of descriptive and model-based methods are discussed that can be used as pre-processing devices before standard methods of classification, clustering etc. can be applied. The purpose of pre-processing is to incorporate prior knowledge in musicology and hence to filter out information that is relevant from the point of view of music theory. This is illustrated by a number of examples from classical music, including the analysis of scores and of musical performance.
1 Introduction
Mathematical considerations in music have a long tradition. The most obvious connection between mathematics and music is through physics. For instance, in ancient Greece, the Pythagoreans discovered the musical significance of simple frequency ratios such as 2/1 (octave), 3/2 (pure fifth), 4/3 (pure fourth) etc., and their relation to the length of a string. There are, however, deeper connections between mathematical and musical structures that go far beyond acoustics. Many of these can be discovered using techniques from data mining, together with a priori knowledge from music theory. The results can be used, for instance, to solve classification problems. This is illustrated in the following sections by three types of examples.
2 Music, 1/f-noise, fractals and chaos
In their celebrated – but also controversial – paper, Voss and Clarke (1975) postulated that recorded music is essentially 1/f-noise (in the spectral domain), after high frequencies have been eliminated. (The term 1/f-noise is generally used for random processes whose power spectrum is dominated by low frequencies f such that its value is proportional to 1/f.) Can we verify this statement? First, the following question needs to be asked: Which aspects of a composition does recorded music represent? Sound waves are determined not only by the selection of notes, but also by the instrumental sound itself. It turns out, however, that the sound wave of a musical instrument often resembles 1/f-noise (see e.g. Beran (2003)). Thus, if recorded music looks like 1/f-noise, this may be due to the instrument rather than a particular composition. To separate instrumental sounds from composed music, we therefore consider the score itself, in terms of pitch and onset time. The problem of superposition of notes in polyphonic music is solved by replacing chords by arpeggio chords, i.e. replacing a chord by the sequence of notes in the chord, starting with the lowest note. In order to eliminate high frequencies and to simplify the spectral density, data are aggregated by taking averages over disjoint blocks of k = 7 notes (see Beran and Ocker (2000) and Tsai and Chan (2004) for a theoretical justification). Subsequently, a semiparametric fractional model with nonparametric trend function, the so-called SEMIFAR-model (Beran and Feng (2002), also see Beran (1994)), is fitted to the aggregated series. In a SEMIFAR-model, the stochastic part has a generalized spectral density behaving at the origin like 1/f^α (where f is the frequency) with α = 2d for some −1/2 < d. Thus, 1/f-noise corresponds to d = 1/2. Figure 1 shows smoothed histograms of α for four different time periods. The results are based on 60 compositions ranging from the 13th to the 20th century. Apparently a value around α = 1 is favored in classical music up to the early romantic period (first three distributions, from above). However, this preference is less clear in the late 19th and the 20th century.
Similar investigations can be made for other characteristics of a composition. For instance, we may consider onset time gaps between the occurrences of a particular note. Figure 2 displays typical log-log-periodograms and fitted spectra, for gap series referring to the most frequent note (modulo 12). Note that, near zero, each fitted log-log-curve essentially behaves like a straight line with estimated slope −α.
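The aggregation step and the straight-line behaviour of the log-log-periodogram near zero can be illustrated with a small sketch. This is not the SEMIFAR fit used in the text: it swaps in a crude least-squares regression of the log-periodogram on log-frequency, and the function names, the periodogram normalization, and the choice of how many low frequencies to use (`n_low`) are all assumptions made for illustration.

```python
import cmath
import math


def aggregate(series, k=7):
    """Average over disjoint blocks of k values; an incomplete final block is dropped."""
    n_blocks = len(series) // k
    return [sum(series[i * k:(i + 1) * k]) / k for i in range(n_blocks)]


def periodogram(series):
    """Return (frequency, periodogram) pairs at the Fourier frequencies j/n, j = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    result = []
    for j in range(1, n // 2 + 1):
        freq = j / n
        dft = sum(v * cmath.exp(-2j * math.pi * freq * t)
                  for t, v in enumerate(centered))
        result.append((freq, abs(dft) ** 2 / (2 * math.pi * n)))
    return result


def loglog_slope(pgram, n_low):
    """Least-squares slope of log(periodogram) on log(frequency),
    using only the n_low lowest frequencies (hypothetical tuning choice)."""
    pts = [(math.log(f), math.log(p)) for f, p in pgram[:n_low] if p > 0]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

For a spectrum behaving like 1/f^α near the origin, the fitted slope is roughly −α, so a rough estimate would be `alpha_hat = -loglog_slope(periodogram(aggregate(pitches)), n_low)`, where `pitches` stands for an arpeggiated pitch sequence. The sketch omits the nonparametric trend that the SEMIFAR-model removes, so it should only be read as an illustration of the idea.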
In summary, we may say that 1/f^α-behaviour with α > 0 appears to be common for many musical parameters. The fractal parameter α = 2d may be interpreted as a summary statistic of the degree of variation and memory. From the examples here it is clear, however, that 1/f-noise is not the only, though perhaps the most frequent, type of variation.
3 Music and entropy
The fractal parameter d (or α = 2d) is a measure of randomness and coherence (memory) in the sense mentioned above. Another, in some sense more direct, measure of randomness is entropy. Consider, for instance, the distribution of notes modulo 12 and its entropy. We calculate the entropy for 148 compositions by the following composers: Anonymus (dates of birth between 1200 and 1500), Halle (1240-1287), Ockeghem (1425-1495), Arcadelt (1505-1568), Palestrina (1525-1594), Byrd (1543-1623), Dowland (1562-1626), Hassler (1564-1612), Schein (1586-1630), Purcell (1659-1695), D. Scarlatti (1660-1725), F. Couperin (1668-1733), Croft (1678-1727), Rameau (1683-1764), J.S. Bach (1685-1750), Campion (1686-1748), Haydn (1732-1809), Clementi (1752-1832), W.A. Mozart (1756-1791), Beethoven (1770-1827), Chopin (1810-1849), Schumann (1810-1856), Wagner (1813-1883), Brahms (1833-1897), Faure (1845-1924), Debussy (1862-1918), Scriabin (1872-1915), Rachmaninoff (1873-1943), Schoenberg (1874-1951), Bartok (1881-1945), Webern
Fig. 1. Distribution of −α = −2d for four different time periods (panels: up to 1700; 1700-1800; 1800-1860; after 1860).
(1883-1945), Prokoffieff (1891-1953), Messiaen (1908-1992), Takemitsu (1930-1996) and Beran (*1959). For a detailed description of how the entropy is calculated see Beran (2003). A plot of entropy against the date of birth of the composer (figure 3) reveals a positive dependence, in particular after 1400. Why that is so can be seen, at least partially, from star plots of the distributions. Figure 4 shows a random selection of star plots ranging from the 15th to the 20th century. In order to reveal more structure, the 12 note categories are ordered according to the ascending circle of fourths. The most striking feature is that for compositions that may be classified as purely tonal in a traditional sense, there is a neighborhood of 7 to 8 adjacent notes where beams are very long, and for the rest of the categories not much can be seen. The plausible reason is that in tonal music, the circle of fourths is a dominating feature that determines a lot of the structure. This is much less the case for classical music of the 20th century. With respect to entropy it means that for newer music, the (marginal) distribution of notes is much less predictable than in earlier music (see figure 3 where composers born after 1881 are marked as “20th century”, namely Prokoffieff, Messiaen, Takemitsu, Webern and Beran). Note, however, that there are also a few outliers in figure 3. Thus, the rule is not universal, and entropy may depend on the individual composer or
Fig. 2. Log-log-periodograms and fitted spectra for gap time series (axes: log(frequency) versus log(spectrum); panels, each showing the spectrum of aggregated gaps: Bach, Prelude and Fugue, WK I, No. 17, d=0.5; Rameau, Le Tambourin, d=0.5; Scarlatti, Sonata K49, d=0.56; Rachmaninoff, op. 3, No. 2, d=0.5).
even the composition. In the last millennium, music moved gradually from rather strict rules to increasing variety. It is therefore not surprising that variability increases throughout the centuries: composers simply have more choice. On the other hand, a comparison of Schumann’s entropies (which were not included in figure 3) with those of Bach points in the opposite direction (figure 5). As a cautionary remark it should also be noted that this data set is a very small, and partially unbalanced, sample from the huge number of existing compositions. For instance, Prokoffieff is included 15 times whereas many other composers of the 20th century are missing. A more systematic empirical investigation will need to be carried out to obtain more conclusive results.
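The entropy of the note distribution modulo 12, and the reordering along the circle of fourths used for the star plots, can be sketched as follows. The text refers to Beran (2003) for the exact calculation, so the plain Shannon entropy with natural logarithm used here is an assumption, as are the function names and the convention that notes are given as integer pitch numbers with C mapped to class 0.

```python
import math
from collections import Counter

# Pitch classes reached by stepping up a perfect fourth (5 semitones) at a time,
# starting from class 0 (assumed to be C).
CIRCLE_OF_FOURTHS = [(5 * i) % 12 for i in range(12)]


def pitch_class_distribution(pitches):
    """Relative frequency of each of the 12 pitch classes (pitch modulo 12)."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]


def entropy(probs):
    """Shannon entropy with natural logarithm; empty categories contribute zero."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reorder_by_fourths(probs):
    """Arrange the 12 relative frequencies along the ascending circle of fourths."""
    return [probs[pc] for pc in CIRCLE_OF_FOURTHS]
```

Under these assumptions the entropy is largest, log 12, when all twelve classes are equally frequent, and it drops when the mass concentrates on a few classes. A piece using only the seven notes of one major scale occupies seven neighbouring positions in the reordered vector, which is in line with the block of long beams described for tonal music.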
4 Score information and performance
Due to advances in music technology, performance theory is a very active area of research where statistical analysis plays an essential role. In contrast to some other branches of musicology, repeated observations and controlled experiments can be carried out. With respect to music where a score exists, the following question is essential: Which information is there in a score, and how can it be quantified? Beran and Mazzola (1999a) (also see Mazzola (2002) and Beran (2003)) propose to encode structural information of a
Fig. 3. Entropy of notes in Z12 versus date of birth (horizontal axis: date of birth; vertical axis: entropy; points labelled by composer, with the most recent composers labelled “20th century”).
score by so-called metric, harmonic and melodic weights or indicators. These curves quantify the metric, harmonic and melodic importance of a note respectively. A modified motivic indicator based on a priori knowledge about motifs in the score is defined in Beran (2003). Figure 6 shows some indicator functions corresponding to eight different motifs in Schumann’s Träumerei. These curves can be related to observed performance data by various statistical methods (see e.g. Beran (2003), Beran and Mazzola (1999b, 2000, 2001)). For instance, figure 7 displays tempo curves of different pianists after applying data sharpening with the indicator function of motif 2. Sharpening was done by considering only those onset times where the indicator curve of motif 2 is above its 90th percentile. This leads to simplified tempo curves where differences and commonalities are more visible. Also, sharpened tempo curves can be used as input for other statistical techniques, such as classification. A typical example is given in figure 8, where clustering is based on the motif-2-sharpened tempo curves in figure 7.
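The sharpening step described above, keeping only onset times where an indicator curve exceeds its 90th percentile, might be sketched as below. The percentile convention (linear interpolation), the function names, and the use of a Euclidean distance between sharpened tempo curves as input to a clustering method are assumptions; the text does not state which distance or clustering algorithm underlies figure 8.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, 0 <= q <= 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def sharpen(indicator, q=90):
    """Indices of the onset times where the indicator lies above its q-th percentile."""
    cut = percentile(indicator, q)
    return [i for i, v in enumerate(indicator) if v > cut]


def sharpened_distance(tempo_a, tempo_b, keep):
    """Euclidean distance between two tempo curves restricted to the kept onsets
    (hypothetical dissimilarity that a clustering method could take as input)."""
    return math.sqrt(sum((tempo_a[i] - tempo_b[i]) ** 2 for i in keep))
```

Here `indicator` would hold the values of one motivic indicator at the onset times of the score, and `tempo_a`, `tempo_b` the tempo values of two performances at the same onset times. Computing `sharpened_distance` for every pair of performances gives a dissimilarity matrix on which a hierarchical clustering could be run.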
Acknowledgements
I would like to thank B. Repp for providing us with the tempo measurements.
Fig. 4. Star plots of Z12-distribution, ordered according to the circle of fourths (panels: Ockeghem, Arcadelt (2), Byrd, Rameau (3), Bach (3), Scarlatti, Haydn, Mozart (2), Schumann (3), Chopin, Wagner (2), Debussy (2), Scriabin (2), Bartok (4), Messiaen (2), Prokoffieff (2), Schoenberg, Webern, Takemitsu, Beran).
Fig. 5. Boxplots of entropies for Bach (left) and Schumann (right), based on note distribution in Z12.
Fig. 6. Motivic indicators for Schumann’s Träumerei (eight panels a to h: indicators x1 to x8 plotted against onset time).
Fig. 7. Schumann’s Träumerei: Tempo curves sharpened by 90th percentile of motif-curve 2 (one panel per performance: Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot 1-3, Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz 1-3, Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak).
Fig. 8. Schumann’s Träumerei: Tempo clusters based on sharpened tempo (panel title: Motive-2-indicator: 90%-quantile-clustering; labelled by the performances of Fig. 7).
References
BERAN, J. (2003): Statistics in Musicology. Chapman & Hall, CRC Press, Boca Raton.
BERAN, J. (1994): Statistics for long-memory processes. Chapman & Hall, London.
BERAN, J. and FENG, Y. (2002): SEMIFAR models – a semiparametric framework for modeling trends, long-range dependence and nonstationarity. Computational Statistics and Data Analysis, 40(2), 690–713.
BERAN, J. and MAZZOLA, G. (1999a): Analyzing musical structure and performance - a statistical approach. Statistical Science, 14(1), 47–79.
BERAN, J. and MAZZOLA, G. (1999b): Visualizing the relationship between two time series by hierarchical smoothing. J. Computational and Graphical Statistics, 8(2), 213–238.
BERAN, J. and MAZZOLA, G. (2000): Timing Microstructure in Schumann’s Träumerei as an Expression of Harmony, Rhythm, and Motivic Structure in Music Performance. Computers Mathematics Appl., 39(5-6), 99–130.
BERAN, J. and MAZZOLA, G. (2001): Musical composition and performance - statistical decomposition and interpretation. Student, 4(1), 13–42.
BERAN, J. and OCKER, D. (2000): Temporal Aggregation of Stationary and Nonstationary FARIMA(p,d,0) Models. CoFE Discussion Paper, No. 00/22. University of Konstanz.
MAZZOLA, G. (2002): The topos of music. Birkhäuser, Basel.
TSAI, H. and CHAN, K.S. (2004): Temporal Aggregation of Stationary and Nonstationary Discrete-Time Processes. Technical Report, No. 330, University of Iowa, Statistics and Actuarial Science.
VOSS, R.F. and CLARKE, J. (1975): 1/f noise in music and speech. Nature, 258, 317–318.
Bayesian Mixed Membership Models for Soft
Clustering and Classification
Elena A. Erosheva1 and Stephen E. Fienberg2
1 Department of Statistics, School of Social Work, Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195, U.S.A.
2 Department of Statistics, Center for Automated Learning and Discovery, Center for Computer and Communications Security, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Abstract. The paper describes and applies a fully Bayesian approach to soft clustering and classification using mixed membership models. Our model structure has assumptions on four levels: population, subject, latent variable, and sampling scheme. Population level assumptions describe the general structure of the population that is common to all subjects. Subject level assumptions specify the distribution of observable responses given individual membership scores. Membership scores are usually unknown and hence we can also view them as latent variables, treating them as either fixed or random in the model. Finally, the last level of assumptions specifies the number of distinct observed characteristics and the number of replications for each characteristic. We illustrate the flexibility and utility of the general model through two applications using data from: (i) the National Long Term Care Survey where we explore types of disability; (ii) abstracts and bibliographies from articles published in The Proceedings of the National Academy of Sciences. In the first application we use a Monte Carlo Markov chain implementation for sampling from the posterior distribution. In the second application, because of the size and complexity of the data base, we use a variational approximation to the posterior. We also include a guide to other applications of mixed membership modeling.
1 Introduction
The canonical clustering problem has traditionally had the following form: for N units or objects measured on J variables, organize the units into G groups, where the nature, size, and often the number of the groups are unspecified in advance. The classification problem has a similar form except that the nature and the number of groups are either known theoretically or inferred from units in a training data set with known group assignments. In machine learning, methods for clustering and classification are referred to as involving “unsupervised” and “supervised learning” respectively. Most of these methods assume that every unit belongs to exactly one group. In this paper, we will primarily focus on clustering, although the methods described can be used for both clustering and classification problems.
Some of the most commonly used clustering methods are based on hierarchical or agglomerative algorithms and do not employ distributional assumptions. Model-based clustering lets x = (x_1, x_2, ..., x_J) be a sample of J characteristics from some underlying joint distribution, Pr(x|θ). Assuming each sample comes from one of G groups, we estimate Pr(x|θ), which indicates the presence of groups or the lack thereof. We represent the distribution of the gth group by Pr_g(x|θ) and then model the observed data using the mixture distribution:
\[ \Pr(x \mid \theta) = \sum_{g=1}^{G} \pi_g \, \mathrm{Pr}_g(x \mid \theta), \tag{1} \]
with parameters {θ, π_g}, and G.
The assumption that each object belongs exclusively to one of the G groups or latent classes may not hold, e.g., when characteristics sampled are individual genotypes, individual responses in an attitude survey, or words in a scientific article. In such cases, we say that objects or individuals have mixed membership and the problem involves soft clustering when the nature of groups is unknown or soft classification when the nature of groups is known through distributions Pr_g(x|θ), g = 1, ..., G, specified in advance.
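As a point of reference for the exclusive-membership case, the mixture distribution in equation (1) can be evaluated directly once a form for each group distribution is chosen. The sketch below assumes binary characteristics that are independent within a group (a latent class form); this choice, the function names, and the parameter layout are illustrative assumptions and are not taken from the paper.

```python
import math


def group_probability(x, probs):
    """Pr_g(x | theta) for a binary vector x, assuming independent Bernoulli
    characteristics with success probabilities probs (an assumed group model)."""
    return math.prod(p if xi == 1 else 1 - p for xi, p in zip(x, probs))


def mixture_probability(x, weights, group_probs):
    """Equation (1): sum over groups g of pi_g * Pr_g(x | theta)."""
    return sum(w * group_probability(x, probs)
               for w, probs in zip(weights, group_probs))


def posterior_membership(x, weights, group_probs):
    """Posterior probability of each single group given x under equation (1)."""
    joint = [w * group_probability(x, probs)
             for w, probs in zip(weights, group_probs)]
    total = sum(joint)
    return [j / total for j in joint]
```

Note that `posterior_membership` still assumes each unit belongs to exactly one group and only expresses uncertainty about which one. Mixed membership models replace this with membership scores that describe partial membership of a unit in several groups at once.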
Mixed membership models have been proposed for applications in several diverse areas. We describe six of these here:
1. NLTCS Disability Data. The National Long Term Care Survey assesses disability in the U.S. elderly population. We have been working with a 2^16 contingency table on functional disability drawing on combined data from the 1982, 1984, 1989, and 1994 waves of the survey. The dimensions of the table correspond to 6 Activities of Daily Living (ADLs), e.g., getting in/out of bed and using a toilet, and 10 Instrumental Activities of Daily Living (IADLs), e.g., managing money and taking medicine. In Section 3, we describe some of our results for the combined NLTCS data. We note that further model extensions are possible to account for the longitudinal nature of the study, e.g., via employing a powerful conditional independence assumption to accommodate a longitudinal data structure as suggested by Manton et al. (1994).
2. DSM-III-R Psychiatric Classifications. One of the earliest proposals for mixed membership models was by Woodbury et al. (1978), in the context of disease classification. Their model became known as the Grade of Membership or GoM model, and was later used by Nurnberg et al. (1999) to study the DSM-III-R typology for psychiatric patients. Their analysis involved N = 110 outpatients and used the J = 112 DSM-III-R diagnostic criteria for clustering in order to reassess the appropriateness of the “official” 12 personality disorders. One could also approach this problem as a classical classification problem but with J > N.