Everitt, Landau: Cluster Analysis

Post on 11-Aug-2014

TRANSCRIPT

  • Cluster Analysis, 5th Edition. Brian S. Everitt, Sabine Landau, Morven Leese and Daniel Stahl, King's College London, UK. Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organizing multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques have proven useful in a wide range of areas such as medicine, psychology, market research and bioinformatics. This 5th edition of the highly successful Cluster Analysis includes coverage of the latest developments in the field and a new chapter dealing with finite mixture models for structured data. Real-life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis. Key Features: (1) Presents a comprehensive guide to clustering techniques, with focus on the practical aspects of cluster analysis. (2) Provides a thorough revision of the fourth edition, including new developments in clustering longitudinal data and examples from bioinformatics and gene studies. (3) Updates the chapter on mixture models to include recent developments and presents a new chapter on mixture modelling for structured data. Practitioners and researchers working in cluster analysis and data analysis will benefit from this book. Wiley Series in Probability and Statistics.
  • Cluster Analysis
  • WILEY SERIES IN PROBABILITY AND STATISTICS. Established by WALTER A. SHEWHART and SAMUEL S. WILKS. Editors: David J. Balding, Noel A.C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F.M. Smith, Ruey S. Tsay, Sanford Weisberg. Editors Emeriti: Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J.B. Kadane, David G. Kendall, Jozef L. Teugels. A complete list of the titles in this series can be found on http://www.wiley.com/WileyCDA/Section/id-300611.html.
  • Cluster Analysis, 5th Edition. Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl. King's College London, UK
  • This edition first published 2011. © 2011 John Wiley & Sons, Ltd. Registered office: John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom. For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought. Library of Congress Cataloging-in-Publication Data: Everitt, Brian. Cluster Analysis / Brian S. Everitt. 5th ed. p. cm. (Wiley series in probability and statistics ; 848). Summary: This edition provides a thorough revision of the fourth edition which focuses on the practical aspects of cluster analysis and covers new methodology in terms of longitudinal data and provides examples from bioinformatics. Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. This book includes an appendix of getting started on cluster analysis using R, as well as a comprehensive and up-to-date bibliography. Provided by publisher. Includes bibliographical references and index. ISBN 978-0-470-74991-3 (hardback). 1. Cluster analysis. I. Title. QA278.E9 2011. 519.5'3-dc22. 2010037932. A catalogue record for this book is available from the British Library. Print ISBN: 978-0-470-74991-3. ePDF ISBN: 978-0-470-97780-4. oBook ISBN: 978-0-470-97781-1. ePub ISBN: 978-0-470-97844-3. Set in 10.25/12pt Times Roman by Thomson Digital, Noida, India.
  • To Joanna, Rachel, Hywel and Dafydd (Brian Everitt). To Premjit (Sabine Landau). To Peter (Morven Leese). To Charmen (Daniel Stahl).
  • Contents
    Preface
    Acknowledgement
    1 An Introduction to classification and clustering
      1.1 Introduction
      1.2 Reasons for classifying
      1.3 Numerical methods of classification: cluster analysis
      1.4 What is a cluster?
      1.5 Examples of the use of clustering
        1.5.1 Market research
        1.5.2 Astronomy
        1.5.3 Psychiatry
        1.5.4 Weather classification
        1.5.5 Archaeology
        1.5.6 Bioinformatics and genetics
      1.6 Summary
    2 Detecting clusters graphically
      2.1 Introduction
      2.2 Detecting clusters with univariate and bivariate plots of data
        2.2.1 Histograms
        2.2.2 Scatterplots
        2.2.3 Density estimation
        2.2.4 Scatterplot matrices
      2.3 Using lower-dimensional projections of multivariate data for graphical representations
        2.3.1 Principal components analysis of multivariate data
        2.3.2 Exploratory projection pursuit
        2.3.3 Multidimensional scaling
      2.4 Three-dimensional plots and trellis graphics
      2.5 Summary
  • 3 Measurement of proximity
      3.1 Introduction
      3.2 Similarity measures for categorical data
        3.2.1 Similarity measures for binary data
        3.2.2 Similarity measures for categorical data with more than two levels
      3.3 Dissimilarity and distance measures for continuous data
      3.4 Similarity measures for data containing both continuous and categorical variables
      3.5 Proximity measures for structured data
      3.6 Inter-group proximity measures
        3.6.1 Inter-group proximity derived from the proximity matrix
        3.6.2 Inter-group proximity based on group summaries for continuous data
        3.6.3 Inter-group proximity based on group summaries for categorical data
      3.7 Weighting variables
      3.8 Standardization
      3.9 Choice of proximity measure
      3.10 Summary
    4 Hierarchical clustering
      4.1 Introduction
      4.2 Agglomerative methods
        4.2.1 Illustrative examples of agglomerative methods
        4.2.2 The standard agglomerative methods
        4.2.3 Recurrence formula for agglomerative methods
        4.2.4 Problems of agglomerative hierarchical methods
        4.2.5 Empirical studies of hierarchical agglomerative methods
      4.3 Divisive methods
        4.3.1 Monothetic divisive methods
        4.3.2 Polythetic divisive methods
      4.4 Applying the hierarchical clustering process
        4.4.1 Dendrograms and other tree representations
        4.4.2 Comparing dendrograms and measuring their distortion
        4.4.3 Mathematical properties of hierarchical methods
        4.4.4 Choice of partition: the problem of the number of groups
        4.4.5 Hierarchical algorithms
        4.4.6 Methods for large data sets
      4.5 Applications of hierarchical methods
        4.5.1 Dolphin whistles: agglomerative clustering
        4.5.2 Needs of psychiatric patients: monothetic divisive clustering
        4.5.3 Globalization of cities: polythetic divisive method
  • 4.5.4 Women's life histories: divisive clustering of sequence data
        4.5.5 Composition of mammals' milk: exemplars, dendrogram seriation and choice of partition
      4.6 Summary
    5 Optimization clustering techniques
      5.1 Introduction
      5.2 Clustering criteria derived from the dissimilarity matrix
      5.3 Clustering criteria derived from continuous data
        5.3.1 Minimization of trace(W)
        5.3.2 Minimization of det(W)
        5.3.3 Maximization of trace(BW⁻¹)
        5.3.4 Properties of the clustering criteria
        5.3.5 Alternative criteria for clusters of different shapes and sizes
      5.4 Optimization algorithms
        5.4.1 Numerical example
        5.4.2 More on k-means
        5.4.3 Software implementations of optimization clustering
      5.5 Choosing the number of clusters
      5.6 Applications of optimization methods
        5.6.1 Survey of student attitudes towards video games
        5.6.2 Air pollution indicators for US cities
        5.6.3 Aesthetic judgement of painters
        5.6.4 Classification of nonspecific back pain
      5.7 Summary
    6 Finite mixture densities as models for cluster analysis
      6.1 Introduction