Comparison of dendrograms: a multivariate approach

Download Comparison of dendrograms: a multivariate approach

Post on 22-Mar-2017

215 views

Category:

Documents

3 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>Comparison of dendrograms: a multivariate approach' </p><p>J . PODANI' A N D T . A . DICK INS ON^ Deprrrtri~c~rlt c!/'Pltrrrt Scierrc.e.s. Ur~i,~e,:sity oj' Westertr Otrtrrrio. Lorrrlorr. Ot~t . , Crrtrrrrlrr N6A -7R I </p><p>Rcccivcd April 17, 1984 </p><p>PODANI. J . , and T. A. DICKINSON. 1984. Comparison of dcnclrogl.anis: a multivariatc approach. Can. J. Bot. 62: 2765 -2778. Thrcc new descriptors of dcndrogram structure (clustcr mcnibcrship divcrgcncc. subtrcc mcmbcrship divcrgcncc. and </p><p>partition mcnibcrship divcrgcncc) have bccn proposccl which supplcmcnt cxisting ones (coplicnctic diffcrcncc and topological diffcrcncc). Thcsc cnablc multivariatc coniparisons among dcndrogranis in a manner that minimizes the drawbacks of individual dcscriptors and crnphasizcs their agrccmcnt. Thc new dcscriptors arc illustrated and comparcd with the cxisting ones by rcfcrcncc to an artificial data set. Thc multivariate comparison of dcntlrogran~s is furtlicr illustratcd using two real data sets, one phcnctic and onc cladistic. Thcsc applications demonstrate the ability of this approach to rcvcal fcaturcs of dcndrogra~ns that may bc data dcpcndcnt, rncthod dcpcndcnt. or due to the interaction of data and rncthod. and that may not bc readily apparent othcrwisc. </p><p>PODANI. J . , ct T. A. DICKINSON. 1984. Comparison of tlcndrograms: a multivariatc approach. Can. J . Bot. 62: 2765-2778. Lcs auteurs prkscntcnt trois nouvcaux dcscriptcurs dc la structure d'un dcndrogrammc: divcrgcncc dans I'appartcnancc i un </p><p>groupc ("clustcr mcnibcrship divcrgcncc"), divcrgcncc dans I'appartcnancc 2 un sous-nrbrc ("s~rbtrcc mcmbcrship divcrgcncc") ct divcrgcncc dans I'appartcnancc i une division ("partition mcmbcrship divcrgcncc"). Ccs dcscriptcurs complttcnt ccux qui existent dkji (diffkrcncc cophknktiquc. diffkrcncc topologiquc). Ils pcrmcttcnt d'cffcctucr dcs compa~.aisons cntrc dcndro- grammcs dc nianicrc B minimiscr Ics lacuncs dcs dcscriptcurs individucls ct ii acccntucr lcur concordancc. Lcs nouvcaux dcscriptcurs sont illustrks ct comparks aux ancicns h I'nidc d'un cnscmblc dc donnkcs artificicllcs. La comparnison multivarikc dcs dcndrogrammcs cst aussi illustrkc par tlcux cnscmblcs dc donnCcs rkcllcs, I'un phknktiquc ct I'nutrc cladistiquc. Ccs applications montrcnt quc notrc approchc cst capable dc rkvklcr Ics caractdristiqucs dcs dcndrograninics qui pcuvcnt dkcoulcr dcs donnkcs, ccllcs qui pcuvcnt dccoulcr dc la mkthodc ou cncorc qui pcuvcnt ttrc dues i I'intcraction dcs donnkcs ct dc la mkthodc; ccs caractkristiqucs nc scraicnt pas facilcs i~ pcrccvoir autrcmcnt. </p><p>[Traduit par Ic journal] </p><p>Introduction In numerical taxonomic st~ldies it is customary to represent </p><p>relationships among the ob.jects under study in the form of trees or dendrograms regardless of whether the approach taken is cladistic (relationships among ob.jects are described in terms of putative ancestry) or phenetic (relationships among objects are described in terms of overall similarity). It is commonly observed that different methods may produce q~li te dissirn- ilar results when applied to the same sct of data (e .g. , Sokal and Rohlf 1962; Hartigan 1975; Jardine and Sibson 197 1). This statement is relevant not only to biological taxonomy but also to any field of science whcre numerical classification is practiced. However, in biological applications of numerical taxonomy many decisions concerning methods are critical without there being hard and fast guidelines to follow. Not only must resemblance (similarity and dissimilarity) coeffi- cients and classification algorithms be chosen, but also other decisions must be made, such as: (i) selection of characters; (ii) selection of character weights and data transformations, if any; and (iii) selection of a method of character coding. More- ovkr, in cladistic analyses the results may be affected by the sequence in which the objects are analyzed (Colless 1983). The complexity of this situation may be appreciated by examining </p><p>'This paper is based on portions of thcscs by the authors. submitted in partial fulfillment of the rcquircmcnts for the Ph.D. dcgrcc at the University of Wcstcrn Ontario. 'l'hc underlying principles of the mcthod introduced and the Crtrttreglrs cxamplc wcrc prcscntcd at the 16th International Nunicrical Taxonomy Confcrcncc at tlic University of Notrc Damc. Notrc Damc, IN. U.S.A.. 22-24 October 1982 (scc Jcnscn 1983). </p><p>'Pcrmancnt addrcss: Rcscarch Institute for Botany. Hungarian Academy of Scicnccs, Vlic~itcit. Hungary H-2163. </p><p>'~rcscnt addrcss: School of Forestry, Lakchead University. Thunder Bay, Ont., Canada P7B 5EI. </p><p>the left portion of Fig. I. For example, dcndrograms may be obtained along the path (1-b-c.. If we consider that each arrow in the scheme is associated with a niultitutle of choices to be made. the number of possible dendrograms for the same set of ob-jects is too large to comprehend. Consequently. it is very important to examine how results are influenced by these choices and which stages of the analysis arc most critical for the consistency and interpretability of the classifications ob- tained. Studies of dendrograrn congruence of this kind would supplement the optimality tests recently reviewed by Rohlf and Sokal (198 1). </p><p>The effect upon dendrogram structure of any change in either algorithm or data may be evaluated by the comparison of results. The comparison of dendrograms ( 0 - D comparisons, in terms of Fig. I ) has been employed in numerical taxonomy (Rohlf 1974; Rohlf and Sokal 1981; Lachance and Starmer 1982; Hopper and Burgman 1983) but not routinely in botanical studies, according to Duncan and Baum (1981). In the most widely adopted approach each dendrogram is described in t e rms~ of an^ 11 X 11 matrix summarizing the pairwise relation- ships among the 11 objects studied in terms of some descriptor of dendrogram structure. Two dendrogralns are then compared by various matrix correlation methods (Duncan ct ul. 1980; Rohlf and Sokal 1981). Note that in this DaDer attributes of . . objects will be designated by reference to characters, whereas the attributes of dendrograms that describe relationships among objects will be designated by reference to descriptors. The distinction between the terms character and descriptor is made here solely for the sake of clarity. </p><p>Dendrogram similarity may also be expressed without comparing descriptor matrices. Examples of methods of direct dendrogram comparison are the ultrametric dissimi- larity coefficient (Dobson '1975; the terminology is ours), the cluster distortion technique (Farris 1973), versions of an edge- </p><p>Can</p><p>. J. B</p><p>ot. D</p><p>ownl</p><p>oade</p><p>d fr</p><p>om w</p><p>ww</p><p>.nrc</p><p>rese</p><p>arch</p><p>pres</p><p>s.co</p><p>m b</p><p>y U</p><p>nive</p><p>rsity</p><p> of </p><p>P.E</p><p>.I. o</p><p>n 11</p><p>/18/</p><p>14Fo</p><p>r pe</p><p>rson</p><p>al u</p><p>se o</p><p>nly.</p></li><li><p>CAN. 1. BOT. VOL. 62. 1984 </p><p>FIG. I. Scheme illustrating the pathways of numerical taxonomic studies, supplcmcntcd by the multivariate comparison of dendrograms (expanded frorn schcmata dcscribcd by Rohlf and Sokal 1981 and Duncan and Baum 198 1 ). The set of objects (0) is dcscribcd by a raw data matrix (X) from which the resemblance matrix for objccts (S) is obtained. D denotes dcndrograms derived either from the resemblance matrix or directly frorn the data. Comparison of several dendrograms lcads to their rcscmblancc matrix (SI,) and then to an ordination (Mo) or clustcring (Dl,). Lower case letters indicate pathways along which the investigator is faced with multiple choices in a conventional numerical taxonomic study. </p><p>matching coefficient (Robinson and Foulds 1979; Podani 1982), and the number of nearest-ncighbor interchanges (Robinson 1971; Waterman and Smith 1978; Jarvis et al. 1983). </p><p>Finally, dendrogram comparisons are implicit when con- sensus trees (Adams 1972) are used to summarize the informa- tion contained in alternative dendrograms (Seaman and Funk 1983; Smith and Phipps 1984). Relatcd approaches for explicit comparisons may employ consensus indices (Mickevich 1978; Rohlf 1982). Such methods are obviously inappropriate for study of the problems addressed above, since they are based on determining the departure from a standard of the consensus tree for a set of dendrograms obtained using a particular method or set of data. </p><p>It must be noted that the comparison of the resemblance matrices from which the dendrograms were derived (S-S com- parisons, Fig. I ) may appear to represent a simpler method of analysis, e.g., of the effect of character selection alternatives. However, D-D comparisons are more widely applicable. for several reasons. (i) Except for special purposes (e.g., Gilmartin 1974, 1980), taxonomists are in fact interested in the classi- fications derived from dendrograms more than they are interested in resemblance structures. (ii) Dendrograms may be generated without calculating resemblance matrices (X + D, Fig. I; character con~patibility methods (Duncan et rrl. 1980) as well as monothetic divisive clustering methods (Orloci 1978)). (iii) The correlation between resemblance matrices is not always meaningful, as in the case of that between a corre- lation matrix and a distance matrix. (iv) The effect of sorting algorithm can only be analyzed by means of D-D compari- sons. (v) It is not known at the outset whether minor changes in the resemblance structure can cause drastic changes in hier- archical relationships, or whether substantial perturbations of the resemblance structure may nevertheless lead to similar hierarchies, if the group structure is sharp. Consequently, S-S comparisons may produce misleading results with respect to the relationships among dendrograms and hence, among classifications. </p><p>Nevertheless, as shown by Douglas and Endler (1982) and by Dietz (1983), S-S comparisons may still prove useful in taxonomic and evolutionary studies, but further discussion of these methods is beyond the scope of this paper. So too is a discussion of methods concerned with comparisons of taxo- nomic classifications, as distinct from the dendrograms from which they may be derived, with respect to their predictiveness (Hawksworth et al. 1968; Duncan and Baum 1981). </p><p>All methods of D-D comparisons currently available suffer from the drawback that only one aspect of dendrogram resem- blance is taken into account (this insufficiency is discussed at </p><p>FIG. 2. Artificial dendrograms illustrating diffcrent classifications of six objccts. </p><p>length below). This fact is all the more surprising because all other stages of a numerical taxonomic study are multivariate in nature. The present paper is an attempt, therefore, to meet a long-standing need by providing a means for dendrogram com- parisons such that several features of dendrogram structure are incorporated into a single resemblance function. </p><p>Materials In addition to analyses of an artificial data set we illustrate the </p><p>method for multivariate comparison ot dendrograms dcscribcd below by means of analyscs of two rcal data sets. The latter analyses em- ployed a number of computer programs beside the one for calculating dcndrogram descriptors dcscribcd in thc Appendix. Thcsc included programs EUCD and PCAR, given by Orl6ci (1978) and the first author's programs for cluster analysis and principal coordinates anal- ysis, NCLAS and PRINCOOR, rcspcctrvcly (Podani 1980. 1984). All of thcsc ran on the DEC system 1090 installation of the University of Western Ontario Computing Centre. </p><p>Arr;fi'citrl clendrogrtrn~.~ Eighteen different dcndrograrns for six objects wcrc constructed </p><p>(Fig. 2). For thc sake of simplicity only six hierarchical lcvcls wcrc distinguishcd. Dendrograms A- P arc binary trees, while Q and R arc nonbinary oncs. Dcndrogram P differs from all of the othcrs in that it contains a rcvcrsal. Only dendrograms A - 0 arc included in the anal- ysis bclow; the other three merely illustrate certain aspects of thc terminology used in presenting the mcthod. </p><p>Most of the artificial classifications were gcncratctl by elementary transformations from arbitrarily constructed oncs. Dcndrogram A was transformed into B. C. D. 1, and K by interchanges of objects or by shifting hierarchical levcls. Dendrograms A and J have similar topo- </p><p>Can</p><p>. J. B</p><p>ot. D</p><p>ownl</p><p>oade</p><p>d fr</p><p>om w</p><p>ww</p><p>.nrc</p><p>rese</p><p>arch</p><p>pres</p><p>s.co</p><p>m b</p><p>y U</p><p>nive</p><p>rsity</p><p> of </p><p>P.E</p><p>.I. o</p><p>n 11</p><p>/18/</p><p>14Fo</p><p>r pe</p><p>rson</p><p>al u</p><p>se o</p><p>nly.</p></li><li><p>PODANl AND </p><p>FIG. 3. Flowchart illustrating the experimental design used in the analysis of Crnttrc~g~rs phenograms. The meaning of symbols 0. X, S . and D is the same as in Fig. I . The first subscript indicates the type of data. as described in Table I . The second subscript stands for the type of resemblance function used: 1, information radius; E, Euclidean distance; G. gcneralizcd distancc. The third subscript indicatcs thc type of classification strategy: S. single linkage; A, avcragc linkage; and C, complcte linkagc. Sf, represents the matrix of Euclidean dis- tanccs among phenograms calculated using thc dendrogram deserip- tors described in the text. </p><p>logies but different hierarchical levels. Dendrogram C was obtained from A by nearest-ncighbor interchange at the lowcst level. Dendro- gram A was transformed to D by the relocation of object 6. Dendro- kram B was obtained by removing object I from the "seed" of a subtree in A. Another group of dendrograms (E. F. G. and 0) demon- strates a typical series of fusions called chaining. Dendrogram F differs from E in a nearest-neighbor interchangc at a low level, while G was obtained from E by interchanging two objects at a higher level. The fusion sequence of objects in 0 is just the opposite of that in dendrogram E (chain inversion). Dendrograms H and I imply the same three-cluster classification, while M is as diffcrent frorn H and I at this level as possible. Dendrogram N was obtained from M by destroying one subtree in M. Dendrogram L represents a unique classification. apparently dissimilar to all others. </p><p>Six matrices of Euclidean distances werc calculated among dendro- grams A-0 . The first five are based each on onc of the five dendro- gram descriptors described below, while in the sixth all five are incorporated simultaneously, using function 9 givcn below. Each dis- tancematrix was subjected to principal coordinates analysis (PCoA) and minimum variance clustering (Ward 1963). Other ordination and clustering methods might equally be used; those employed in thc examples here were selected because they are commonly used and their properties are well-known. </p><p>Der~tlro,qrtrtns of' phrtretic relrrtiot~.ship.s The first of the two real data sets comes from a study of variation </p><p>in Crottrc,glr.s c.rus-grrlli L. serlslr loto in Ontario (Dickinson 1983; Dickinson and Phipps 1984. 1985). 'The ob.jective of the portion of the study described hcre was to determine the range of taxonomic struc- tures (classifications) supported by the data available. I t was of interest to see to what extent particular classifications wcrc the outcome of particular combinations of method and character suite. Sixty hawthor...</p></li></ul>