Transcript
Page 1: Comparison of dendrograms: a multivariate approach

Comparison of dendrograms: a multivariate approach'

J . PODANI' A N D T . A . DICK INS ON^ Deprrrtri~c~rlt c!/'Pltrrrt Scierrc.e.s. Ur~i,~e,:sity oj' Westertr Otrtrrrio. Lorrrlorr. Ot~t . , Crrtrrrrlrr N6A -7R I

Rcccivcd April 17, 1984

PODANI. J . , and T. A. DICKINSON. 1984. Comparison of dcnclrogl.anis: a multivariatc approach. Can. J. Bot. 62: 2765 -2778. Thrcc new descriptors of dcndrogram structure (clustcr mcnibcrship divcrgcncc. subtrcc mcmbcrship divcrgcncc. and

partition mcnibcrship divcrgcncc) have bccn proposccl which supplcmcnt cxisting ones (coplicnctic diffcrcncc and topological diffcrcncc). Thcsc cnablc multivariatc coniparisons among dcndrogranis in a manner that minimizes the drawbacks of individual dcscriptors and crnphasizcs their agrccmcnt. Thc new dcscriptors arc illustrated and comparcd with the cxisting ones by rcfcrcncc to an artificial data set. Thc multivariate comparison of dcntlrogran~s is furtlicr illustratcd using two real data sets, one phcnctic and onc cladistic. Thcsc applications demonstrate the ability of this approach to rcvcal fcaturcs of dcndrogra~ns that may bc data dcpcndcnt, rncthod dcpcndcnt. or due to the interaction of data and rncthod. and that may not bc readily apparent othcrwisc.

PODANI. J . , ct T. A. DICKINSON. 1984. Comparison of tlcndrograms: a multivariatc approach. Can. J . Bot. 62: 2765-2778. Lcs auteurs prkscntcnt trois nouvcaux dcscriptcurs dc la structure d'un dcndrogrammc: divcrgcncc dans I'appartcnancc i un

groupc ("clustcr mcnibcrship divcrgcncc"), divcrgcncc dans I'appartcnancc 2 un sous-nrbrc ("s~rbtrcc mcmbcrship divcrgcncc") ct divcrgcncc dans I'appartcnancc i une division ("partition mcmbcrship divcrgcncc"). Ccs dcscriptcurs complttcnt ccux qui existent dkji (diffkrcncc cophknktiquc. diffkrcncc topologiquc). Ils pcrmcttcnt d'cffcctucr dcs compa~.aisons cntrc dcndro- grammcs dc nianicrc B minimiscr Ics lacuncs dcs dcscriptcurs individucls ct ii acccntucr lcur concordancc. Lcs nouvcaux dcscriptcurs sont illustrks ct comparks aux ancicns h I'nidc d'un cnscmblc dc donnkcs artificicllcs. La comparnison multivarikc dcs dcndrogrammcs cst aussi illustrkc par tlcux cnscmblcs dc donnCcs rkcllcs, I'un phknktiquc ct I'nutrc cladistiquc. Ccs applications montrcnt quc notrc approchc cst capable dc rkvklcr Ics caractdristiqucs dcs dcndrograninics qui pcuvcnt dkcoulcr dcs donnkcs, ccllcs qui pcuvcnt dccoulcr dc la mkthodc ou cncorc qui pcuvcnt ttrc dues i I'intcraction dcs donnkcs ct dc la mkthodc; ccs caractkristiqucs nc scraicnt pas facilcs i~ pcrccvoir autrcmcnt.

[Traduit par Ic journal]

Introduction In numerical taxonomic st~ldies it is customary to represent

relationships among the ob.jects under study in the form of trees or dendrograms regardless of whether the approach taken is cladistic (relationships among ob.jects are described in terms of putative ancestry) or phenetic (relationships among objects are described in terms of overall similarity). It is commonly observed that different methods may produce q~li te dissirn- ilar results when applied to the same sct of data (e .g. , Sokal and Rohlf 1962; Hartigan 1975; Jardine and Sibson 197 1). This statement is relevant not only to biological taxonomy but also to any field of science whcre numerical classification is practiced. However, in biological applications of numerical taxonomy many decisions concerning methods are critical without there being hard and fast guidelines to follow. Not only must resemblance (similarity and dissimilarity) coeffi- cients and classification algorithms be chosen, but also other decisions must be made, such as: (i) selection of characters; (ii) selection of character weights and data transformations, if any; and (iii) selection of a method of character coding. More- ovkr, in cladistic analyses the results may be affected by the sequence in which the objects are analyzed (Colless 1983). The complexity of this situation may be appreciated by examining

'This paper is based on portions of thcscs by the authors. submitted in partial fulfillment of the rcquircmcnts for the Ph.D. dcgrcc at the University of Wcstcrn Ontario. 'l'hc underlying principles of the mcthod introduced and the Crtrttreglrs cxamplc wcrc prcscntcd at the 16th International Nunicrical Taxonomy Confcrcncc at tlic University of Notrc Damc. Notrc Damc, IN. U.S.A.. 22-24 October 1982 (scc Jcnscn 1983).

'Pcrmancnt addrcss: Rcscarch Institute for Botany. Hungarian Academy of Scicnccs, Vlic~itcit. Hungary H-2163.

'~rcscnt addrcss: School of Forestry, Lakchead University. Thunder Bay, Ont., Canada P7B 5EI.

the left portion of Fig. I. For example, dcndrograms may be obtained along the path (1-b-c.. If we consider that each arrow in the scheme is associated with a niultitutle of choices to be made. the number of possible dendrograms for the same set of ob-jects is too large to comprehend. Consequently. it is very important to examine how results are influenced by these choices and which stages of the analysis arc most critical for the consistency and interpretability of the classifications ob- tained. Studies of dendrograrn congruence of this kind would supplement the optimality tests recently reviewed by Rohlf and Sokal (198 1).

The effect upon dendrogram structure of any change in either algorithm or data may be evaluated by the comparison of results. The comparison of dendrograms ( 0 - D comparisons, in terms of Fig. I ) has been employed in numerical taxonomy (Rohlf 1974; Rohlf and Sokal 1981; Lachance and Starmer 1982; Hopper and Burgman 1983) but not routinely in botanical studies, according to Duncan and Baum (1981). In the most widely adopted approach each dendrogram is described in t e rms~ of an^ 11 X 11 matrix summarizing the pairwise relation- ships among the 11 objects studied in terms of some descriptor of dendrogram structure. Two dendrogralns are then compared by various matrix correlation methods (Duncan ct ul. 1980; Rohlf and Sokal 1981). Note that in this DaDer attributes of . . objects will be designated by reference to characters, whereas the attributes of dendrograms that describe relationships among objects will be designated by reference to descriptors. The distinction between the terms character and descriptor is made here solely for the sake of clarity.

Dendrogram similarity may also be expressed without comparing descriptor matrices. Examples of methods of direct dendrogram comparison are the ultrametric dissimi- larity coefficient (Dobson '1975; the terminology is ours), the cluster distortion technique (Farris 1973), versions of an edge-

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 2: Comparison of dendrograms: a multivariate approach

CAN. 1. BOT. VOL. 62. 1984

FIG. I. Scheme illustrating the pathways of numerical taxonomic studies, supplcmcntcd by the multivariate comparison of dendrograms (expanded frorn schcmata dcscribcd by Rohlf and Sokal 1981 and Duncan and Baum 198 1 ). The set of objects (0) is dcscribcd by a raw data matrix (X) from which the resemblance matrix for objccts (S) is obtained. D denotes dcndrograms derived either from the resemblance matrix or directly frorn the data. Comparison of several dendrograms lcads to their rcscmblancc matrix (SI,) and then to an ordination (Mo) or clustcring (Dl,). Lower case letters indicate pathways along which the investigator is faced with multiple choices in a conventional numerical taxonomic study.

matching coefficient (Robinson and Foulds 1979; Podani 1982), and the number of nearest-ncighbor interchanges (Robinson 1971; Waterman and Smith 1978; Jarvis et al. 1983).

Finally, dendrogram comparisons are implicit when con- sensus trees (Adams 1972) are used to summarize the informa- tion contained in alternative dendrograms (Seaman and Funk 1983; Smith and Phipps 1984). Relatcd approaches for explicit comparisons may employ consensus indices (Mickevich 1978; Rohlf 1982). Such methods are obviously inappropriate for study of the problems addressed above, since they are based on determining the departure from a standard of the consensus tree for a set of dendrograms obtained using a particular method or set of data.

It must be noted that the comparison of the resemblance matrices from which the dendrograms were derived (S-S com- parisons, Fig. I ) may appear to represent a simpler method of analysis, e.g., of the effect of character selection alternatives. However, D-D comparisons are more widely applicable. for several reasons. (i) Except for special purposes (e.g., Gilmartin 1974, 1980), taxonomists are in fact interested in the classi- fications derived from dendrograms more than they are interested in resemblance structures. (ii) Dendrograms may be generated without calculating resemblance matrices (X + D, Fig. I; character con~patibility methods (Duncan et rrl. 1980) as well as monothetic divisive clustering methods (Orloci 1978)). (iii) The correlation between resemblance matrices is not always meaningful, as in the case of that between a corre- lation matrix and a distance matrix. (iv) The effect of sorting algorithm can only be analyzed by means of D-D compari- sons. (v) It is not known at the outset whether minor changes in the resemblance structure can cause drastic changes in hier- archical relationships, or whether substantial perturbations of the resemblance structure may nevertheless lead to similar hierarchies, if the group structure is sharp. Consequently, S-S comparisons may produce misleading results with respect to the relationships among dendrograms and hence, among classifications.

Nevertheless, as shown by Douglas and Endler (1982) and by Dietz (1983), S-S comparisons may still prove useful in taxonomic and evolutionary studies, but further discussion of these methods is beyond the scope of this paper. So too is a discussion of methods concerned with comparisons of taxo- nomic classifications, as distinct from the dendrograms from which they may be derived, with respect to their predictiveness (Hawksworth et al. 1968; Duncan and Baum 1981).

All methods of D-D comparisons currently available suffer from the drawback that only one aspect of dendrogram resem- blance is taken into account (this insufficiency is discussed at

FIG. 2. Artificial dendrograms illustrating diffcrent classifications of six objccts.

length below). This fact is all the more surprising because all other stages of a numerical taxonomic study are multivariate in nature. The present paper is an attempt, therefore, to meet a long-standing need by providing a means for dendrogram com- parisons such that several features of dendrogram structure are incorporated into a single resemblance function.

Materials In addition to analyses of an artificial data set we illustrate the

method for multivariate comparison ot dendrograms dcscribcd below by means of analyscs of two rcal data sets. The latter analyses em- ployed a number of computer programs beside the one for calculating dcndrogram descriptors dcscribcd in thc Appendix. Thcsc included programs EUCD and PCAR, given by Orl6ci (1978) and the first author's programs for cluster analysis and principal coordinates anal- ysis, NCLAS and PRINCOOR, rcspcctrvcly (Podani 1980. 1984). All of thcsc ran on the DEC system 1090 installation of the University of Western Ontario Computing Centre.

Arr;fi'citrl clendrogrtrn~.~ Eighteen different dcndrograrns for six objects wcrc constructed

(Fig. 2). For thc sake of simplicity only six hierarchical lcvcls wcrc distinguishcd. Dendrograms A- P arc binary trees, while Q and R arc nonbinary oncs. Dcndrogram P differs from all of the othcrs in that it contains a rcvcrsal. Only dendrograms A - 0 arc included in the anal- ysis bclow; the other three merely illustrate certain aspects of thc terminology used in presenting the mcthod.

Most of the artificial classifications were gcncratctl by elementary transformations from arbitrarily constructed oncs. Dcndrogram A was transformed into B. C. D. 1, and K by interchanges of objects or by shifting hierarchical levcls. Dendrograms A and J have similar topo-

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 3: Comparison of dendrograms: a multivariate approach

PODANl AND

FIG. 3. Flowchart illustrating the experimental design used in the analysis of Crnttrc~g~rs phenograms. The meaning of symbols 0. X, S . and D is the same as in Fig. I . The first subscript indicates the type of data. as described in Table I . The second subscript stands for the type of resemblance function used: 1, information radius; E, Euclidean distance; G. gcneralizcd distancc. The third subscript indicatcs thc type of classification strategy: S. single linkage; A, avcragc linkage; and C, complcte linkagc. Sf, represents the matrix of Euclidean dis- tanccs among phenograms calculated using thc dendrogram deserip- tors described in the text.

logies but different hierarchical levels. Dendrogram C was obtained from A by nearest-ncighbor interchange at the lowcst level. Dendro- gram A was transformed to D by the relocation of object 6. Dendro- kram B was obtained by removing object I from the "seed" of a subtree in A. Another group of dendrograms (E. F. G. and 0) demon- strates a typical series of fusions called chaining. Dendrogram F differs from E in a nearest-neighbor interchangc at a low level, while G was obtained from E by interchanging two objects at a higher level. The fusion sequence of objects in 0 is just the opposite of that in dendrogram E (chain inversion). Dendrograms H and I imply the same three-cluster classification, while M is as diffcrent frorn H and I at this level as possible. Dendrogram N was obtained from M by destroying one subtree in M. Dendrogram L represents a unique classification. apparently dissimilar to all others.

Six matrices of Euclidean distances werc calculated among dendro- grams A-0 . The first five are based each on onc of the five dendro- gram descriptors described below, while in the sixth all five are incorporated simultaneously, using function 9 givcn below. Each dis- tancematrix was subjected to principal coordinates analysis (PCoA) and minimum variance clustering (Ward 1963). Other ordination and clustering methods might equally be used; those employed in thc examples here were selected because they are commonly used and their properties are well-known.

Der~tlro,qrtrtns of' phrtretic relrrtiot~.ship.s The first of the two real data sets comes from a study of variation

in Crottrc,glr.s c.rus-grrlli L. serlslr loto in Ontario (Dickinson 1983; Dickinson and Phipps 1984. 1985). 'The ob.jective of the portion of the study described hcre was to determine the range of taxonomic struc- tures (classifications) supported by the data available. I t was of interest to see to what extent particular classifications wcrc the outcome of particular combinations of method and character suite. Sixty hawthorn trees sampled randomly at five sites in southern Ontario were used as operational taxonomic units (OTUs). This sample comprised four taxa: three making up Crtrtcleglrs c~1r.s-~crlli L. .sc~t~.s~r ltrto (Crcrtaeglrs Sect. Cr1r.s-gerlli : C. cr1r.s-gnlli s. str., N = 32; C. fi)tltatle.sitrt~a (Spach) Steud.. N = 18; C. ?grcrnrli.s Ashe, N = 2) plus C. plrr~ctclta Jacq. (Crcrtoegus Sect. P~rtlctertar; N = 8). Further details con- cerning sampling sites and methods. and the composition of the sam- ple studied here, are given in Diekinson (1983) and Dickinson and Phipps ( 1 984. 1985).

Each Crrrtoeg~rs OTU was scored for a total of 17 characters (Dickinson 1983; Table 2 in Dickinson and Phipps 1985). For each OTU, reproductive characters were scored on 10-20 flowers and fruits, while leaf characters were scored on six to eight short shoot terminal leaves. Using the sample maxima and minima for each character, OTU means were rescaled to a (0.1) interval to ensure

TABLE I. Key to the data sets (X .) used in creating the 33 phenograms of 60 Crtrtrreglrs OTUs repre-

sented in Fig. 7

Multistate Continuous

Flowers. fruit PROJ WFL AN'TH LCAL TC AL LFR XI: PUB l WFR PUB?

Leaves X NUMSEC Y ANGSEC Z XI.

TEETH

x , xc. All" X.r

No'rti: Enlrica arc cha~~c tc rs . aoncd hy orfan systclii and hcalc ol' ~iieasurcmenl. See Dickinaon (1983) and Dickinhon ;lnd Phipps (1985) Ihr complete cxpl;~n;~tion of the ahhrcviations ;~nd corrc- spondingcharactcrh. XI = ohhcrvcd ch;~r;~cter-slalc lrcqucncy dihlri- hutions Ibr the characters in XI. plus S T A M and STYL.

"Including descriptors S T A M and STYL.

commensurability (condensation. Crovello 1968; ranging, Gower 197 Itr). Thcsc means made up the basic data matrix used in subsc- quent analyses.

'The experimental design of the study is shown in Fig. 3. In addition to the complete basic data matrix, X-r, four other data scts were assembled, each representing a subset of X.r. The characters werc divided according to whether they related to flowers and fruits or to leaves (Table I) . This division yicldcd two data matrices dcnotcd X,; (flower and fruit characters) and XI. (leaf characters). Another classi- fication was made based on whether characters were scored as con- tinuous or meristic variables. or as ordercd multistate ones (Table I) . The corresponding data matrices are denoted X(. and XM, respectively. Note that two flower characters: style number per flower (STYL) and stamcn number per flower (STAM) were omitted from X,: and Xc because of the way in which by themselves these two characters sufficed to assign all but the two OTUs of C . ?grclrltli.s to the correct taxon. Thc sixth data matrix. XI. diffcrs frorn the preccding ones as it contains observed character-state frequency distributions for all I I flowcr and fruit characters (sec Dickinson (1983) and Dickinson and Phipps ( 1985) for details).

For each of Xr, XI?. X,, Xc. and X, two distance matrices were calculated: a matrix of Euclidean distances (S . , ) and onc of Mahalanobis' generalized distances ( S . (;). Both matrices wcre pro- duced using program EUCD. Howevcr. in the latter case a R algo- rithm, principal components analysis was performed first. using the character covariance matrix. Between-OTU generalized distances were calculated from principal component scores for each OTU in- stead of from the OTU character means. In doing so. the differences bctwcen OTUs on each principal component were divided by the corresponding eigenvalue of the covariance matrix (Rohlf 1970; Orloci 1978). An eleventh resemblance matrix. of information radii among OTUs (SII). was caleulated using XI and the formulation given by Prentice ( 1979).

A total of 33 phenograms were obtained from these I I resemblance matrices using three sorting algorithms with well-known eharacter- istics (Sneath and Sokal 1973): single linkage (space-contracting), average linkage (space-neutral), and complete linkage (space- dilating). Values of the five dendrogram descriptors were calculated for each phenogram as described below, by program DENDAT (see Appcndix). These values were used to calculate a matrix of Euclidean distances among the 33 phenograms (expression 9. below). Relation- ships among the phenograms were then examined by means of min- imum variance clustering and PCoA.

Detrdrogrart~.~ c!fphjllogenctic re1atiot1.stzip.s Recently, Penny et al . (1982) published 39 dendrograms of

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 4: Comparison of dendrograms: a multivariate approach

2768 CAN. J . UOT.

Chi5 Chkz Cij = Cik I h Vhijk

c j k vj k

chi vhi

Cg,= Chh= Cii = Cjj = Ckk Vg Vh Vi Vj Vk

FIG. 4. Cophcnetic levels (C) and vertices ( V ) in a dendrogram.

I I mammalian taxa. based on evolutionary relationships deduced from thc amino acid sequences of five proteins considered singly and all together. Wc havc used their cladograms to illustrate the usefulness of th; method described below in detecting relationships between taxonomic structure ant1 data sourcc in a sample of phylogenetie trees.

Since the cladograms given by Penny cr trl. (1982) illustrate only topological relationships among the I I taxa. cophenetic differenccs among them could not be calculated. However. values of thc rc- maining four descriptors wcre obtained for each cladogram (program DENDAT) and used to calculate a matrix of Euclidean distances (expression 9. below) that was thc basis for a minimurn variancc clustering and PCoA ordination of the cladograms. Although Penny

trl. (1982) published a total of39 cladograms. two ofthcm (numbers 8 and 38) arc identical. 'The analyses described hcre were carricd out using only thc 38 diffcrcnt cladograms.

Method Dcndrograms can be treated as individut~l ob.jects. Conscqucntly.

dissimilarities among them can bc dcfined and a sct of dcndrograms may bc sub.jcctcd to ordination and clustering in much thc same way as any othcr group of cntities (scc c .g . , Schnell 1970: Phipps 1971 ). Dendrograms. however. arc objects whosc adequatc description rc- quires knowlcdgc of basic graph theoretical terms. Accordingly a brief summary of the relevant terminology is given prior to discussion of the methodological dctails.

Defit~i1iot1.s Let S = {st) be a set o f , j = 1 . . .,I objccts to be subjected to hier-

archical classification. Thc resulting dcndrogram is a connccted, un- directed graph in which cach tcrnlinal vertex r*, rcprcsents object s, (Fig. 4). The terminal vcrticcs are conncctcd by cdgcs through intcrior vertices. Thc degrce of a vertex is the number of edges bclonging to it, so that terminal vertices arc of degrce one.

A dendrogram is an unrooted binary trce if all interior vcrticcs havc degrec three. Examples are some evolutionary trees (c.g., Watcrman and Smith 1978). Numerical taxonomic studics arc frequently con- cerncd with thc construction of rooted trees. howcvcr. The root is usually a degree two vertcx corresponding to the last fusion (or first division) point in the clustering process. A rootcd binary trec or dendrogram has II - I interior vertices and 211 - 2 edgcs. The num- ber of interior edges is 11 - 2. In many cascs the degree of interior vertices and the root is higher than three or two. respectively. Such nonbinary dcndrograms are obtained if the fusion of rnorc than two clustcrs at one time is allowed, or in the case of divisive algorithms, if clusters are allowcd to split simultaneously into more than two groups. Additionally, a nonbinary tree will rcsult if a divisive clus- tering process is arrestcd at a spccified tcrmination point. Examples of nonbinary trees are dendrograms Q and R in Fig. 2.

There is a charactcristic numbcr associated with cach intcrior vcrtcx idcntifying thc hicrarchical lcvcl at which thc corresponding groups wcrc fuscd or dividcd. Thc lowcst hicrarchical Ievcl. that of thc tcrminal vcrticcs. is thc similarity of ob.jccts to thenlsclvcs. Bascd on these hicrarchical lcvcls thc trcc structurc of a dcndrogram may bc complctcly dcscribetl by an 11 X 11 matrix (cf. Hartigan 1967) herc dcsignatcd C . An clcmcnt of this matrix. (.,A. rcprcscnts the hier- archical level assigncd to the interior vertex in Fig. 2) closest to the root along the path from v, to v , . This value measures the eo- phcnetic difference (Sokal and Rohlf 1962). designatetl CD. In most cases c,, satisfies the criteria for being an ultrametric (Jardine and Sibson 1968). That is. for any i. , j . and k ( i f , j f k . i = I . . . r z ; j = 1 . . .,I; k = I . . .11) the following rclation holtls:

[ I ] c,, 5 max{c.,,. c,,)

Using other symbols, any triplet ot vertices can be labclled p, (1, and r so that

These relations may not hold true for all triplets if. for instance, ccntroid sorting algorithms are used and reversals occur in the dendro- gram, as illustrated by tlendrogram P in Fig. 2.

The renloval of an edge e from a dendrogram G I creates two subtrecs, G; and GI;. This operation divides the set of ob.jects into two subscts and represents the partition PI = {S;. SY) of S I . Let G I and G2 denote two dcndrograms to be conlpared. and fi~rther. let e l l and eZI be. respectively. edges of G I and G2. oI I ant1 P Z I i1l.e said to bc matchcd if thc partitions PI and PZ generated by their dcletion are equivalent. An cxamplc is shown in Fig. 2. whcrc cdgcs r,:., and el:, arc matchcd. In this rcspect interior edges have specific importance sincc thc terminal edgcs are necessarily matched for any G on S.

Finally. the minimum difference bctwecn two dcndrograms is defined in terms of trcc topology. If G I is a binary trcc. the mini- mum stcp ncccssary to obtain a topologically diffcrcnt tree G2 is the change of thc positions of two vertices that are two intcrior vcrticcs apart. For insts~ncc. the changc of and v , in clcndrogram A will result in dendrogram C (Fig. 2). This transformation is termed nearcst- neighbor interchange (Waterman and Smith 1978) or crossovcr (Robinson 197 1 ).

De.s(.riplor ~w. i ( rh le~ The most common approach to dcntlrogram conlparisons is to tlc-

finc a dcscriptor variablc cxpressing thc relative position of two ob- .jccts in thc hierarchy. Thesc valucs are writtcn in a rr X 11 symmetric matrix, hereaftcr called a descriptor matrix. Two dcndrograms arc thcn compared using any meaningful function bascd on thc rcspectivc elements of the corresponding n~atrices.

The first definition for descriptor variables was the conccpt of thc cophenetic difference (CD) describcd abovc. Thc product-moment correlation between C , and C Z was thcn calculatcd to measure lincar relationship betwecn G I and GI. Other coefficients such as rank correlations and stress functions arc also uscd (Jackson 1969; Cunningham and Ogilvic 1972). One advantage of this approach is that the dendrograms can be compared with thc original rcsemblance matrices, and the distortion implicd by thc dendrogram can bc mea- sured. However, the method has limitcd applicability sincc com- parison of the C matrices to the original rcsemblance matriccs is often meaningless, if not impossible (e.g. . in the casc of minimum variance clustering). Morcover. sincc only hicrarchical lcvcls arc considered, dendrograms with diffcrcnt topologies may prove as similar or more so than two which imply the samc hierarchy. For instance. consider dendrograms A. C. and J (Fig. 2): r,,,. = 0.976 but r,,] = 0.925, although dendrogl.ams A and J differ only in hierarchical levcls. If a distance coefficicnt is uscd the diffcrencc may bc cvcn grcatcr.

Another dcscriptor was suggested independently by scveral authors (Farris 1969; Phipps 1971; Williams and Clifford 1971 ). Instead of considcring Icvels. the vertices (or edges, whosc number will bc one greater) are countcd along the paths between all possible pairs of terminal vertices. This number is called the cladistic difference (Farris

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 5: Comparison of dendrograms: a multivariate approach

PODANI A N D DICKINSON

1969) or topological distance (Phipps 197 1 ) and is denoted by t,, for v, and v,. Since the use of the term cladistic is rcstrictcd to cladograms and thc usc of thc term distance may cause confusion with intcrvcrtex and interdendrogram distanccs, t,, will be called the topological differ- ence (TD). The descriptor matrix T for the dendrogram of Fig. 4 is given by

The topological difference is not an ultrametrlc, and there 1s not a one-to-one relationship bctwccn TD and the hierarchic classification (Rohlf and Sokal 1981). Two matriccs may be comparcd clement by clcmcnt using an appropriate resemblance function. The Manhattan metric, Euclidean distance, or the correlation coefficient arc most often used. I t is clear that a low topolog~cal difference docs not necessarily imply high similarity, and vice-versa. Another potentially undesirable property of topological difference is that a ncarest- neighbor interchange affects 211 - 4 values in the dcscriptor matrix. Comparisons employing TD cannot be made when the number of intcrior verticcs in G I and G2 is not the samc (i .c. , when one or both arc nonbinary).

Three new, alternative dcscriptor variables will now be introduced. Rohlf (1974) pointed out that one drawback of TD is that the location of thc root has little effect on dcndrogram resemblance. A morc fundamental problem with this dcscriptor is that i t gives equal wcight to interior verticcs. Clearly, the root should be the most weighted vertcx since i t represents the highest hicrarchical level. ~e r t i ccsc losc to the root should be given less wcight. and so on. Direct weighting using CDs would result in an improved dcscriptor. but one which would largely ignore topological relationships. However. the tree structurc will be exactly preserved if cach vertex is weighted according to the numbcr of objccts that are fused into the samc cluster at thc corresponding lcvcl and instcad of counting all vertices along the path between v, and v,, only the vertex with the largest wcight is consid- ered. Therefore. if W denotes the dcscriptor matrix to be defined, w,, will simply cqual the number of objccts in the smallcst cluster which contains both s, and s,. For the dendrogram in Fig. 4 matrix W will be given by

g h i j k 5 1 5 5 5 5 h 1 2 4 4

[4] W = i 1 4 4 .i 1 2 k I

The advantagc of this strategy over TD is clear from thc following cxamplc. If we consider objccts s,. s, and s, in the dcndrogram in Fig. 4 (verticcs v,, v,, and v,). the corrcsponding measures arc t,, = 3, t , ~ = 3, t,, = 3. w,, = 5, ,v,, = 5. and IV,, = 4. That is, this ncw dcscriptor takes into account the fact that cih < c,, and c.;, < (;,. whereas TD docs not. The "differencc" between s, and s, is less than w,, and I~I,,,, as rcquired. The hierarchical levels are not preserved because w,; = w,, but el,; < c,,. This cxamplc suggests that thc new descriptor is intermediate in its behavior between CD and TD. This is not to say that w,, is a better dcscriptor than thc othcrs in all circumstances. but i t docs have some properties that may provc advantageous. 'Thcsc include the following: (a ) w,, - I is ultra- metric under all conditions cvcn if reversals arc prcscnt and W has an exact trcc structurc; (I)) a nearest-neighbor interchangc affects only 2/11 - 4 values in W, wherc rn is thc wcight of the morc wcighted vertex of the interchangc in question; and (c.) i t follows from (h) that a nearest-neighbor interchangc near the terminal verticcs has less effect on dendrogram similarity than docs a change close to the root. The descriptor w,, will bc called thc cl~rster rilerilber.ship divergence (CMD).

The next descriptor suggested utilizes thc propcrty that any dcndrogram implics a scries of partitions. For cxamplc, thc dcndro- gram in Fig. 4 represents thc following sequcncc of nonhicrarchical classifications:

Excluding the trivial casc of one-group classification. but kceping the division into single objects, it is seen that thc numbcr of diffcrcnt partitions is 11 - I. However, this value is only a theoretical max- imum not reached by trces in which two or morc hicrarchical levels arc equivalent (as in dcndrograms A,B.C,D.H.I.J.K,L.M. and N in Fig. 2). The relative position of two objccts in the hierarchy can be expressed in terms of the numbcr of partitions in which thcy arc not assigncd together to a group. This value will be termed pcrrtition /ne/nhe,:ship tlivergence (PMD) and denoted by h,, for vj and v,. The dcscriptor matrix B for the dcndrogram in Fig. 4 will bc

g h i j k g o 4 4 4 4 h 0 1 3 3

[5] B = i 0 3 3 .i 0 2 k 0

Comparison with matrix W [4] immediately shows that in this cxam- ple there is only one substantial difference between the two dcscrip- tors. namely hl,, < /),A, whereas w,,, = w;,. That is. PMD prcscrvcs the ordering of hicrarchical levels although i t is a completely topological descriptor. PMD and CMD arc similar in that a nearest-neighbor interchangc affects only 2/71 - 4 values in the matrix (see above). PMD has the furthcr advantagc that the magnitude of the change of thcsc values depends on the numbcr of hicrarchical lcvels falling between the two lcvcls on which the interchange took place. In this way the hierarchical structurc is more explicit in PMD than in CMD. However, h,, is ultrametric only if there are no rcvcrsals in the dcndro- gram. A morc serious difficulty is that the comparison of G I and GZ is meaningless if the numbcr of hicrarchical levels. p l and pa. is diffcrcnt in G , and G2. A possible remcdy is to consider only the first p ' most important lcvcls or m i n ~ l r p 2 } upper levels in both dcndro- grams being comparcd. and ignore the othcrs. Fortunately. for quantitative data the chance for two levels to be equal is negligible, although in the casc of binary data that chance is rather highcr.

Thc third dcscriptor to be introduced characterizes dcndrograms on the basis of their interior graph structurc. In a binary trcc there arc 11 - 1 subtrecs (including the whole dcndrogram) which correspond to the 11 - 1 interior vcrticcs. Each of thcsc subtrecs rcprcscnts a hicr- archical classification of a subset of objccts. The position of two objects in thc hicrarchy relative to the others can bc described as thc numbcr of thcsc subclassifications in which thcy do not occur to- gether. This quantity will be called s~rOtree /rie/r~be,:ship clir,erger~ce (SMD) and denoted by I,, for v, and \)A. Matrix L for the dcndrogram in Fig. 4 clarifies the meaning of this dcscriptor.

g h i j k g 3 3 3 3 3 h 1 1 2 2

[6] L = i 1 2 2 .I I I k I

As seen, SMD differs fundamentally from the othcr four descriptors in that thc values of the main diagonal in L arc not ncccssarily equal. This is due to the fact that thc terminal vcrticcs arc not equidistant from the root. If t':, denotes the numbcr of edges between \I, and the root, I,, will be given by

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 6: Comparison of dendrograms: a multivariate approach

2770 CAN. J . BOT. \

FIG. 5. Artificial dcndrograms A - 0 of Fig. 2 in the space defined by the first two principal coordinate axes derived from matrices of dendrogram descriptors: (tr) topological difference; (17) cophcnctic differencc; (c) cluster membership divergence; ( d ) subtrec mcmbcr- ship divergcncc; (e) partition membership divergence; and ( / ) Eucli- dean distances calculated from all five descriptors. Clusters existing in the first threc dimensions are outlined. In Figs. 50-5f'. the cigen- values corresponding to the first three axes account for at least 67% of the sum of all 15 eigenvalucs.

That is, i f , , influences the actual value of the diagonal clement. This property may or may not be considered in defining a resemblance between two trees. I t may also be seen that the subtree membership divergcncc between v, and V L is

where t;, is the number of edges between the interior vertcx closest to the root along the path from v, and v L and the root itself. Owing to the inequality of the diagonal elements of L, SMD is not ultrametric, although all nondiagonal elements would satisfy the requirements even if reversals occur. For this reason, the complete tree structure is easily recovered from the L matrix. The use of Euclidean distance is suggested to compare G I and G2. provided that both dendrograms are binary. For nonbinary trees this descriptor becomes less efficient as the number of high-degree (>3) interior vertices increases.

Multiple t~otnpnri.son.s bused otl the combitled tlistnnce hetweet1 dendrogrtrms

The comparison of two dendrograms using the corresponding dc- scriptor matrices is univariate even though 17' - 11 paired elements are involved in the calculation of dendrogram resemblance. A cophenctic

correlation coefficient reflects similarity only in terms of CD. a dis- tance calculated from TDs is affected only by spccific topological relationships, and so forth. Consequently. the analysis (scaling or clustering) of a k x k matrix of rescmblancc coefficients among k dendrograms will also be univariatc in this sense. An analytical method may be sought which incorporates as many aspects of dcndro- gram description as possible. so as to make the comparison really multivariate, and hence more rcliablc. This goal may be accomplished by involving all five descriptor variables in thc definition of a distance among dcndrograms.

To avoid excessive differenccs in the importance of descriptor vari- ables, some manipulations of the data are ncccssary. CDs arc rcscalcd for cach dendrogram to fall within the interval {O. I}. The CMD values are readily standardized if divided by 11. The minimum number of partitions, p,,,,,,, in the set of dendrograms being analyzed is found. Then. only the upper p,,,,,, partitions arc considered for cach dendro- gram in calculating the PMD. Division by p,,,,,, results in a {0,1} range for this descriptor. SMD and TD are standardized by the global maxima. The maximum for the SMD is determined by counting all terminal vertex-root topological differenccs. finding the minimum and subtracting i t From 11. It may be shown that the maximum of the topological differenccs pertains to the terminal vertex whose SMD is the minimum. This fact considerably simplifies the computations.

Having standardized the descriptor variables, the combined Euclidean distance between dcndrograms G I and G2 is computed according to

where cach state of x(tr) represents a descriptor. The distance matrix D of k dcndrograms (S,,. in terms of Fig. I ) is subjected to ordination or clustering. The results of these analyses may yield information useful in interpreting the dcndrograms, whether or not discontinuities exist among alternative dcndrograms.

Results and discussion Artificicrl dendrogmms

The corresponding ordinations (Fig. 5) and classifications (Fig. 6) of the 15 artificial dendrograms are in good agreement with each other, although in a few cases ordination groups outlined in the scattergrams have been changed in the classi- fication. The most striking example of this is group {A,C,D,J} of Fig. 5 0 (TD), from which object C has been removed and placed in a distinct cluster in Fig. 60. In the remaining com- parisons such drastic differences are not observed

As expected, there are both similarities and substantial dif- ferences among the performances of the five dendrogram descriptors. These may be summarized as follows. (i) The most consistently recognized group is {A,C,J,K}, with the exception of T D (Figs. 50, and 60) as mentioned above. This suggests that nearest-neighbor interchanges affect all five descriptors considered simultaneously and all but one (TD) of the individ- ual descriptors in a similar way. (ii) The uniqueness of T D is also obvious if another aspect of dendrogram similarity is con- sidered. The relative closeness of dendrograms E and 0 in ordination (Fig. 50) and classification (Fig. 60) demonstrates that T D is insensitive to chain inversions, unlike the four other descriptors. (iii) SMD also produced some obviously unaccept- able groups. Dendrograms 1 and L, although quite dissimilar, have been assigned together in Figs. 5e and 6c. Also, H and I, which differ at the two-group level but imply similar three- cluster classifications, fall far apart (Figs. 5 e and 6e). (iv) The position of A and J in Figs. 5b and 6b demonstrates the prob- lem of hierarchic levels in cophenetic comparisons. Although A and J imply the same hierarchy, these are not the most similar

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 7: Comparison of dendrograms: a multivariate approach

PODANI A N D DICKINSON

A J D K

FIG. 6. Minimum variance cluster analysis of artificial dendrograms A - 0 of Fig. 2 based on: ( ( 1 ) topological difference; (b) cophenetic difference; ( c ) cluster membership divergence; ( d ) subtree membership divergence; (e) partition mernbcrship divergence; and ( J ) Euclidean distances calculated from all five descriptors.

dendrograms according to CD. (v) Three descriptors, CD, PMD are in best agreement with one another and with the CMD, and PMD yielded fairly consistent classifications at combined ordination (Figs. 5b-5d, 5f'). Despite the diffi- the four-group level: {A,C,D,J,K}, {M,N,O}, {B,H,I}, and culty with hierarchic levels, in this data set the CD-based {E,F,G} (Figs. 6b-64. The affiliation of L in these classifi- ordination suggests exactly the same group structure as the cations was ambiguous. Not surprisingly, this grouping also combined ordination (Figs. 50, 5f'). (vii) It may be concluded arose when all descriptors were considered simultaneously that the drawbacks of individual descriptors are diminished, (Fig. 6f ) . (vi) Similarly, ordinations based on CD, CMD, and whereas agreement among them is emphasized, if the various

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 8: Comparison of dendrograms: a multivariate approach

CAN. J. BOT. VOL. 62. I984

0 P * . A * .

dll,b,?d A m o o o m . t o P

&!U A q P A P 11

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 9: Comparison of dendrograms: a multivariate approach

PODANI A N D DICKINSON 2773

aspects of dendrogram comparisons are considered simulta- senting 15% of the sum of all 33 eigenvalues) relates to the neously. The relative arrangement and grouping of dendro- effect of sorting algorithm on the hierarchy of groups in a grams A - 0 in Figs. 5j'and 6f agree with intuitive ideas about phenogram. Thus the phenograms produced by single linkage their interrelationships. clustering have high scores on axis 11 (Fig. 70) and tend to

distinguish C. c r ~ ~ s - g ~ ~ l l i S. ~ t r . at one site in particular (solid Pherzogr~lms

The alternative Crat~leglis data sets used here (Table I) varied in the extent to which they produced phenograms con- gruent with an a priori classification of the sample according to site and taxon (Figs. 7d-7h). In general, the flower and fruit (XF) and the multistate (XM) characters tended to vary consid- erably between both sites and taxa while varying much less within these groups (Dickinson 1983; Dickinson and Phipps 1985). Conversely, the continuous (Xc) and leaf (XL) charac- ters were much more variable within groups, and so differ- entiated those groups much less sharply (Dickinson 1983; Dickinson and Phipps 1984, 1985). Multivariate comparisons of phenograms derived from these data sets enabled visual- ization of the way in which differences among resemblance

stars, Figs. 7d-7h) from the remaining samples of this taxon and C. fontanesiana. The corresponding average linkage phe- nograms present a simpler characterization of the sample of OTUs in which the samples from each site are grouped uni- formly by taxon (Fig. 7d). Complete linkage clustering fre- quently imposed a hierarchy corresponding to differences in style and stamen numbers, at the highest level (Fig. 7e).

The third PCoA axis (not shown; 9% of the sum of all 33 eigenvalues) corresponds to a contrast between the three phenograms based on generalized distances calculated for all 17 characters (DTCS, DTGA, and DTGc) and the six based on both Euclidean and generalized distances calculated for the six leaf characters (DLES, DI.EA, DLEC, DLGS, Dl.<;A, and DLCC). In the space defined by the second and third PCoA axes cluster I

coefficients and sorting algorithms were superimposed on these forms the most dense group of phenograms, close to the origins contrasts among data sets (Figs. ~ L I - 7 ~ ) . of the two axes. In the same space clusters 11, 111, and 1V are

Calculation of Euclidean distances among the 33 Crataegus increasingly spread out. This spreading, or lack of congruence, phenograms based on five dendrogram descriptors (SI,, Fig. 3), corresponds to increased variation in the composition and dis- and minimum variance clustering, led to recognition of four tinctness of groups, and in their hierarchical relationships. It is clusters of phenograms (I-IV). In the phenograrns of the first indicative of the decreasing stability of classifications as char- cluster all or virtually all OTUs were sorted according to site acter set, resemblance function and sorting algorithm con- and taxon (Figs. 7d and 7e). In contrast, for the phenograms in the other three clusters, especially those of cluster IV, the sorting of OTUs was much less precise (compare Figs. 7f-7h with Fig. 7tl and 7e). Sorting of OTUs according to species and especially section (Crus-galli versus Punctatae) was gen- erally more successful (Figs. 7d-7h).

The first PCoA axis (associated with an eigenvalue repre- senting 26% of the sum of all 33 eigenvalues of the distance matrix SD) corresponds to a trend in the sample from pheno- grams showing very similar, sharply defined groups (cluster l , Fig. 7a) to those in which groups are less well defined and of more variable composition, as a result either of chaining (single linkage phenograms, cluster 1V; Fig. 70) or of the difference between Euclidean and generalized distances (Figs. 70 and 7c). With Euclidean distances the sharpness of group structure may be enhanced by the pattern of correlations among the characters used. Generalized distances, however, are calcu-

tribute to weakening group structure. In the space defined by the first two PCoA axes (Figs. 7a-7c) this weakening can be seen to be a function of not only continuous and leaf characters, generalized distances and the space-contracting effect of single linkage clustering but also the interaction among these factors.

The way in which the flower and fruit data gave rise to phenograms that reproduced the LI priori classification of the sample is not surprising, in view of the heavy use made of flower and fruit characters in Crat~legus taxonomy (Phipps and Muniyamma 1980; Dickinson and Phipps 1985). Conversely, the variability of leaf characters, or of exclusively continuous ones, makes group recognition based on these data sets more dependent on choices of method.

Comparison of the results of using different sorting algo- rithms here and elsewhere (Dickinson 1983; Dickinson and Phipps 1985) also indicates some possible drawbacks associ- ated with space-dilating methods such as complete linkage and

lated in a manner independent of character correlations (here, minimum variance. These methods tend to form a relatively from scores on mutually uncorrelated principal component small number of groups the distinctness of which may be exag- axes). Phenograms produced from such distances thus may gerated, if comparison is made with ordinations or other infor- show correspondingly weaker group differentiation, as in the mation (Sinnott and Phipps 1983). Moreover, to the extent that present case. OTUs are forced into a small number of perhaps artificially

The second PCoA axis (associated with an eigenvalue repre- distinct groups, these groups may be more heterogeneous

FIG. 7. Multivariate comparison of Croinegrrs phenograms. (o-c) Thirty-three phenograms of 60 Crrlmegus OTUs in the space defined by the first two ~r inci~al coordinate axes for the matrix of Euclidean distances calculated from five dendro~ram descri~tors. Shading indicates

1 . - clusters of dendrograms (I-1V) produced by minimum variance clustering using the same distances. Resemblance functions are indicated by the symbols used: information radius, +; Euclidean distance, 0; and generalized distance, 0. In (a), solid lines enclose dendrograms produced by single linkage (S), average linkage (A), and complete linkage (C) clustering. In (b), solid lines enclose dendrograms based on Euclidean distances calculated from the data sets described in Fig. 3 and Table I : all 17 characters (T); nine flower and fruit characters (F); six leaf characters (L); seven multistate characters (M); and eight continuous characters (C). In (c), solid lines enclose dendrograms based on generalized distances calculated from the data sets described in Fig. 3 and Table I ; abbreviations as in (b). In Figs. 5d-5h. selected dendrograms corresponding to the points in Figs. 50-5c labelled d, e, j; g, and h: (d) flower and fruit characters, Euclidean distance, average linkage (D,;,,); (e) information radii, complete linkage (Dl,,); (J ' ) flower and fruit characters, generalized distance. single linkage (D,;(;S); (g) continuous characters, Euclidean distance, complete linkage (DcEc); and (h) leaf characters, Euclidean distance, single linkage (D,,,). In Figs. 7d-7h terminal vertices represent OTUs whose taxonomic affiliation is indicated by the associated symbols: C. crrrs-gnlli s. sir-.. solid symbols; C.Joniunesinnn. open circles and squares; C. ?grandis, open stars; and C. punctom, the letter P (one site only). Symbol shape (O , 0 . A , .it) identifies four collecting sites. See Dickinson (1983) and Dickinson and Phipps (1985) for complete details concerning sampling and character sets used.

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 10: Comparison of dendrograms: a multivariate approach

2774 CAN. J . BOT. \

than is desirable. This is illustrated by the way in which 10-stamen Crataegus OTUs (C. c r~ i s -g~~l l i s. str.) were dis- tinguished from 20-stamen ones (C. forlt~lnesi~lna, C . ?grandis and C. p~l~zct~ita) in some complete linkage phenograms (Fig. 7e).

Finally, the degree of congruence among the phenograms based on flower and fruit characters, regardless of whether based on Euclidean (cluster 1; Figs. 70 and 7d) or generalized distances (cluster 11; Figs. 7c nd 7 f ) , suggests that these char- acteristics are unlikely to be controlled by only a few pleio- tropic genes (Atchley et al. 1982). Instead, there could be an appreciable genetic component to the phenotypic differenti- ation observed among sites in taxa belonging to C. crus-galli s. 1. (Dickinson 1983; Dickinson and Phipps 1985).

Cladograms Penny et al. (1982) argued that the congruence of evo-

lutionary trees (cladograms) constructed from the amino acid sequences of different proteins by maximum parsimony methods (implying minimum evolution) for a common set of taxa or evolutionary units (EUs) should be a criterion by which the theory of evolution could be tested. They proposed that the degree of congruence of cladograms so constructed should be compared with that expected if the cladograms were selected at random. For nine different values of the measure of congru- ence that they used (Robinson and Foulds 1979), Penny et al. (1982) were able to calculate expected frequencies, and to compare these with the observed frequencies in a sample of 39 minimal and near minimal (maximally parsimonious) clado- grams for I I mammals. Penny et al. (1982) found that the degree of congruence among the 38 different cladograms was considerably greater than expected for randomly generated cla- dograms. They concluded that "the different protein sequences give trees that are markedly similar, showing a relationship among them that is consistent with the theory of evolution."

In the present study single, average and complete linkage cluster analyses grouped all or most of the 38 different clado- grams produced by Penny et al. into three to five principal groups in the same way as did minimum variance clustering (clusters I-V, Fig. 8), in each case based on Euclidean dis- tances calculated from four dendrogram descriptors. 'The five clusters formed by both minimum variance and average linkage sorting each consisted principally of the cladograms obtained from one class of protein sequences (Fig. 8). This corresponds to the finding by Penny et al. (1982, Table 4) that the average distance between cladograms derived from the same kind of sequence was smaller than that between those derived from different sequences.

Subjecting the matrix of Euclidean distances among the 38 cladograms to PCoA (Fig. 8a) demonstrated a contrast along the first axis (not shown; associated with an eigenvalue representing 27.4% of the sum of the 38 eigenvalues of the distance matrix) between cladograms derived from cytochrome c sequences in which a few clades, each consisting of several EUs, are present (cladogram 17, Fig. 8e) and ones derived from fibrinopeptide sequences in which chaining is conspic- uous and all or most EUs form a clade by themselves (clado- gram 24, Fig. 8c). This pattern resembles that seen along the first PCoA axis in comparisons of the 33 Crataegus pheno- grams (Fig. 7).

Contrasts along PCoA axes two through four are associated more with alternative phylogenies than with the presence or absence of chaining. 'The relative positions in the cladograms

of dog (D) and horse (E) appear to account for the separation of clusters 111 and IV, consisting of all but two of the clado- grams derived from hemoglobin A and B sequences, from the rest of the sample (Fig. 80; compare Figs. 80 and 8d with Figs. 8c, 8e-8g; Penny et al. 1982). It is of interest that an examination of variation in the relative position of dog between the cladograms based on hemoglobin sequences (cladograms 27-39) and the remainder of the sample has led D. Penny and M. D. Hendy (unpublished material) to suggest checking dog hemoglobin sequences for possible errors. The method employed here thus has the additional advantage that it too may indicate which data in a study require verification, if they result in unusually incongruent dendrograms. Separation of cluster 111 from the other clusters along the fourth PCoA axis is associated with the relative position of mouse (M) in the cladograms (compare 36, Fig. 8d, with Figs. 80, 8c, 8e-8g).

Our results demonstrate how the multivariate comparison of cladograms, in this case, can greatly facilitate summarization and interpretation of the contrasts among a large number of complex alternatives. Moreover, these results make it clear that while the 38 different minimal and near-minimal cladograms obtained by Penny et L I ~ . (1982) are quite similar, it can also be argued that particular proteins support rather different phyloge- nies than do others. Whether these differences are sufficient to constitute a falsification of the prediction made by Penny et LII. (1982; that different proteins in related organisms should sup- port similar phylogenetic schemes) is unclear. We suggest, however, that in view of the very large numbers of alternative phylogenetic hypotheses (cladograms) generated by different automatic reconstruction methods, data sets and coding schemes (Colless 1983), comparisons using a method like the one described here are needed to detect in the results patterns such as we have demonstrated (Fig. 8a) in those of Penny et al. (1982).

Methodological remarks The multivariate technique suggested for the analysis of

dendrograms utilizes five descriptor variables. Other possibili- ties for D-D comparisons, such as those offered by ultrametric relationships, nearest-neighbor interchanges, edge removals, etc. were deliberately neglected in constructing the distance coefficient [9]. It is of course conceivable that as many aspects of dendrogram structure as possible could be considered by incorporating all direct and indirect techniques into a single method. An important point, however, is that the calculation and standardization of dendrogram descriptors is a fast pro- cedure, whereas most direct techniques (e.g., ultrametric dis- similarity) are much more time-consuming. A further difficulty is that no efficient algorithm is available for calculating nearest- neighbor interchanges, and the use of existing ones is restricted to relatively small dendrograms (Day 1983; Jarvis et al. 1983). Conversely, for the dendrogram descriptors the upper limit of dendrogram size and number is determined only by com- puter memory. Despite the problems it would be worthwhile in the future to examine whether or not features of dendrogram structure not incorporated in the approach given here in fact give substantial additional information about the relationships among alternative dendrograms.

A serious objection to the use of dendrogram descriptors might be that the cells of a descriptor matrix are not indepen- dent (Farris 1969, 1973). For this reason the descriptor matrix cannot be used to assess the statistical significance of the com- parison. (Gower (19710) suggested a technique which avoids

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 11: Comparison of dendrograms: a multivariate approach

PODANI AND DICKINSON A CA A H A R M D O P S C E K H A R M O S C P E D K H A R D S C P E O M K

A fp& A H A R M O S C P E D K H A R O M S C P E D K H A R O M E P C S D K

FIG. 8. Multivariate comparison of 38 different cladograms for 1 1 mammals published by Penny er rrl. (1982). (rr) Cladograms in the space defined by the second and third principal coordinate axes for the matrix of Euclidean distances calculated from four dendrogram descriptors (TD, CMD, PMD, and SMD). Shading indicates clusters of cladograms (I-V) produced by minimum variance clustering using the same distances. Cladograms are coded according to the protein sequences on which they are based: 0. cytochrome c (cladograms 12- 17); A, fibrinopeptide (cladograms 18-26); Q, hemoglobin (cladograms 27-39); and 0, data for all proteins combined (cladograms I - I I). Open symbols indicate the A form, solid symbols the B form of the corresponding proteins. The combination symbol (solid circle, open lozenge) represents cladograms 8 and 38, which are identical. The solid lozenge represents cladogram 7 which is identical to the consensus cladogram obtained by Penny er ol. (1982). Figs. 8b-8g. Selected cladograms for I I mammals published by Penny ern/. (1982) which illustrate extrema in the principal coordinates analysis shown in ( a ) . Key to taxa: H , human; A, ape; R, Rhesus monkey; 0 , rabbit; M. mouse; D, dog; P, pig; C, cow; S. sheep; E, horse; K, kangaroo. Cladograms are based on data for (b) hemoglobin A, (r.) fibrinopeptide B. (d) hemoglobin B, (e) cytochrome C, (f') cytochrome c , and (g) all proteins.

this problem, but the distributional properties of his R' statistic Another potential criticism is that the five descriptors are corre- are unknown.) However, any statistic based on descriptors can lated, and that this interdependence remains uncontrolled still be used as an indicator of dendrogram congruence by during the analysis. Correlation is present in practically every keeping in mind that the differences in magnitude observed multivariate situation, and guarantees that agreements among are interpretable only in the framework of the given study. different descriptors will tend to be emphasized, rather than

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 12: Comparison of dendrograms: a multivariate approach

2776 CAN. J . DOT. VOL. 62. 1983

disagreements. Thus, for the descriptive purposes illustrated here, correlation of descriptors is in fact an advantage.

Because of the nature of the descriptors, the method pre- sented here is applicable only to strictly binary dendrograrns. For this reason, dendrograms differing in the number of interior vertices and thus implying unequal numbers of partitions and subtrees should not be compared in this way, since the dis- tortion occurring in the results would be unpredictable. Nevertheless, experience with many data sets suggests that if the number of objects is much higher than the number of dendrograrns, a few tri- or poly-furcations at low hierarchic levels may be allowed without jeopardizing the interpretability of the final results (e.g., J. Podani, unpublished study of 80 phytosociological units). We admit that the condition of binarity represents a serious limitation since in many practical situations hierarchies are neither detailed nor of great impor- tance at low levels. Undoubtedly our technique could be modified so as to obtain a more widely applicable tool for dendrograrn comparisons.

Conclusions The three new descriptors of dendrograrn structure that have

been introduced here (CMD, SMD, and PMD) together with the two existing ones (CD and TD) make possible multivariate comparisons of dendrograrns. Analysis of simple artificial den- drograms revealed striking differences among these five de- scriptors. TD, a descriptor currently used in many comparative studies (e.g., Duncan et al. 1980; Lachance and Starmer 1982; Hopper and Burgman 1983; cf. also the relationship between TD and some consensus indices), provided the most unique characterization of dendrograrn resemblance. Comparisons based on SMD also differed from those based on the other descriptors. .The remaining three descriptors tended to be in greater agreement with one another, and this agreement turned out to have a decisive influence on the results based on all five descriptors. Of course, far-reaching conclusions should not be drawn from this observation since dendrograrns A - 0 were arbitrarily selected from the set of all possible ones. Never- theless, we feel that this example demonstrates the greater objectivity of simultaneous comparisons of dendrograrns based on all five descriptor variables.

The relevance of multivariate dendrograrn analysis to the

study of current numerical taxonomic problems is clear from the results of its application to two real data sets. The Cra- taeglts study exemplified how illuminating our approach may be when examining the relative importance of choices asso- ciated with paths a, b, and c (Fig. I). The experimental design of the study (Fig. 3) was deliberately made rather complex to demonstrate the possibilities present in a single analysis. Thus, the effects on the results of cluster analyses of character suite, scale of measurement, dissimilarity coefficient, and sorting algorithm were examined simultaneously. Character suite, as it related to the partitioning of variation between and within tax- onomic groups in the sample, turned out to be the most im- portant single factor (Fig. 7). In the same way, scale type and sorting algorithm were also found to be important. The choice of resemblance function in this case appeared to be much less critical.

The analysis of cladograms showed that even if no signifi- cant differences exist among the trees obtained by Penny et 01. (1982), trends determined by the nature of the original input data and not otherwise revealed are nevertheless present, and of some interest (Fig. 8).

We conclude that an optimal survey of any kind of trees should include both significance tests and multivariate com- parisons. Such an approach may prove extremely rewarding, regardless of whether or not there is reason to hypothesize the existence of a single, true dendrograrn representation of the sample.

Acknowledgements For supervision of the theses on which the work presented

here is based and for financial support from their NSERC operating grants, the authors are deeply indebted to L. Orl6ci (J.P.) and J. B. Phipps (T.A.D.). Each of us also gratefully acknowledges financial support from an Ontario Graduate Scholarship. We thank P. G. Smith, J. McNeill, and an anon- ymous reviewer for their comments and suggestions made on drafts of this paper. We thank D. Penny and M. D. Hendy of Massey University, Palmerston North, New Zealand, and L. R. Foulds of the University of Florida at Gainesville for allowing us to use some of the illustrations from their paper as our Figs. 8b-8g. These are reproduced by permission from Nature Vol. 297, No. 5863, page 198. Copyright 0 1982 Macmillan Journals Limited.

Appendix The Fortran program DENDAT has been written to prepare data for the multivariate analysis of alternative dendrograms. I t is assumed that

all dendrograms being analyzed are binary and that there is no reversal at the highest hierarchical level in any dendrogram. The input data file for this program consists of a series of merge matrices. Each merge matrix describes the fusion sequence in a dendrogram

in terms of m - I rows, where m is the number of objects. In each row there are five values: the first two refer to the identification numbers of the clusters being amalgamated, the second two indicate the number of objects in these clusters, while the fifth value is the hierarchical level at which the two clusters are fused. For example, the merge matrix for dendrogram A in Fig. 2 will be

Note that the identification number of a cluster always corresponds to the smallest of the identification numbers of its component objects. The recent versions of the classification programs NCLAS, HMCL, and INFCL of the SYN-TAX package (Podani 1984) include a subroutine which automatically creates a data file with the merge matrix as soon as the classification process is terminated. Alternatively, the merge matrix could be prepared by hand.

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 13: Comparison of dendrograms: a multivariate approach

PODANI A N D DICKINSON 2777

Thc user of DENDAT can choose which of thc five dcscriptors will bc calculatctl by specifying valucs for variablcs I 1-15, Hicrnrchical lcvcls must bc includcd in thc mcrgc matrix cvcn if cophcnctic diffcrenccs arc not rcqucstcd. As in thc casc of thc cladograms analyzcd in this paper, dummy valucs rcflccting thc rclativc magnitude of thc hierarchical lcvcls arc uscd whcn prccisc valucs arc unavailablc.

Othcr input paramctcrs to bc provided by thc user arc thc numbcr of objects in cach dcndrogram (M), thc numbcr of dcndrograms (NC), the numbcr of high-lcvcl partitions to bc rctaincd for calculation of PMD (KPART; dcfault = M - I ). and thc lowcst valuc in thc dendrogram (BAS; thc diagonal valuc of thc rcscmblancc matrix among ob.jccts, usually cithcr zero or unity).

Thc output is cithcr a distancc matrix for dcndrograms (MOPT = 0 ) or a data matrix from which thc distancc matrix will bc calculated by another program (MOPT = 1 ). In thc first case thc uppcr half of thc NC x NC distancc matrix. including thc principal diagonal. is output by columns. In thc second, output consists of an I x (M x (M - I ) / ? _ ) (rows) x NC(columns) matrix. output by rows, whcrc I is thc numbcr of dcscriptors calculatcd. In both cascs thc descriptor variables arc standardized as dcscribcd earlicr. Printed output from DENDAT lists thc values of all input paramctcrs plus thosc of thc constants used to standardize thc dcscriptors.

Thc listing of program DENDAT and somc input-output cxamples arc available from thc authors on ~.cqucst.

ADAMS, E. N. 1972. Conscnsus tcchniqucs and thc comparison of taxonomic trccs. Syst. Zool. 21: 390-397.

ATCHLEY. W. R., E. V. NORDHEIM, F. C. GUNSETT, and P. L. CRUMP. 1982. Gcomctric and probabilistic aspects of statistical distancc functions. Syst. Zool. 31: 445-460.

COLLESS, D. H. 1983. Wagncr trccs in thcory and practice. 111 Numcr- ical taxonomy. Etlir~tl by J. Fclsenstcin. Springer-Vcrlag, Bcrlin and Hcidclbcrg. pp. 259-278.

CROVELLO. T . J. 1968. Thc cffcct of alteration of tcchniquc at two stages in a numerical taxonomic study. Univ. Kans. Sci. Bull. 47: 761 -786.

CUNNINGHAM, K. M.. and J. C. OCILVIE. 1972. Evaluation of hicr- archical grouping tcchniqucs. A preliminary study. Comput. J. 15: 209-213.

DAY, W. H. E. 1983. Propertics of thc ncarcst-neighbor interchange metric for trccs of small sizc. J . Thcor. Biol. 101: 27.5-288.

DICKINSON, T. A. 1983. Crnrn~grrs c.rr1.s-gcrlli L. sprlsrr lcrro in south- ern Ontario: phcnotypic variation and variability in rclation to reproductive behavior. Ph.D. thcsis, Univcrsity of Wcstcrn Ontario, London. (National Library of Canada Microfichc TC 58716.)

DICKINSON, T. A , , and J. B. PHIPPS. 1984. Studies in C ~ ~ I I ~ I P ~ I I S (Rosaceac: Maloidcac). IX. Short-shoot leaf hetcroblasty in Crtl- tnegus crus-gtrlli serlsrr Inro. Can. J. Bot. 62: 1775 - 1780.

1985. Studies in Crrrmegrrs L. (Rosaccac: Maloidcac). XIII. Degrec and pattcrn of phcnotypic variation in Crrrrtregus section Crlrs-gnlli in Ontario. Syst. Bot. 10. In prcss.

DIE=, E. J. 1983. Permutation tcsts for association bctwecn two distancc matriccs. Syst. Zool. 32: 21 -26.

DOBSON, A. J. 1975. Comparing thc shape of trccs.Lcct. Notcs Math. 452: 95 - 100.

DOUGLAS, M. E., and J. A. ENDLER. 1982. Quantitative matrix com- parisons in ccological and evolutionary invcstigations. J. Thcor. Biol. 99: 777-795.

DUNCAN, T., and B. R. BAUM. 1981. Numcrical phenctics: its uscs in botanical systcmatics. Annu. Rcv. Ecol. Syst. 12: 387-404.

DUNCAN, T. , R. B. PHILLIPS, and W. H. WAGNER, JR. 1980. A comparison of branching diagrams dcrivcd by various phcnctic and cladistic mcthods. Syst. Bot. 5: 264-293.

FARRIS, J. S. 1969. On the cophcnctic corrclation cocfficicnt. Syst. Zool. 18: 279-285.

1973. On comparing thc shapes of taxonomic trccs. Syst. Zool. 22: 50-54.

GILMARTIN, A. J. 1974. Variation within populations and classi- fication. I. Taxon, 23: 523-536.

1980. Variation within populations and classification. 11. Patterns of variation within Asclcpiadaccac and Umbcllifcrac. Taxon, 29: 199-2 10.

COWER, J. C. 197 1tl. A general cocfficicnt of similarity and some of its properties. Biometries, 27: 857-87 1.

1971b. Statistical mcthods of comparing diffcrcnt multi- variate analyscs of thc samc data. 111 Mathematics in thc archaeo- logical and historical scicnccs. Etliretl bjl F. R. Hodson, D. G. Kendall, and P. Triutu. Univcrsity of Edinburgh Prcss, Edinburgh. pp. 138-149.

HARTIGAN, J . A. 1967. Rcprcscntation of similarity matriccs by trccs. J. Am. Stat. Assoc. 62: 1 140- 1 158.

1975. Clustcring algorithms. John Wilcy & Sons. Ncw York. HAWKSWORTI-I, F. G . , G. F. ESTABROOK, and D. J. ROGERS. 1968.

Application of an information thcory modcl for character analysis in thc gcnus Arcc~11thol~i1rrr1 (Viscaccac). 'Lixon. 17: 605-6 19.

HOPPER, S. D., and M. A. BURGMAN. 1983. Cladistic and phcnctic analyscs of phylogcnctic relationships among populations of Errc~tr1yp1~r.s ctr~sicr. Aust. J. Bot. 31: 35-49.

JACKSON. D. N. 1969. Comparison of classifications. 111 Numcrical taxonomy. Etlirecl 1 ) ~ A. J . Colc. Academic Prcss, Ncw York. pp. 91-1 13.

JARDINE, N., and R. SIBSON. 1968. 'I'hc construction ol'hicrarchic and nonhicrarchic classifications. Comput. J . 11: 177- 184.

1971. Mathematical taxonomy. John Wilcy & Sons. Ncw York.

JARVIS, J. P.. J . K. LUEDMAN, and D. R. SHIER. 1983. Commcnts on computing thc similarity of binary trccs. J. Thcor. Biol. 100: 427-433.

JENSEN, R. J . 1983. Rcport on thc Sixteenth lntcrnational Nurncrical Taxonomy Confcrcncc. Syst. Zool. 32: 83-89.

LACHANCE, M.-A,, and W. 7'. STARMER. 1982. Evolutionary signifi- cance of physiological relationships among ycast communities associated with trccs. Can. J . Bot. 60: 28.5-293.

MICKEVICH, M. F. 1978. Taxonomic congrucncc. Syst. Zool. 27: 143-158.

O R L ~ C I . L. 1978. Multivariatc analysis in vegetation rcscarch. Dr. W. Junk bv Publishers. Thc Haguc.

PENNY, D. , L. R. FOULDS, and M. D. HENDY. 1982. Testing thc thcory of evolution by comparing phylogcnctic trccs constructed from five diffcrcnt protein sequcnccs. Naturc (London), 297: 197-200.

PHIPPS, J. B. 197 1. Dcndrogram topology. Syst. Zool. 20: 306-308. PHIPPS, J. B., and M. MUNIYAMMA. 1980. A taxonomic rcvicw of

Crrrtcr~gus (Rosaccac) in Ontario. Can. J. Bot. 58: 1621 - 1699. PODANI, J. 1980. SYN-TAX: computer program package for cluster

analysis in ccology, phytosociology and taxonomy. Abstr. Bot. (Budapest), 6: 1 - 158.

1982. Spatial proccsscs in the analysis of vcgctation. Ph.D. thcsis, Univcrsity of Wcstcrn Ontario, London. (National Library of Canada Microfiche TC 56124.)

1984. SYN-TAX 11. Computcr prograrns for data analysis in ccology and systcmatics. Abstr. Bot. (Budapcst). In prcss.

PRENTICE. H. C. 1979. Numcrical analysis of infraspccific variation in European Silcr~e nlbtl and S. tlioictr (Caryophyllaceac). Bot. J. Linn. Soc. 78: 181-212.

ROBINSON, D. F. 1971. Comparison of labcllcd trccs with valcncy thrcc. J. Comb. Theory, 11: 105- 119.

ROBINSON, D. F., and L. R. FOULDS. 1979. Comparisons of wcightcd labcllcd trccs. Lcct. Notcs Math. 748: 119- 126.

ROHLF, F. J. 1970. Adaptive hicrarchical clustering schcmcs. Syst. Zool. 19: 58-82.

1974. Mcthods of comparing classifications. Annu. Rcv. Ecol. Syst. 5: 101 - 1 13.

1982. Conscnsus indices for comparing classifications. Math.

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.

Page 14: Comparison of dendrograms: a multivariate approach

2778 CAN. J . BOT. VOL. 62. 1984

Biosci. 59: 13 1 - 144. ROHLF. F. J . . and R. R. SOKAL. 1981. Comparing numcricnl tax-

onomic studics. Syst. Zool. 30: 459-490. SCHNELL. G. D. 1970. A phcnctic study of thc subordcr Lari

(Avcs). 11. Phcnograms. discussion and conclusions. Syst. Zool. 19: 264-302.

SEAMAN. F. C . . and V. C . FUNK. 1983. Cladistic analysis ofconiplcx natural products: dcvcloping transformation scrics from scsqui- tcrpcnc lactonc data. Taxon, 32: 1-27.

SINNOTT. Q. P., and .I. B. P ~ I I P P S . 1983. Variation pattcrns in Crcrtrreg~rs scrics Pr~ri~losrre (Rosaccac) in southcrn Ontairo. Syst. Bot. 8: 59-70.

SMITH. P. G . , and J. B. PHIPPS. 1984. Consensus trccs in phcnctic

analyscs. Taxon. 33(4). In prcss. S N E A T ~ I . P. H. A.. and R. R. S ~ K A L . 1973. Numerical taxonomy.

W. H . Frccman & Co.. San Francisco. SOKAL. R. R. , and F. J . Rot-ILF. 1962. 'Thc con~parison of dcndro-

grams by ob.jcctivc mcthods. Taxon. 11: 33-40. WARD. W. H. , JR. 1963. Hierarchical grouping to optimizc an objcc-

tivc i'unction. .I. Am. Stat. Assoc. 58: 236-244. WATERMAN. M . S. . and 'T. F. S M I T ~ I . 1978. On thc similarity of

dcndrograms. .I. Thcor. Biol. 73: 789-800. WILLIAMS. W. T.. and H . T . CLIFFORI). 197 1. On thc comparison

of two classifications of thc samc sct of clcmcnts. Taxon. 20: 5 19-522.

Can

. J. B

ot. D

ownl

oade

d fr

om w

ww

.nrc

rese

arch

pres

s.co

m b

y U

nive

rsity

of

P.E

.I. o

n 11

/18/

14Fo

r pe

rson

al u

se o

nly.


Top Related