zinc finger gene clusters and tandem gene duplication

18
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 9, Number 2, 2002 © Mary Ann Liebert, Inc. Pp. 429–446 Zinc Finger Gene Clusters and Tandem Gene Duplication MENGXIANG TANG, 1 MICHAEL WATERMAN, 2;3 and SHIBU YOOSEPH 2 ABSTRACT Zinc nger genes in mammalian genomes are frequently found to occur in clusters with cluster members appearing in a tandem array on the chromosome. It has been suggested that in situ gene duplication events are primarily responsible for the evolution of such clusters. The problem of inferring the series of duplication events responsible for producing clustered families is different from the standard phylogeny problem. In this paper, we study this inference problem using a graph called duplication model that captures the series of duplication events while taking into account the observed order of the genes on the chromosome. We provide algorithms to reconstruct a duplication model for a given data set. We use our method to hypothesize the series of duplication events that may have produced the ZNF45 family that appears on human chromosome 19. Key words: gene duplication, algorithms, evolutionary history. 1. INTRODUCTION Z inc nger (ZNF) genes code for proteins that contain one or more zinc nger motifs. The zinc nger motif is one of the most common motifs involved in nucleic acid–protein interaction. This motif was rst discovered in the transcription factor TFIIIA in Xenopus (Miller et al., 1985). Since then, the database of ZNF genes has been increasing rapidly. Of the various zinc nger motif types, the one most commonly occurring is the C2H2 type. In fact, it is estimated that mammalian genomes contain between 600 and 1,000 genes that code for C2H2-type zinc nger proteins (Hoovers et al., 1992; Shannon et al. , 1998). Experimental studies on functions of these ZNF genes suggest that many of them code for transcription factors, and some of them are known to take part in cellular growth and development (El-Baradi and Pieler, 1991). However, the biological functions of most of these ZNF genes are currently unknown. The various sequencing projects reveal an interesting picture regarding the organization of these ZNF genes in mammalian genomes. In mammals, ZNF genes tend to occur as clusters (families) scattered throughout the various chromosomes. For instance, studies (Ashworth et al. , 1995; Shannon et al., 1996) on human chromosome 19 (H19) indicate that there are at least 100 ZNF genes on this chromosome; in 1 ACLARA BioSciences, 1288 Pear Street, Mountain View, CA 94043. 2 Celera Genomics, 45 West Gude Drive, Rockville, MD 20850. 3 Department of Mathematics, University of Southern California, Los Angeles, CA 90089. 429

Upload: shibu

Post on 24-Mar-2017

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Zinc Finger Gene Clusters and Tandem Gene Duplication

JOURNAL OF COMPUTATIONAL BIOLOGYVolume 9, Number 2, 2002© Mary Ann Liebert, Inc.Pp. 429–446

Zinc Finger Gene Clusters and TandemGene Duplication

MENGXIANG TANG,1 MICHAEL WATERMAN,2;3 and SHIBU YOOSEPH2

ABSTRACT

Zinc � nger genes in mammalian genomes are frequently found to occur in clusters withcluster members appearing in a tandem array on the chromosome. It has been suggestedthat in situ gene duplication events are primarily responsible for the evolution of suchclusters. The problem of inferring the series of duplication events responsible for producingclustered families is different from the standard phylogeny problem. In this paper, we studythis inference problem using a graph called duplication model that captures the seriesof duplication events while taking into account the observed order of the genes on thechromosome. We provide algorithms to reconstruct a duplication model for a given data set.We use our method to hypothesize the series of duplication events that may have producedthe ZNF45 family that appears on human chromosome 19.

Key words: gene duplication, algorithms, evolutionary history.

1. INTRODUCTION

Zinc � nger (ZNF) genes code for proteins that contain one or more zinc � nger motifs. The zinc� nger motif is one of the most common motifs involved in nucleic acid–protein interaction. This

motif was � rst discovered in the transcription factor TFIIIA in Xenopus (Miller et al., 1985). Since then,the database of ZNF genes has been increasing rapidly. Of the various zinc � nger motif types, the onemost commonly occurring is the C2H2 type. In fact, it is estimated that mammalian genomes containbetween 600 and 1,000 genes that code for C2H2-type zinc � nger proteins (Hoovers et al., 1992; Shannonet al., 1998). Experimental studies on functions of these ZNF genes suggest that many of them codefor transcription factors, and some of them are known to take part in cellular growth and development(El-Baradi and Pieler, 1991). However, the biological functions of most of these ZNF genes are currentlyunknown.

The various sequencing projects reveal an interesting picture regarding the organization of these ZNFgenes in mammalian genomes. In mammals, ZNF genes tend to occur as clusters (families) scatteredthroughout the various chromosomes. For instance, studies (Ashworth et al., 1995; Shannon et al., 1996)on human chromosome 19 (H19) indicate that there are at least 100 ZNF genes on this chromosome; in

1ACLARA BioSciences, 1288 Pear Street, Mountain View, CA 94043.2Celera Genomics, 45 West Gude Drive, Rockville, MD 20850.3Department of Mathematics, University of Southern California, Los Angeles, CA 90089.

429

Page 2: Zinc Finger Gene Clusters and Tandem Gene Duplication

430 TANG ET AL.

addition, these genes are found to occur in clusters in � ve major sites. One of the most striking features ofthe organization of these families is that the family members appear in a tandem array on the chromosome.

Most studies of zinc � ngers have focused on the three-dimensional structure of � ngers and on thebiochemistry of DNA recognition by zinc � ngers. However, less attention has been paid to the organizationand evolution of ZNF gene families. In this paper, we undertake a study of the evolution of ZNF genefamilies whose members occur in a tandem array on the same chromosome. Evolutionary studies can shedlight on the mechanisms of evolution of such gene families. Also, evolutionary studies in conjunction withother analyses (for instance, analysis of upstream sequences) can yield valuable insights into the regulatorymechanisms controlling these genes and thus provide clues to their functions.

2. REPRESENTING EVOLUTIONARY HISTORY USING ADUPLICATION MODEL

We introduce the evolutionary inference problem by using the ZNF45 gene family as an example; theorganization of the members of this family is typical of several known zinc � nger gene families. TheZNF45 gene family is found in the q13.2 gene cluster on H19 chromosome (Shannon et al., 1998). Thisfamily is estimated to have between 15 and 20 members. Each member in this family is a KRAB-ZNFgene; that is, each member has a Kruppel-associated box (KRAB) at the head and zinc � ngers at thetail. The head is separated from the tail by a spacer element. The KRAB is a non-ZNF sequence that isknown to take part in transcriptional repression (Margolin et al., 1994), and it is highly conserved in eachmember of the family. Every zinc � nger is C2H2 type and has 28 amino acids. A typical C2H2-type � ngersequence is of the form CysX2CysX3PheX5LeuX2HisX3HisX7 (with X being any amino acid) (Shannonet al., 1998; El-Baradi and Pieler, 1991). The members of the ZNF45 family have differing numbers ofzinc � ngers. The members are approximately evenly spaced on the chromosome and are arranged in atandem array.

The organization and features of the members of the ZNF45 gene family suggest that the members in thefamily may have been produced by a series of in situ tandem gene duplication events (Ohno, 1970; Shannonet al., 1998). The duplication process operates on a unit of one or more genes. It copies the unit and placesthe copy adjacent to the original. A mechanism for such a process could be unequal crossing-over duringmeiosis or mitosis that occurs in the germ line (Li and Graur, 1991). In our model of evolution, we assumethat transposition events do not occur; that is, a gene cannot be excised and reinserted in another regionin the chromosome.

The problem of inferring the duplication history of tandem repeats has been studied in a recent work(Benson and Dong, 1999). This work de� nes several variants of the duplicationhistory inference problem. Itnotes that inferring evolutionary relationships of sequences evolving through tandem duplication is differentfrom the traditional phylogeny inference problem as there is the additional constraint imposed by the orderof the sequences in the genome. We argue similarly and introduce the notion of a duplication model—a graph that can be used to capture the series of duplication events and the evolutionary relationships whiletaking into account the observed left-to-right order of the genes on the chromosome.

A duplication model is a directed graph that has nodes, edges, and blocks. A node represents a gene, anedge between two nodes represents a parent–child relationship, and a block represents a gene duplicationevent. Certain edges in a duplication model are allowed to cross each other; this is described using anedge-crossing rule. Figure 1 shows an example of a duplication model.

There is a special node designated the root node r that has only outgoing edges; the root represents theancestral gene of the genes in our data set. An edge e D .u; v/ from node u to node v indicates that u

is the parent of v. A node that has no outgoing edges is called a leaf node. The set of leaves is denotedby L . The leaves of a duplication model represent our data set (that is, the genes that we observe on thechromosome). The left-to-right order of the leaf nodes (genes) de� ned by the duplication model is thesame order in which these genes appear on the chromosome; for instance, in Fig. 1, the left-to-right orderis s1; s2; s3; s4; s5; s6; s7; s8; s9; s10, and s11.

Every nonroot node in a duplication model has exactly one incoming edge and every nonleaf node hasexactly two outgoing edges. A node u is an ancestor of node v if u is on the path from r to v; in thiscase, v is also said to be a descendent of u. The direction of the edges in Fig. 1 has been omitted with

Page 3: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 431

FIG. 1. Example of a duplication model on a leaf set L D fs1; s2; : : : ; s11g. The six blocks are represented by boxes;they are B1 D hri, B2 D hv1i, B3 D hv2; v3i, B4 D hv4; v5i, B5 D hv6; v7i, B6 D hv8; v9i.

the understanding that the orientation is from top to bottom; i.e., a node appears below its ancestor in thegraph.

A block consists of one or more nonleaf nodes. A block with one node represents a single gene duplicationevent, and a block with at least two nodes represents a multigene duplication event. Every nonleaf node ispart of exactly one block. No node in a block is the ancestor (or descendent) of another node in the sameblock. The following rule applies to the children of nodes in a block: let B D hvi1; vi2 ; : : : ; vib i denote ablock containing genes vi1; vi2 ; : : : ; vib in left-to-right order. Then, for each 1 · k < l · b, the duplicationevent places vk’s left child to the left of vl’s left child and vk’s right child to the left of vl’s right child; thisis an alternate way of stating that a duplication event places the copy of a block adjacent to the original.

Certain edges in a duplication model cross each other as described by the following edge-crossing rule.Let block B D hvi1 ; vi2 ; : : : ; vib i and let lc.v/ and rc.v/ denote the left child and right child, respectively,of a node v. Then, for every k, l, where 1 · k < l · b, the edges .vk; rc.vk// and .vl; lc.vl// cross eachother. No other edges in a duplication model cross each other.

If we represent only the parent–child relations de� ned by a duplication model DM, then the resultingstructure TDM is planar and can be drawn without the edges having to cross each other. This is apparentfrom the de� nition of nodes and edges in a duplication model. The graph TDM is the rooted binary tree forthe given set of genes and is said to be the associated phylogeny for DM. Clearly, the associated phylogenyof DM is unique. Figure 2 shows an example of an associated phylogeny of a duplication model. Whilethe nodes in DM and TDM are the same, the key difference between TDM and DM is that TDM does notcontain blocks and thus cannot directly explain the duplication events that produced the genes we observe.

In summary, a duplication model captures three aspects of the evolutionary history of a gene family:the ancestral relationship, the duplication history, and the ordering of the members on the chromosome.

FIG. 2. The associated phylogeny of the duplication model in Fig. 1.

Page 4: Zinc Finger Gene Clusters and Tandem Gene Duplication

432 TANG ET AL.

For the remainder of the paper, we assume that the set L D fs1; s2; : : : ; sng of genes under considerationappear as s1; s2; : : : ; sn on the chromosome; we say that a duplication model is consistent with L if theleft-to-right ordering of its leaves is s1; s2; : : : ; sn .

The remainder of the paper is organized as follows. In Section 3, we de� ne the cost of a duplicationmodel and formulate the problem of inferring a duplication model of minimum cost. We provide algorithmsto solve a restricted version of the inference problem. In Section 4, we provide a general method to constructa duplication model and discuss the performance of this method using simulation. In Section 5, we addressthe following question: given a phylogenetic tree T , does there exist a duplication model DM such that itsassociated phylogeny TDM is equal to T ? In Section 6, we apply the method described in Section 4 to theZNF45 family.

3. AN INFERENCE PROBLEM

Let L D fs1; s2; : : : ; sng be a set of genes where each gene si is a sequence of length k, with eachposition (or site) in the gene having a label from an alphabet 6 of size r . Thus, each si 2 6k . Let Dist bea pairwise distance measure on sequences in 6k that satis� es the triangle inequality.

Let DM be a duplication model consistent with L . Further, let s.v/ 2 6k denote the sequence at internalnode v in DM. The size of a block in DM is the number of nodes in that block. We use C.DM/ to denotethe cost of DM and de� ne it as

C.DM/ DX

block b

Cd.bi/ CX

edge e

C.e/;

where Cd.bi/ denotes the duplication cost for block b of size bi and C.e/ D Dist.s.u/; s.v// for edgee D .u; v/. For instance, Dist can be the Hamming distance between the sequences.

Our discussions motivate the following inference problem.

Gene duplication problemInput: Set L D fs1; s2; : : : ; sng of genes and integer b.Output: A duplication model DM consistent with L having maximum block size b, and of minimum cost.

In the remainder of this section, we consider the simplest version of the gene duplication problem—thesingle gene duplication problem. In this problem, the maximum block size b D 1; alternatively, Cd.1/ D 0and Cd .bi/ D 1, 8bi > 1.

3.1. Single gene duplication

The single-gene duplication problem is essentially a phylogeny inference problem but with the additionalconstraint that we seek a phylogeny whose planar representation (i.e., edges not allowed to cross) inducesthe same ordering on the genes as their ordering on the chromosome. We refer to such a tree as an orderedtree.

An ordered tree on a set L D fs1; s2; : : : ; sng is a rooted tree with n leaves in which each leaf has aunique sequence from L and such that the left-to-right ordering of the leaves is s1; s2; : : : ; sn.

Single-gene duplication problem (Maximum Parsimony)Input: Set L D fs1; s2; : : : ; sng of genes, where each gene has length k.Output: An ordered tree T (on L ) in which each internal node is labeled with a sequence of length k andsuch that the cost of T is minimum.

This problem has been shown to be NP-hard (Jaitly et al., 2001). The NP-hardness proof involves areduction from the MAX CUT problem for cubic graphs (graphs where every node has degree 3), whichis a known NP-hard problem (Papadimitriou and Yannakakis, 1991).

The single gene duplication problem has been addressed by Benson and Dong (1999) where it iscalled the tandem repeat history (TRHIST) problem with � xed boundary and � xed duplication size. An

Page 5: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 433

approximation algorithm with ratio 2 has been given. This algorithm uses the concept of an orderedspanning tree. An ordered spanning tree is a spanning tree on an ordered set of nodes (the nodes arenumbered in order from left to right) with the property that for any two edges .i1; j1/ and .i2; j2/, i1 < j1,i2 < j2, we have .i1 ¡ i2/.i1 ¡ j2/.j1 ¡ i2/.j1 ¡ j2/ ¸ 0. The authors (Benson and Dong, 1999) showthat every ordered spanning tree can be transformed into a solution of equal cost for the TRHIST problemwith � xed boundary and � xed duplication size. In addition, they make the observation that there existsan ordered spanning tree whose cost is at most twice the cost of the optimal solution to the TRHISTproblem with � xed boundary and � xed duplication size. They then provide an algorithm to compute aminimum-cost ordered spanning tree.

In this paper we provide a dynamic programming framework that provides algorithms whose quality(of solution) and running time complexity can be controlled by choice of different sequence labels at theinternal nodes. Using this framework, we provide an exact algorithm and then an approximation algorithmwith ratio 2 (this algorithm is different from the one of Benson and Dong [1999]). We then modify thisapproach to provide a polynomial time approximation scheme for the problem.

Algorithms: We describe a dynamic programming framework for solving the single gene duplicationproblem.

For 1 · i · j · n, we use [i; j ] to denote the gene set fsi ; siC1; : : : ; sj g whose elements appearin consecutive order on the chromosome. We say that an ordered tree T spans [i; j ] if T has leaf setfsi; siC1; : : : ; sj g and the left-to-right order of the leaves in T is si ; siC1; : : : ; sj .

Consider an ordered tree T that spans [i; j ]. This tree is obtained by combining ordered subtrees T1

and T2 that span [i; m] and [m C 1; j ], respectively, for some i · m < j . Let C.[i; j ]; s; m/ denote theminimum cost among all ordered trees T that span [i; j ] and have root labeled with s 2 6k , and such thatT ’s left subtree spans [i; m] and T ’s right subtree spans [m C 1; j ]. Let C.[i; j ]; s/ denote the minimumcost among all ordered trees that span [i; j ] and have root labeled with s. Let C.[i; j ]/ denote the minimumcost among all ordered trees that span [i; j ]. Then, we have the following recurrence relations:

C.[i; j ]; s; m/ D mins¤2S1

[C.[i; m]; s¤/ C Dist.s; s¤/]

C mins¤¤2S2

[C.[m C 1; j ]; s¤¤/ C Dist.s; s¤¤/] (3.1)

C.[i; j ]; s/ D mini·m<j

C.[i; j ]; s; m/ (3.2)

C.[i; j ]/ D mins2S

C.[i; j ]; s/ (3.3)

where Dist is the pairwise distance measure on sequences in 6k , and S , S1, and S2 are subsets of 6k .Figure 3 gives the dynamic programming framework. The dynamic programming algorithm works byconsidering each subproblem [i; j ] in the order of increasing cardinality (note that the cardinality of asubproblem [i; j ] is j ¡ i C 1). The base cases are C.[i; i]; si/ D 0, 81 · i · n, and C.[i; i]; sj / D 1,sj 6D si . The minimum-cost ordered tree can be recovered by keeping track of its subtrees.

The framework we discussed de� nes algorithms for which the quality (of solution) and running timecan be controlled by the choice of S , S1, and S2. If we set S D S1 D S2 D 6k , then we are allowingall possible choices of sequence labels. It is clear that the solution returned by the dynamic programmingalgorithm is optimal. The running time of the algorithm is computed as follows: all pairwise distancesof the sequences in 6k can be computed in time O.r2kk/, where r D j6j; there are O.n2/ subproblemsof the form [i; j ], and each subproblem [i; j ] can be solved in O.r2kn/. Thus the total running time isO.r2k.k C n3//.

Lemma 3.1. The single gene duplication problem can be solved exactly in O.r2k.k C n3//.

We obtain a 2-approximation algorithm by combining the dynamic programming described above withthe technique of lifting (Wang et al., 1996). Consider a rooted tree T in which node v is labeled withsequence s.v/ 2 6k . Lifting is a process that replaces the sequence at a node v with a sequence of oneof the leaves in the subtree rooted at v. Lifting operates in a bottom-up fashion; i.e., a node is lifted only

Page 6: Zinc Finger Gene Clusters and Tandem Gene Duplication

434 TANG ET AL.

FIG. 3. Solving the single gene duplication problem.

after all its children have been lifted. Suppose we are at a node v (labeled with sequence s.v/) and wehave already lifted its children v1 and v2, whose new sequences are denoted by si1 and si2 , respectively(where si1 2 L is in the subtree rooted at v1 and si2 2 L is in the subtree rooted at v2). Then, to lift v, wereplace s.v/ with si1 if Dist.s.v/; si1/ · Dist.s.v/; si2 /; otherwise, we replace s.v/ with si2 .

Lemma 3.2. Let Dist be a pairwise distance measure that satis� es the triangle inequality. For a treeT D .V ; E/, let cost of T be given by

C.T / DX

.u;v/2E

Dist.s.u/; s.v//:

Let T ¤ be a most-parsimonious ordered tree for the single gene duplication problem and let T L be thetree obtained from T ¤ by the lifting process described above. Then C.T L/ · 2C.T ¤/.

The proof of the above lemma follows from the results of Wang et al. (1996) and Wang and Gus� eld(1997). The above lemma shows the existence of an ordered tree T L that is lifted and whose cost is atmost twice the optimal cost.

Recall our dynamic programming framework. If we set S D [i; j ], S1 D [i; m], and S2 D [m C 1; j ],then the algorithm returns an ordered tree T that is lifted and has minimum cost among all lifted orderedtrees. From the above lemma, we have C.T / · C.T L/ · 2C.T ¤/. Since lifting considers only labels fromL , the total running time is O.n2.k C n3//.

Lemma 3.3. The single gene duplication problem has a 2-approximation algorithm with running timeO.n2.k C n3//.

The lifting technique can be combined with local optimization to provide a polynomial time approxi-mation scheme (PTAS) for the single gene duplication problem. A PTAS is an algorithm which, for every² > 0, returns a solution whose cost is at most .1 C ²/ times the cost of the optimal solution and runs intime bounded by a polynomial (depending on ²) in the input size. The key lemma for designing the PTASis based on the results on the problem of tree alignment with a given phylogeny (Wang et al., 1996; Wangand Gus� eld, 1997).

Page 7: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 435

Lemma 3.4, which we give later on, follows from the corresponding lemmas of Wang et al. (1996) andWang and Gus� eld (1997) and thus we omit its proof in this paper. The crucial assumption in these lemmasis that the pairwise distance measure Dist satis� es the triangle inequality. We describe our adaptation oftheir results to the single gene duplication problem.

Let ordered tree T (with root r) be an optimal solution to the single gene duplication problem. For everynode v 2 T , we use s.v/ to denote the sequence (of length k) at v. Also, for each node v 2 T , let Tv

denote the subtree of T rooted at v. Among all leaves in Tv , let l.v/ be that leaf whose path from it to v

is the shortest. We call l.v/ the closest descendent leaf of v. Let sl.v/ denote the sequence label of l.v/.For node v 6D r, we de� ne the depth-t component Tv;t as the subtree of Tv containing only the top t C 1

levels. We use C.Tv;t / to denote the cost of Tv;t labeled in the following manner: node v (root of Tv;t ) islabeled with sl.v/ and each leaf u 2 Tv;t with sl.u/; the remaining nodes in Tv;t are labeled so that thecost of Tv;t is minimized. These remaining labels can be identi� ed using dynamic programming (Fitch,1971; Hartigan, 1973). Note that, if a node v is near the bottom level of T , the real depth t 0 of Tv;t couldbe less than t . In this case, we process Tv;t 0 in the same manner as for Tv;t . Now, for each i D 0; : : : ; t ¡1,let Tr;i denote the subtree of T containing only the top i C 1 levels. We use C.Tr;i/ to denote the cost ofTr;i labeled in the following manner: each leaf u in Tr;i is labeled with sl.u/ and every nonleaf node isthen labeled so as to minimize the cost of Tr;i . As before, these labels can be identi� ed using dynamicprogramming.

Wang et al. (1996) show that

X

v 6Dr

C.Tv;t/ Ct¡1X

iD0

C.Tr;i/ · .t C 3/C.T /: (3.4)

The left hand side of the sum can be rewritten as

tX

iD1

C.¿i/ · .t C 3/C.T /

where each ¿i is a labeled tree obtained by combining Tr;i with the necessary depth-t components so that¿i has the same topology as T . Each ¿i is called a depth-t component tree. Thus, we have that there issome ¿j for which C.¿j / · .1 C 3

t/C.T /. In Wang and Gus� eld (1997), a tighter bound is shown for the

left-hand-side sum of Equation 3.4 which results in the following lemma.

Lemma 3.4. There exists ¿j for which C.¿j / · .1 C 2t

¡ 2t2t /C.T /.

If we knew the leaf-labeled topology of T , then ¿j of the above lemma can easily be identi� ed usingdynamic programming. However, since we do not know T , we have to do additional work. The PTAS wedescribe is a dynamic programming algorithm that considers subproblems of the form [i; j ].

Let T[i;j ] denote an ordered tree that spans [i; j ], and let T[i;j ] D fT[i;j ]g. Also, let T[i;j ];t .u1; u2; : : : ; ulIi1; i2; : : : ; il¡1/ denote the ordered subtree of T[i;j ] satisfying the following properties:

² it has l leaves denoted by u1; u2; : : : ; and ul (in left-to-right order),² every root-leaf path has at most t edges, and² subtrees of T[i;j ] rooted at nodes u1; u2; : : : ; and ul span [i; i1], [i1 C 1; i2]; : : : ; and [ii¡1 C 1; j ],

respectively.

Each T[i;j ];t.u1; u2; : : : ; ulI i1; i2; : : : ; il¡1/ is referred to as an ordered depth-t component tree. Notethat each ordered depth-t component tree has at most 2t leaves. We de� ne T[i;j ];t as

T[i;j ];t D fT[i;j ];t .u1; u2; : : : ; ulI i1; i2; : : : ; il¡1/g;

where T[i;j ] 2 T[i;j ]; 2 · l · 2t , and i · i1 < i2 : : : < il¡1 < j .Our dynamic programming algorithm works by considering each ordered tree T[i;j ];t .u1; u2; : : : ; ulI

i1; i2; : : : ; il¡1/. The root is labeled with a sequence from [i; j ]. The leaves are labeled as follows: leaf

Page 8: Zinc Finger Gene Clusters and Tandem Gene Duplication

436 TANG ET AL.

FIG. 4. Solving the single gene duplication problem.

u1 is labeled with a sequence from [i; i1]; leaf u2 from [i1 C 1; i2], and so on. Then the sequences at theother nonroot internal nodes in the tree are computed (using dynamic programming [Fitch, 1971; Hartigan,1973]) such that the cost is minimized.

Let C .[i; j ]; t/ denote the minimum cost among all ordered trees that span [i; j ] and are themselvesconstructed from ordered depth-t component trees. Let C .[i; j ]; tI s/ denote the minimum cost among allordered trees that span [i; j ] (and are themselves constructed from ordered depth-t component trees) andhave sequence s 2 [i; j ] at the root. Also, let C.[i; j ]; t I sI b1; b2; : : : ; bl/ denote the minimum cost ofT[i;j ];t .u1; u2; : : : ; ulI i1; i2; : : : ; il¡1/ in which the root has label s 2 [i; j ] and for 1 · p · l, up hassequence label bp , where b1 2 [i; i1], b2 2 [i1 C 1; i2]; : : : ; and bl 2 [il¡1 C 1; j ]. Then we have thefollowing recurrence relations:

C .[i; j ]; t/ D mins2[i;j ]

C .[i; j ]; tI s/; (3.5)

C .[i; j ]; tI s/ D minA

f minb1;b2;:::;bl

fC.[i; j ]; tI sI b1; b2; : : : ; bl/ C C .[i; i1]; tI b1/

C C .[i1 C 1; i2]; tI b2/ C : : : C C .[il¡1 C 1; j ]; tI bl/gg; (3.6)

where A denotes the ordered depth-t component tree T[i;j ];t .u1; u2; : : : ; ulI i1; i2; : : : ; il¡1/ that is anelement of T[i;j ];t . The algorithm is given in Fig. 4. From Lemma 3.4 and the recurrences 3.5 and 3.6, wesee that the algorithm returns a tree whose cost is at most .1 C 2

t¡ 2

t2t / times the optimal cost.Running time complexity: We � rst compute the number of ordered depth-t component trees in T[i;j ];t .

Let F .t/ denote the set of ordered binary tree topologies of height at most t . An upper bound for jF .t/jis O.22tC1

/. Each topology (from F .t/) with l leaves can be labeled so as to obtain T[i;j ];t .u1; u2; : : : ; ulIi1; i2; : : : ; il¡1/—a member of T[i;j ];t . Recall that l · 2t and i · i1 < i2 : : : < il¡1 < j . Thus, the number

of choices for i1; i2; : : : ; and il¡1 is¡j¡i

l

¢which is in O.n2t

/. Thus jT[i;j ];t j D O.22tC1n2t

/. Now, for each

tree T[i;j ];t.u1; u2; : : : ; ulI i1; i2; : : : ; il¡1/ 2 T[i;j ];t , there are O.n2tC1/ choices of labels where the root is

labeled with a seuqence s 2 [i; j ], u1 from [i; i1];, u2 from [i1 C1; i2]; : : : ; and ul from [il¡1 C1; j ]. Also,

Page 9: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 437

for each choice of labels, it takes O.lrk/ time to compute the sequence labels at the nonroot internal nodesso as to minimize the cost (Fitch, 1971; Hartigan, 1973). Finally, there are O.n2/ subproblems of the form[i; j ]. Thus, the total running time is bounded by O.2tC2tC1

n2tC1C3rk/. For a � xed t , this is polynomialin the input size.

Lemma 3.5. The single gene duplication problem has a PTAS.

4. WINDOW METHOD

In this section, we describe a general distance-based method for reconstructing a duplication model for agiven set of genes. This method is analogous to other distance-based methods such as UPGMA (Nei, 1975;Sokal and Michener, 1958) and Neighbor-Joining (Saitou and Nei, 1987); a similar bottom-up method,called GREEDY-TRHIST, which is sequence based, is proposed by Benson and Dong (1999). Our method,called the window method, detects both single gene and multigene duplication events.

The window method works with a list L of genes. Initially, list L ¡ .s1; s2; : : : ; sn/, that is, the list ofall the genes in the order they appear on the chromosome. The window method operates on sublists of Lcalled windows. It identi� es a pair of adjacent nonoverlapping windows (of same length) that are closestto each other. List L is then modi� ed by merging these two windows. The process of identifying closestadjacent windows and merging them is repeated until the list L has a single gene. The duplication modelis reconstructed by recording the sequence of window merges. We now discuss the details.

Let L D .a1; a2; : : : ; ap/ be the list of genes at the start of the current iteration. A window Wof length l is a sublist .ai ; aiC1; : : : ; aiCl¡1/ of l consecutive genes in L . Let D.W1; W2/ denote thedistance between two adjacent nonoverlapping windows W1 and W2 of same length l, where W1 D.ai ; aiC1; : : : ; aiCl¡1/ and W2 D .aiCl; aiClC1; : : : ; aiC2l¡1/. Among all adjacent nonoverlapping win-dows of the same length, the window pair .W1; W2/ is said to be a closest pair if D.W1; W2/ isminimum. The merging step in the window method involves identifying a closest pair .W1; W2/ andmerging them to give a new window .a0

i ; a0iC1; : : : ; a0

iCl¡1/. The list L is then modi� ed resulting inL D .a1; a2; : : : ; ai¡1; a0

i ; a0iC1; : : : ; a0

iCl¡1; aiC2l; : : : ; ap/ at the end of the iteration.Several de� nitions for D, the distance measure on windows, are possible. Recall that Dist denotes

the pairwise distance measure between genes. We implemented the window method with the followingde� nitions:

D.W1; W2/ DPjDiCl¡1

jDi Dist.aj ; ajCl/

l;

where W1 D .ai ; aiC1; : : : ; aiCl¡1/ and W2 D .aiCl; aiClC1; : : : ; aiC2l¡1/. The pairwise distances betweenthe genes in L are updated using the following equations:

Dist.a0r ; a1/ D

Dist.ar ; aq/ C Dist.arCl; aq /

2;

where i · r · i C l ¡ 1, and

either 1 · q · i ¡ 1 or i C 2l · q · p

Dist.a0r ; a0

q/ DDist.ar ; aq/ C Dist.ar ; aqCl/

4

CDist.arCl ; aq/ C Dist.arC1; aqCl/

4;

where i · r < q · i C l ¡ 1:

The running time of the window method is O.n4/. The algorithm for the window method is given inFig. 5.

Page 10: Zinc Finger Gene Clusters and Tandem Gene Duplication

438 TANG ET AL.

FIG. 5. Construction of a duplication model for a given set of genes. Comments are inserted between =¤::¤=.

4.1. Simulation results

We assess the performance of the window method using simulations. In each run of the simulation, aduplication model and the gene sequences are generated for the corresponding values of the parameters.These sequences are used to compute a distance matrix which is given as input to the window method.The performance of the window method on the run is measured by the exact recovery of the originalduplication model. We note here that other evaluation criteria have been used in similar evolutionaryinference simulation studies (for instance by Benson and Dong [1999], where the evaluation is based onthe cost of the solution returned by the algorithm).

The parameters in our study are k: the length of each gene sequence, p: the per-generation mutationrate, and f : the number of generations. The duplication model and the gene sequences are generated asfollows. A list geni is used to keep track of all genes at generation i as they appear on the chromosome.List gen1, corresponding to the � rst generation, has a single gene of length k where each position in thegene is one of the four nucleotides A; C; T ; G. Suppose geni has r1 genes. Then geniC1 is obtained fromgeni by selecting a random sublist (say, of length r2), duplicating this sublist, and placing the copy nextto it in the list; geniC1 thus has r1 C r2 genes, each of length k. Finally, for each gene, the nucleotideat each site is allowed to mutate to a different nucleotide, independently and identically, with probabilityp. The gene sequences are allowed to evolve in this manner for f generations. The duplication model isobtained by keeping track of the geni ’s.

The above procedure was used to generate duplication models and sequence data for p from 0.01 to 0.5(increments to 0.05), k from 100 to 800 (increments of 100), and f D 6; 10. For each combination of theparameter values, 200 data sets were generated. We note that f D 6 resulted in an average of 25 genes ina data set and f D 10 resulted in an average of 55 genes in a data set. From each data set, the alignedgene sequences were used to create a distance matrix (using Hamming distances) and the window methodwas used to infer a duplication model from this distance matrix. The duplication model was comparedwith the original. The performance is measured by the fraction of runs for which the inferred duplicationmodel is equal to the original duplication model. Figure 6 reports two graphs, one with � xed sequencelength of 600 base pairs (with varying mutation rates and number of generations) and the other with � xedmutation rate of 0.05 (with varying sequence length and number of generations).

The graphs in Fig. 6 show the effects of mutation rate, sequence length, and number of generationson the performance of the window method. For a � xed sequence length, the performance of the window

Page 11: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 439

FIG. 6. Performance of the window method.

method is good at low and moderate ends of the mutation rate scale; however, the performance drops offas the mutation rate increases. Also, for a � xed mutation rate, the performance of the window methodimproves with an increase in the sequence length. The two graphs also show that the performance ofthe window method degrades with increasing f . Similar observations have been made in the context ofphylogeny inference from simulated data (Huson et al., 1999; Swofford et al., 1996).

The simulation study reported here is a preliminary one. Several modi� cations can be made to improvethe performance of the window method. These include using an alternative distance update formula (fromthe one discussed in Section 4) and using appropriate distance correction (Swofford et al., 1996) to accountfor the assumptions of the evolutionary model; we used uncorrected distances in our study. Finally, wenote that in our study, performance is measured by the exact recovery of the original duplication model. Itis also possible to measure the partial recovery of the original duplication model (in terms of number ofedges and blocks that are recovered); this may provide a better insight into the performance of the windowmethod at high mutation rates.

5. RECONSTRUCTING A DUPLICATION MODEL FROM A PHYLOGENY

From our discussions in Section 2, recall that every duplication model DM has a unique associatedphylogeny TDM . In this section, we consider the inverse question, namely, given a phylogeny T , does thereexist a duplication model DM such that T D TDM?

Figure 7(a) shows a phylogeny T on leaf set L 1 D fs1; s2; s3g that is not the associated phylogeny ofany duplication model. This is easily seen by considering all possible duplication models (they are shownin Fig. 7(b)) that are consistent with L 1; none of the duplication models have an associated tree equalto T . Figure 7(c) shows a phylogeny on leaf set L 2 D fs1; s2; s3; s4g that is not the associated phylogenyof any duplication model. These examples imply that given a phylogeny T , a duplication model which hasT as its associated phylogeny need not exist.

In the remainder of this section, we show that if a duplication model DM exists for a given phylogeny T

such that T D TDM , then DM is unique. This is done by using an encoding scheme that uniquely de� nes aduplication model. We conclude this section by providing an algorithm to construct the duplication model(when it exists) from a given phylogeny.

Page 12: Zinc Finger Gene Clusters and Tandem Gene Duplication

440 TANG ET AL.

FIG. 7. Counter example.

De� nitions. Let DM be a duplication model consistent with L D fs1; s2; : : : ; sng. As before, we uselc.v/ to denote the left child of v and rc.v/ to denote the right child of v. A leaf in DM that is labeledwith si is said to have index i. The height of a node v in DM is de� ned to be the number of edges in thepath from v to root r . The height of a block B is de� ned inductively as follows: the height of the blockcontaining root r is 0; if H .parent.v// denotes the height of the block containing the parent of node v,then height of block B D maxvfH .parent.v//jv is in block Bg C 1. We use MAX_HEIGHT to denote themaximum height of a block in DM. For instance, the heights of the blocks in Fig. 2 are as follows: heightof B1 D 0, height of B2 D 1, height of B3 D 2, height of B4 D 3, height of B5 D 4, and height of B6 D 5.

Encoding of nodes in DM: We de� ne an encoding of DM, denoted by Enc.DM/, in which each nodev is represented by a pair .Lv; Rv/ of leaf indices; this pair is called the LR pair of v. The LR pairis de� ned as follows: 1. The LR pair of a leaf with index i is .i; i/; 2. The LR pair of a node v is.Lv; Rv/ D .Llc.v/; Rrc.v//, where .Llc.v/; Rlc.v// and .Lrc.v/; Rrc.v// are the LR pairs of v’s left and rightchildren, respectively. We de� ne Enc.DM/ D f.Lv; Rv/jv 2 DMg.

We describe a property of Enc.DM/ that is useful for our purposes. To describe this property, weintroduce the concept of a generation list genh—a list containing nodes of a certain height in DM. A genh

list is de� ned inductively, for 0 · h < MAX_HEIGHT C1, as follows: gen0 D .r/; genhC1 is obtainedfrom genh by the following operation: every sublist hvp1 ; vp2 ; : : : ; vpb

i in genh that corresponds to a blockheight i is replaced with the sublist hlc.vp1 /; lc.vp2 /; : : : ; lc.vpb

/; rc.vp1/; rc.vp2 / : : : ; rc.vpb/i.

For instance, the generation lists for the duplication model in Fig. 1 are:

gen0 D .r/; gen1 D .v2; v1/; gen2 D .v2; v3; v9/;

gen3 D .s1; v4; v5; v7; v9/;

gen4 D .s1; s2; s3; s4; v6; v7; v9/;

gen5 D .s1; s2; s3; s4; s5; s7; v8; v9/;

gen6 D .s1; s2; s3; s4; s5; s6; s7; s8; s9; s10; s11/:

Lemma 5.1. Let DM be a duplication model consistent with L D fs1; s2; : : : ; sng. Let Enc.DM/

denote the set of LR pairs of all the nodes in DM. For any 0 · h · .MAX_HEIGHTC1/, let genh D.vi1 ; vi2; : : : ; vik /. Then Lvi1

< Lvi2< : : : < Lvik

and Rvi1< Rvi2

< : : : < Rvik.

Proof. The proof is by reverse induction on h. When h D MAX_HEIGHTC1, genh consists of all theleaf nodes in the left-to-right order s1; s2; : : : ; sn . From the de� nition of an LR pair for a leaf, we have.Lsi

; Rsi/ D .i; i/. Thus we have Ls1 < Ls2 : : : < Lsn and Rs1 < Rs2 < : : : < Rsn . Suppose the statement

is true for genh0 where h0 · MAX_HEIGHT . We show that it is true for genh0¡1.

Page 13: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 441

Now, genh0 is obtained from genh0¡1 by replacing each block (which is a sublist) of the formhvp1; vp2 ; : : : ; vpb

i with the sublist hlc.vp1/; lc.vp2/; : : : ; lc.vpb/; rc.vp1 /; rc.vp2/; : : : ; rc.vpb

/i. In genh0 ,by inductive hypothesis we have Llc.vp1 / < Llc.vp2 / < : : : < Llc.vpb

/ < Lrc.vp1 / < Lrc.vp2 / < : : : <

Lrc.vpb/ and Rlc.vp1 / < Rlc.vp2 / < : : : < Rlc.vpb

/ < Rrc.vp1 / < Rrc.vp2 / < : : : < Rrc.vpb/. Now, for a

node v, Lv D Llc.v/ and Rv D Rrc.v/. Thus, in genh0¡1, we have that Lvp1< Lvp2

< : : : < Lvpband

Rvp1< Rvp2

< : : : < Rvpb.

Next, consider a node u that appears to the left of lc.vp1/ in genh0 (a similar argument can be used fornode u appearing to the right of rc.vpb

/). Either node u or node u’s parent w is present in genh0¡1. Ifnode u is present in genh0¡1, then Lu < Llc.vp1 / (from inductive hypothesis) and thus Lu < Lvp1

; also,Ru < Rlc.vp1

/ < Rrc.vp1/ (from inductive hypothesis) and thus Ru < Rvp1

. Similarly, if w is present isgenh0¡1, then Lw · Lu < Llc.vp1

/ D Lvp1. Also, w’s right child u0 (this could be either u or u’s sibling)

appears to the left of lc.vp1/; thus, Rw D Ru0 < Rlc.vp1/.

The arguments given above together imply that the induction holds for genh0¡1.

Corollary 5.1. Let DM be a duplication model consistent with L Dgs1; s2; : : : ; sng. Then

1. No two nodes in DM have the same LR pair.2. Let v be a nonleaf node in DM. Then Lv D Llc.v/ < Lrc.v/ and Rlc.v/ < Rv D Rrc.v/.3. For a nonleaf node v with LR pair .Lv; Rv/, Lv < Rv .4. For any node v with LR pair .Lv; Rv/, the following is true: Lv is the smallest index of a leaf that is

a descendent of v and Rv is the largest index of a leaf that is a descendent of v.5. Let B D hvp1; vp2 ; : : : ; vpb

i be a block in DM. Then, Lvp1< Lvp2

< : : : < Lvpb< Rvp1

< Rvp2<

: : : < Rvpb.

6. Let the LR pairs for nodes v and u be .Lv; Rv/ and .Lu; Ru/, respectively.a. Let Lu D Lv . Then Ru < Rv iff u is a descendent of v.b. Let Ru D Rv . Then Lu < Lv iff u is a descendent of v.

Proof. The proofs of Part (1) and Part (2) follow directly from the de� nition of LR pairs and fromLemma 5.1.

Part (3): The proof is by induction on the nodes in DM. Consider a nonleaf node v in DM withLR pair .Lv; Rv/. Inductively, assume that every nonleaf descendent u, of v, has LR pair .Lu; Ru/ with1 · Lu < Ru · n. If v is present in geni , then v’s children are present in geniC1. From Lemma 5.1,Llc.v/ < Lrc.v/ and Rlc.v/ < Rrc.v/. Then, Lv D Llc.v/ < Lrc.v/ < Rrc.v/ D Rv; thus, Lv < Rv .

Part (4): Proof follows from Lemma 5.1 and Part (3).Part (5): From Lemma 5.1, Llc.vp1

/ < Llc.vp2/ < : : : < Llc.vpb

/ < Lrc.vp1/ < Lrc.vp2

/ < : : : < Lrc.vpb/

and Rlc.vp1/ < Rlc.vp2

/ < : : : < Rlc.vpb/ < Rrc.vp1

/ < Rrc.vp2/ < : : : < Rrc.vpb

/. Also, from Part (3),Lrc.vp1 / < Rrc.vp1 /. Since Lvi

D Llc.vi / and RviD Rrc.vi /, we have Lvp1

< Lvp2< : : : < Lvpb

< Rvp1<

Rvp2< : : : < Rvpb

.Part (6): We prove Part (a). The proof for Part (b) is similar. Suppose Lu D Lv and u is a descendent

of v. From Part (4), u belongs in the subtree model rooted at lc.v/. Also, from Part (2), Rlc.v/ < Rv .Applying these observations repeatedly, we infer that Ru < : : : < Rlc.lc.v// < Rlc.v/ < Rv ; thus Ru < Rv .For the converse, suppose Lu D Lv D x and Ru < Rv . From Part (4), both u and v lie on the unique pathfrom leaf sx to the root. Thus, either u is a descendent of v or v is a descendent of u. The possibility thatv is a descendent of u contradicts the assumption that Ru < Rv . Thus, we have that u is a descendentof v.

Encoding of nodes in TDM : Consider the associated phylogeny TDM of DM. Both DM and TDM de� nethe same evolutionary relationships on the elements in L ; that is, both of them induce the same topologyon any subset of elements in L . Therefore, we can de� ne a similar encoding for nodes in TDM . For nodev 2 TDM , let Sv and Gv denote the smallest and largest leaf indices, respectively, in the subtree rootedat v. We de� ne Enc.TDM / D f.Sv; Gv/jv 2 TDMg.

Lemma 5.2. Let DM be a duplication model consistent with L D fs1; s2; : : : ; sng, and let TDM be theassociated phylogeny of DM. Then Enc.DM/ D Enc.TDM /.

Page 14: Zinc Finger Gene Clusters and Tandem Gene Duplication

442 TANG ET AL.

Proof. The proof follows directly from Part (4) of Corollary 5.1. We note that Lv D Sv and Rv D Gv ,for every node v in DM.

Lemma 5.3. Let DM1 and DM2 be duplication models consistent with L D fs1; s2; : : : ; sng, and letTDM1 and TDM2 be their associated phylogenies, respectively. If Enc.DM1/ D Enc.DM2/, then TDM1 DTDM2 .

Proof. Suppose Enc.DM1/ D Enc.DM2/. From Lemma 5.2, this implies that Enc.TDM1/ D Enc.TDM2/.Part (1) of Corollary 5.1 implies that each node in TDM i has a unique LR pair from Enc.TDMi

/ .i D 1; 2/.Each node in TDM1 (and TDM2 ) can be represented by its LR pair. We will show that TDM1 and TDM2

are equal by showing that if the LR pairs of nodes v 2 TDM1 and u 2 TDM2 are equal, then the LR pairsof their children are also equal.

In TDM1 , let v be a node with LR pair .X; Y /. Let v1 and v2 be children of v with LR pairs .X; Rv1 /

and .Lv2 ; Y /, respectively. Next, consider the node in TDM2 with LR pair .X; Y/; let this node be u. Letu1 and u2 denote u’s children. Then, the LR pair of u1 is .X; Ru1 /, and the LR pair of u2 is .Lu2; Y /. Wewill show that Ru1 D Rv1 and Lu2 D Lv2 .

There are three cases to consider, namely, Rv1 < Ru1 , Rv1 > Ru1 , and Rv1 D Ru1 . We will show thatthe � rst two cases are not possible and thus Rv1 D Ru1 .

Suppose Rv1 < Ru1 . In TDM1 , consider the node w that has LR pair .X; Ru1/. From Part (6) of Corollary5.1, we see that node w is a strict ancestor of node v1 and therefore a strict ancestor of node v. Againapplying Part (6) of Corollary 5.1, we see that Rv < Ru1 . However, in TDM2 , since u1 is a descendent ofu, we have Ru1 < Rv. This is a contradiction. The case Rv1 > Ru1 can also be ruled out using a similarargument. Thus Rv1 D Ru1 .

Similarly, we can show that Lu2 D Lv2 .

We now show that the encoding scheme Enc de� ned above is unique; that is, different duplicationmodels cannot have the same encoding.

Lemma 5.4. Let DM1 and DM2 be duplication models consistent with L D fs1; s2; : : : ; sng. Thefollowing statement is true: Enc.DM1/ D Enc.DM2/ iff DM1 D DM2.

Proof. (Ã) This direction of the proof is obvious since each node has a unique LR pair. (!) FromLemma 5.3, Enc.DM1/ D Enc.DM2/ implies TDM1 D TDM2 . Assume DM1 6D DM2. Thus, there existstwo nonleaf nodes u and v, with LR pairs .Lu; Ru/ and .Lv; Rv/, respectively, that belong to the sameblock in one duplication model but not in the other. Without loss of generality, assume that u and v belongto the same block in DM1 with u appearing to the left of v. From the de� nition of a block, it follows

FIG. 8. The three possible ways in which a path from u to a leaf in u’s right subtree can cross with a path from v

to a leaf in v’s left subtree. Here, rc.u/ denotes the right child of u and lc.v/ denotes the left child of v. Node z1may or may not be equal to rc.u/; similarly, node z2 may or may not be equal to lc.v/.

Page 15: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 443

that the edge .u; rc.u// crosses the edge .v; lc.v// in DM1. Thus, from Lemma 5.1, Llc.v/ < Lrc.u/ andRlc.v/ < Rrc.u/. In addition, we have Lu < Lv < Ru < Rv ; this implies that in DM2 there is a path fromu to a leaf in u’s right subtree that crosses with a path from v to a leaf in v’s left subtree. However, inDM2, since u and v belong to different blocks, the edges .u; rc.u// and .v; lc.v// do not cross.

Figure 8 shows the three possible cases. In case (i), we have Lrc.u/ < Llc.v/. This is a contradiction.Similarly, in case (ii), Rrc.u/ < Rlc.v/ which is a contradiction. Finally, in case (iii), Lrc.u/ < Llc.v/

and Rrc.u/ < Rlc.v/ , which is a contradiction. Thus, in each of these cases, either Lrc.u/ < Llc.v/ orRrc.u/ < Rlc.v/, and we reach a contradiction.

Next we show that if two associated phylogenies are equal then their corresponding duplication modelsare also equal.

Theorem 5.1. Let DM1 and DM2 be duplication models consistent with L D fs1; s2; : : : ; sng. LetTDM1 and TDM2 be the associated phylogenies for DM1 and DM2, respectively. If TDM1 D TDM2 , thenDM1 D DM2.

Proof. TDM1 D TDM2

) Enc.TDM1 / D Enc.TDM2/

Now, Enc.DM1/ D Enc.TDM1 / and Enc.DM2/ D Enc.TDM2 /, from Lemma 5.2) Enc.DM1/ D Enc.DM2/

) DM1 D DM2, from Lemma 5.4.

We conclude this section by providing an algorithm for reconstructing DM from a given phylogeny T .If DM exists then, from Lemma 5.1, the statement about the generation list genh will be true for every0 · h · .MAX_HEIGHTC1/. Thus, we reconstruct DM by doing a top-down traversal of the nodes in T .Two minor points have to be addressed here. We assume that jEnc.T /j is equal to the number of nodesin T as otherwise, from Lemma 5.2 and Part (1) of Corollary 5.1, we conclude that the duplication modeldoes not exist. Also, every node in T can be preprocessed to identify its left child and right child. If v1

and v2 are the children of v, then v1 is v’s left child if Sv D Sv1 < Sv2 and Gv1 < Gv D Gv2 ; if for anynode, this test fails, then from Part (2) of Corollary 5.1 we conclude that the duplication model does notexist.

The algorithm starts at the root and at each iteration identi� es blocks and replaces the nodes in theblocks with their children in the appropriate order. This is accomplished by using list gen. A sublisthvij ; vij C1 ; : : : ; vij Cb

i of gen is identi� ed as a block only if the condition in Part (5) of Corollary 5.1holds for this sublist and Lemma 5.1 holds for the resulting modi� cation of gen. If at any stage gen 6D.s1; s2; : : : ; sn/ and it is not possible to identify a block, then the algorithm terminates and outputs thatthe duplication model does not exist. The correctness of the algorithm follows from Lemma 5.1. Figure 9describes the algorithm to construct the duplication model DM from T . The running time of the algorithmis O.n2/.

6. ANALYSIS OF ZNF45 FAMILY

The sixteen members of the ZNF45 family have different numbers of zinc � ngers (between 7 and 19).The amino acid sequences for the members were aligned using the Multiple Sequence Alignment programavailable through the Sequence Analysis Server at Michigan Tech (www.genome.cs.mtu.edu/map/map.html).The parameters used in the alignment were the following—amino acid substitution matrix: blosum62; gapopen penalty: 30; gap extension penalty: 2. This produced an alignment in which entire zinc � ngers werealigned against each other or against gaps. This alignment was used to construct a matrix of pairwisedistances between the members. The pairwise distance between two aligned sequences was obtained bytaking the Hamming distance between the two sequences and correcting it using a standard distancecorrection formula (for a Poisson model of sequence evolution) for protein sequences (Swofford et al.,1996). This distance matrix was given as input to the window method described in Section 4. The resultingduplication model for the 16 members of the ZNF45 family is shown in Fig. 10.

Page 16: Zinc Finger Gene Clusters and Tandem Gene Duplication

444 TANG ET AL.

FIG. 9. Description of the DM reconstruction algorithm. Comments are given between =¤::¤=.

This duplication model hypothesizes that the ZNF45 family has evolved mainly through single geneduplication events. Studies (Pavletich and Pabo, 1991; Desjarlais and Berg, 1992; El-Baradi and Pieler;1991) have shown that residues in certain key positions in a C2H2-type � nger bind to DNA. We note thatthe zinc � ngers of the ZNF45 family are quite diverse in the residues in these key positions. This suggestsa diversity in the DNA that the transcription factors encoded by these genes bind to. As noted by Shannonet al. (1998), the high degree of conservation of the KRAB domains coupled with the diversity in the zinc� ngers of the ZNF45 family members suggests that they may regulate different genes in a similar mannerduring development.

It has been suggested that families like ZNF45, which have members in a tandem array with evenspacing between the members, are gene batteries that have evolved in such a way that the family membersare coordinately expressed at various stages of development (Shannon et al., 1998; Arnone and Davidson,1997). An analysis of the upstream sequences for these genes could shed further light on this hypothesis.

7. CONCLUSION

In this paper, we described a graph-theoretic structure called a duplication model to study the evolutionaryrelationships of members of gene families that have evolved through in situ gene duplication events. Thereare several extensions to this work. Statistical inference techniques (like maximum likelihood estimation)

Page 17: Zinc Finger Gene Clusters and Tandem Gene Duplication

TANDEM GENE DUPLICATION 445

FIG. 10. The duplication model for the ZNF45 family. The left-to-right order of the above genes is the same as theirordering on the chromosome.

can be incorporated to construct duplication models. Another extension is to consider gene deletion events(which we do not presently consider) in the evolutionary model.

ACKNOWLEDGMENTS

We are very grateful to Elbert Branscomb for suggesting this set of problems and to Elbert Branscomband Lisa Stubbs at the DOE Joint Genome Institute for providing us with the sequences for the membersof the ZNF45 family and for valuable discussions. S.Y. acknowledges support by a Fellowship from theProgram in Mathematics and Molecular Biology at the Florida State University, with funding from theNational Science Foundation under Grant No. DMS-9406348.

REFERENCES

Arnone, M., and Davidson, E. 1997. The hardwiring of development: Organization and function of genomic regulatorysystems. Development 124, 1851–1864.

Ashworth, L., Batzer, M., Brandriff, B., Branscomb, E., deJong, P., Garcia, E., Garnes, J., Gordon, L., Lamerdin, J.,Lennon, G., Mohrenweiser, H., Olsen, A., Slezak, T., and Cerrano, A. 1995. An integrated, metric physical map ofhuman chromosome 19. Nature Genetics 11, 422–427.

Benson, G., and Dong, L. 1999. Reconstructing the duplication history of a tandem repeat. In 7th Int. Conf. IntelligentSystems for Molecular Biology, 44–53.

Desjarlais, J., and Berg, J. 1992. Towards rules relating zinc � nger protein sequences and DNA binding site preferences.Proc. Natl. Acad. Sci. USA 89, 7345–7349.

El-Baradi, T., and Pieler, T. 1991. Zinc � nger proteins: What we know and what we would like to know. Mechanismsof Development 35, 155–169.

Fitch, W. 1971. Toward de� ning the course of evolution: Minimum changes for a speci� c tree topology. SystematicZoology 20, 406–16.

Hartigan, J. 1973. Minimum mutation � ts to a given tree. Biometrics 29, 53–65.

Page 18: Zinc Finger Gene Clusters and Tandem Gene Duplication

446 TANG ET AL.

Hoovers, J., Mannens, M., John, R., Bliek, J., van Heyningen, V., Porteous, D., Leschot, N., Westerveld, A., and Little,P. 1992. High-resolution localization of 69 potential human zinc � nger protein genes: A number are clustered.Genomics 12, 254–263.

Huson, D., Nettles, S., Rice, K., Warnow, T., and Yooseph, S. 1999. Hybrid tree reconstruction methods. ACM J.Experimental Algorithmics, 4.

Jaitly, D., Kearney, P., Lin, G., and Ma, B. 2001. Methods for reconstructing the history of tandem repeats and theirapplication to the human genome. Unpublished manuscript.

Li, W., and Graur, D. 1991. Fundamentals of Molecular Evolution, Sinauer Associates, Inc.Margolin, J., Friedman, J., Meyer, W., Vissing, H., Thiesen, H.-J., and Rauscher III, F. 1994. Kruppel-associated boxes

are potent transcriptional repression domains. Proc. Natl. Acad. Sci. USA, vol. 91, 4509–4513.Miller, J., McLachlan, A., and Klug, A. 1985. Repetitive zinc-binding domains in the protein transcription factor IIIA

from Xenopus oocytes. EMBO J. 4, 1609–1614.Nei, M. 1975. Molecular Population Genetics and Evolution, North-Holland, Amsterdam.Ohno, S. 1970. Evolution by Gene Duplication, Springer-Verlag, Berlin, New York.Papadimitriou, C., and Yannakakis, M. 1991. Optimization, approximation, and complexity classes. J. Computer Syst.

Sci. 43, 425–440.Pavletich, N., and Pabo, C. 1991. Zinc � nger-DNA recognition: Crystal structure of a Zif2680-DNA complex at 2.1

Å. Science 252, 809–817.Saitou, N., and Nei, M. 1987. The neighbor-joining method. Mol. Biol. Evol. 4, 406–425.Shannon, M., Ashworth, L., Mucenski, M., Lamerdin, J., Branscomb, E., and Stubbs, L. 1996. Comparative analysis of

a conserved zinc-� nger gene cluster on human chromosome 19q and mouse chromosome 7. Genomics 33, 112–120.Shannon, M., Kim, J., Ashworth, L., Branscomb, E., and Stubbs, L. 1998. Tandem zinc-� nger gene families in

mammals: Insight and unanswered questions. DNA Sequence—The Journal of Sequencing and Mapping 8, 303–315.

Sokal, R., and Michener, C. 1958. A statistical mapping for evaluating systematic relationships. Univ. Kansas Sci.Bull. 28, 1409–1438.

Swofford, D., Olsen, G., Waddell, P., and Hillis, D. 1996. In Molecular Systematics chapter Phylogenetic Inference,Sinauer Associates.

Wang, L., and Gus� eld, D. 1997. Improved approximation algorithms for tree alignment. J. Algorithms 25, 255–273.Wang, L., Jiang, T., and Lawler, E.L. 1996. Approximation algorithms for tree alignment with a given phylogeny.

Algorithmica 16, 302–315.

Address correspondence to:Shibu Yooseph

Celera Genomics45 West Gude Drive

Rockville, MD 20850

E-mail: [email protected]