bioinformatic analysis of mutation and selection in the vertebrate …170790/... · 2009-02-14 ·...

ACTA UNIVERSITATIS UPSALIENSIS UPPSALA 2007 Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 348 Bioinformatic Analysis of Mutation and Selection in the Vertebrate Non-coding Genome MIKAEL BRANDSTRÖM ISSN 1651-6214 ISBN 978-91-554-6981-8 urn:nbn:se:uu:diva-8240




List of papers

The thesis is based on the following papers, hereafter referred to by their Roman numerals:

I. Smith, N. G. C., Brandström, M., Ellegren, H. 2004. Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics 84:806-813.

II. Brandström, M., Ellegren, H. 2007. The genomic landscape of short insertion and deletion polymorphisms in the chicken (Gallus gallus) genome: A high frequency of deletions in tandem duplicates. Genetics 176:1691-1701.

III. Brandström, M., Bagshaw, A. T., Gemmell, N., Ellegren, H. Microsatellite polymorphism is elevated in human recombination hotspots. Manuscript.

IV. Brandström, M., Ellegren, H. A genome-wide analysis of microsatellite polymorphism circumventing the ascertainment bias. Manuscript.

V. Brandström, M. ILAPlot: An interactive local alignment plotting tool. Manuscript.

Papers I and II are reproduced with permission from the publisher.


Printed on Lessebo bok 100g. Cover printed on Kaskad 225g. Front cover: The mosaic of variation in the chicken genome.


Contents

Introduction

Comparative genomics
    Comparing the appropriate sequences
    Bioinformatics

The non-coding genome
    Single copy DNA
        Selection on the non-coding genome
        Positive selection on non-coding sequences
        Function of non-coding sequences
    Repetitive sequences
        Microsatellites

Sequence polymorphism
    Small scale sequence length polymorphisms
        Short insertion and deletion polymorphisms
        Microsatellite polymorphisms

Research aims

Summaries of papers
    Paper I: Evidence for turnover of functional noncoding DNA in mammalian genome evolution
    Paper II: The genomic landscape of short insertion and deletion polymorphisms in the chicken (Gallus gallus) genome: a high frequency of deletions in tandem duplicates
    Paper III: Microsatellite polymorphism is elevated in human recombination hotspots
    Paper IV: A genome-wide analysis of microsatellite polymorphism circumventing the ascertainment bias
    Paper V: ILAPlot: An Interactive Local Alignment Plotting Tool

Summary in Swedish

Acknowledgement

References


Abbreviations

BAC     Bacterial artificial chromosome
bp      Basepair(s)
C.I.    Confidence Interval
Gb      Gigabasepair
Indel   Insertion/Deletion
kb      Kilobasepair
LINE    Long Interspersed Nuclear Element
LTR     Long Terminal Repeat
Mb      Megabasepair
SINE    Short Interspersed Nuclear Element
SNP     Single Nucleotide Polymorphism


Introduction

Seeing all the biodiversity surrounding us, it is hard to refrain from asking what processes are involved in forming it. We see differences between species, but also obvious differences within species. Ultimately, it is the genome, and interactions between the genome and the environment, that will determine the phenotype. The importance of the genome has directed research interest towards understanding evolution on the molecular level. What parts are important? How does DNA sequence mutate? What forces shape the patterns of variation we see? Reading and decoding the genome reveals answers to these questions.

Since the advent of DNA sequencing, the amount of available sequence data from various organisms has skyrocketed. For example, at the time of publication of the human genome (International Human Genome Sequencing Consortium 2001), there were roughly 10 gigabasepairs (Gb) of sequence deposited in the traditional divisions of the public databases GenBank, EMBL and DDBJ. Today, the same database divisions contain more than 100 Gb of sequence data (http://www.ncbi.nlm.nih.gov/Genbank/). In addition to this, there are 125 vertebrate genomes that, at the time of writing, have been sequenced to at least the state of draft assembly (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj).

In addition to the genome projects, several related projects produce vast amounts of data in the quest to answer various medical or evolutionary questions. A few examples of these include the HapMap project, aimed at developing a detailed haplotype map of the human genome (International HapMap Consortium 2005); the NISC comparative vertebrate sequencing initiative, aimed at sequencing a set of orthologous target regions in a wide range of vertebrates (Thomas et al. 2003); and the ENCODE project, aimed at providing an encyclopaedia of all functional DNA elements in the human genome (The ENCODE Project Consortium 2007). The data sets produced by projects such as these provide an important resource for the evolutionary biologist, as many questions, other than the initial questions of the data producers, can be addressed. Furthermore, the availability of these data sets has prompted the development of new computerised methods for the analysis of data.

The work presented in this thesis is based on data obtained from a number of sources of large genomic data sets. I have applied different in silico methods to address evolutionary aspects of the non-coding vertebrate genome.


The organisms of interest have been human, placental mammals or chicken. Much of the discussion in this introduction is human-centric by virtue of the human genome being among the best studied. However, most of the described phenomena and patterns seem to be shared among vertebrates. I will also cover bird-specific aspects in some detail, given that special emphasis has been placed on the analysis of the non-coding chicken genome.


Comparative genomics

Vertebrate comparative genomics, as known today, began when the mouse genome was sequenced and the genomes of both human and mouse became available (International Human Genome Sequencing Consortium 2001; Mouse Genome Sequencing Consortium 2002). The term “comparative genomics” is commonly used to denote studies where full genome sequences, full gene sets or at least significant parts thereof are used (Filipski and Kumar 2005; Hardison 2003). The addition of new genomes continuously sheds light on new aspects of the evolution of the genome.

Since Darwin, evolutionary biologists have used comparative methods to understand trait evolution. The application of the comparative method to whole genomes can be traced back to the advent of DNA-binding dyes, which made studies of karyotypes and chromosome banding possible. With these methods, it could be seen that, although the number and layout of chromosomes differed, the overall variation in genetic content among mammals was not as pronounced as initially hypothesised (Filipski and Kumar 2005). It became apparent that it was more a matter of reshuffling of regions within the different genomes. The term comparative genomics seems to have appeared in the early 1990s (Charlebois and St Jean 1995; Lyons et al. 1994; Womack and Kata 1995). At that time it was mostly used in conjunction with comparative mapping of genes, and also with high-level comparative cytological studies like chromosome painting (Wienberg and Stanyon 1995; Womack and Kata 1995).

Comparing the appropriate sequences

An underlying assumption of comparative genomics is the homology of studied sequences, in other words, that the sequences studied have a common ancestry. More specifically, the sequences must have diverged as a result of speciation; that is, they must be orthologous. If a sequence is duplicated within a genome, before speciation, then the two copies are said to be paralogous (Figure 1). The use of paralogous sequences in a comparative study would void any inferences on species-specific traits, as one cannot distinguish between what has occurred before and after the species split.
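The distinction can be made concrete with a small sketch. It assumes a rooted gene tree in which every internal node is labelled as either a duplication or a speciation event, and classifies a pair of sequences by the label of their most recent common ancestor. The tree topology below (one duplication followed by a speciation in each copy) and the node names D, S1 and S2 are assumptions chosen to be consistent with the caption of Figure 1; they are not taken from the figure itself.

```python
# Assumed gene tree: a duplication (D) at the root, followed by a
# speciation event in each of the two copies (S1 and S2).
PARENT = {"a": "S1", "c": "S1", "b": "S2", "d": "S2",
          "S1": "D", "S2": "D", "D": None}
EVENT = {"D": "duplication", "S1": "speciation", "S2": "speciation"}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def relationship(x, y):
    """Classify two distinct leaves by the event at their most recent
    common ancestor: speciation -> orthologous, duplication -> paralogous."""
    if x == y:
        raise ValueError("expected two different sequences")
    ancestors_of_x = set(path_to_root(x))
    for node in path_to_root(y):
        if node in ancestors_of_x:
            if EVENT[node] == "speciation":
                return "orthologous"
            return "paralogous"
    raise ValueError("sequences do not share a root")
```

Under these assumptions, relationship("a", "c") gives "orthologous", whereas relationship("a", "b") and relationship("a", "d") both give "paralogous", in agreement with the figure caption.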


During evolution, genomes undergo rearrangements like inversions, duplications and deletions on various scales (Kent et al. 2003; Ma et al. 2006). All this shuffling of genetic material increases the difficulty of determining orthology of regions in the genome. Much research and development effort has gone into methods for establishing orthology. I will just briefly mention two approaches that are important for this thesis.

The first approach applies when large genomic sequences are used in comparative studies, such as in Paper I and the chicken-turkey comparisons of Paper II. Here pairwise or multiple alignments are created through the use of locally anchored alignment procedures like BLASTZ (Schwartz et al. 2003), MAVID (Bray and Pachter 2003; Bray and Pachter 2004) or MLAGAN (Brudno et al. 2003). Whole genome alignments can be done with BLASTZ (Schwartz et al. 2003) or the chaining and netting method presented by Kent et al. (2003). However, when shorter sequences, like fully sequenced BACs, are to be aligned, one needs to have some knowledge of homology, which can often be obtained through homology searches like BLAST (Altschul et al. 1997). When searching with a long query sequence against a genomic database, homologous regions can often be identified as a large number of consecutive search hits. The program presented in Paper V aids this semi-manual determination of homology by parsing and visualising the BLAST search results, and cutting out sequences of interest.
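The idea of recognising a homologous region as a run of consecutive search hits can be sketched as follows. This is only an illustration and not the method implemented in the program of Paper V: the function name candidate_regions, the hit representation (subject sequence, start, end) and the thresholds max_gap and min_hits are all assumptions made for the example.

```python
from collections import defaultdict


def candidate_regions(hits, max_gap=5000, min_hits=3):
    """Group search hits into candidate homologous regions.

    `hits` is an iterable of (subject, start, end) tuples. Hits on the same
    subject sequence are merged into one region as long as the distance to
    the previous hit does not exceed `max_gap`; regions supported by fewer
    than `min_hits` hits are dropped.
    Returns a list of (subject, region_start, region_end, number_of_hits).
    """
    by_subject = defaultdict(list)
    for subject, start, end in hits:
        # Hits on the reverse strand may be reported with start > end.
        by_subject[subject].append((min(start, end), max(start, end)))

    regions = []
    for subject, intervals in sorted(by_subject.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        count = 1
        for start, end in intervals[1:]:
            if start - cur_end <= max_gap:
                cur_end = max(cur_end, end)
                count += 1
            else:
                if count >= min_hits:
                    regions.append((subject, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        if count >= min_hits:
            regions.append((subject, cur_start, cur_end, count))
    return regions
```

In practice the two thresholds would have to be tuned to the divergence of the species compared and to the repeat content of the region, which is one reason why a visual, semi-manual inspection of the hits is still useful.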

[Figure 1: gene tree with leaves c, d, a and b; events labelled "Gene duplication" and "Speciation".]

Figure 1. Orthology versus paralogy. Sequences a and c are orthologous, while a and b, as well as a and d are paralogous.

The other approach applies when using large sets of short sequences, like reads from sparse shotgun sequencing for polymorphism discovery or EST sequences. The most common method is to use a fast homology search method, like BLAST/MegaBLAST (Altschul et al. 1997; Zhang et al. 2000) or SSAHA (Ning et al. 2001), to place the sequences onto a reference genome. The reference genome may be from the same species or a related species, depending on the question of interest. The obvious problem with this method is the placement of sequences matching repeated sequences like repeat elements or genes within gene families. There are two major ways of handling this problem. The first is to require the hits to be of at least a defined quality or statistical significance level. However, with only this criterion, some sequences may match many places with similar scores, and to avoid using the incorrect placement, a second criterion is often applied. This criterion is that the best hit must have at least a predefined advantage in score over the second best hit, or that there is only a single hit. If this separation is not fulfilled, the sequence is discarded (e.g. International Chicken Polymorphism Map Consortium 2004). The other common way of handling the risk of paralogy is to use reciprocal best matches between two sets of sequences (e.g. Mouse Genome Sequencing Consortium 2002). This method ensures that the two sequences grouped as orthologs are the most similar sequences. This is a reasonable assumption, given that both sets of sequences are complete. However, if one of the sets is incomplete, there is a risk of incorrectly grouping sequences as orthologs.
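Both strategies are simple to express in code. The sketch below is a minimal illustration and not the procedure used by the cited consortia: the function names, the data layouts and the threshold values min_score and min_margin are assumptions made for the example.

```python
def unique_placement(hits, min_score=100.0, min_margin=20.0):
    """Return the location of a uniquely placed sequence, or None.

    `hits` is a list of (location, score) pairs for one query sequence.
    A placement is accepted if the best hit reaches `min_score` and either
    is the only hit passing that threshold or beats the second best hit
    by at least `min_margin`.
    """
    good = sorted((h for h in hits if h[1] >= min_score),
                  key=lambda h: h[1], reverse=True)
    if not good:
        return None
    if len(good) == 1 or good[0][1] - good[1][1] >= min_margin:
        return good[0][0]
    return None  # ambiguous placement: discard the sequence


def reciprocal_best_hits(a_to_b, b_to_a):
    """Return pairs (a, b) that are each other's best match.

    `a_to_b` maps each sequence in set A to a dict {sequence in B: score};
    `b_to_a` is the corresponding mapping in the other direction.
    Ties are resolved arbitrarily (first maximum encountered).
    """
    def best(scores):
        return max(scores, key=scores.get) if scores else None

    best_ab = {a: best(scores) for a, scores in a_to_b.items()}
    best_ba = {b: best(scores) for b, scores in b_to_a.items()}
    return sorted((a, b) for a, b in best_ab.items()
                  if b is not None and best_ba.get(b) == a)
```

The caveat about incomplete sets is easy to reproduce with the second function: if the true ortholog of a sequence is missing from one set, the next most similar sequence (typically a paralog) becomes the best match in both directions and is reported as if it were the ortholog.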

Bioinformatics

Bioinformatics, the collection, management and analysis of biological data using computers, has seen a rapid development in the last decade. Two major forces have been driving this: the rapid increase in the availability of biological data, as seen from the many large scale genomics projects, and the rapid development of computer power, with capacity roughly doubling every two years (known as Moore's law; Moore 1965). In fact, present-day genome sequencing projects would not have been possible without computer-aided data management.

The large datasets also pose computational challenges in the analysis of data. Classical computer programs for the study of molecular evolution are now mature. They often provide easy to use graphical “point and click” interfaces (e.g. Rozas et al. 2003; Tamura et al. 2007). However, for comparative genomic analysis, we have not yet seen the move to desktop programs, although this will likely be the case once full genome sequencing becomes affordable for a larger number of research groups. There are some exceptions to this, notably programs to perform full genome analysis of specific phenomena, for example the microsatellite screening and analysis program SciRoKo (Kofler et al. 2007).

On the other hand, there are nowadays a large number of web-based resources for the analysis of genomic data. For genomic sequence and gene data, as used in this thesis, GenBank (http://www.ncbi.nlm.nih.gov/), Ensembl (http://www.ensembl.org/) and the University of California at Santa Cruz Genome Browser (http://genome.ucsc.edu) have been the major sources of data. These web sites offer powerful ways of exporting data sets, based on various selection criteria including homology between species. They also provide ways of overlaying local data onto graphical presentations of the genome. There are also numerous other web resources focused on different aspects of comparative genomics, enabling researchers to do quite elaborate studies without having to resort to in-house programming solutions (e.g. the Vista tool set, http://genome.lbl.gov/vista/).

When there are no off-the-shelf computer programs or web applications available for the analysis to be done, the only solution is to set up a framework or pipeline locally. It is common practice to use available programs for the various sub-tasks within an analysis, for instance the PAML package (Yang 1997) to estimate divergence and test various models, or various alignment or homology search programs. For other parts of the analysis custom computer programs have to be written. This is often also the case for data flow between the different parts of the pipeline, as well as for data management, extraction and statistical testing.

There are many different programming languages available and suitable for bioinformatic projects. Among these, the scripting language Perl is likely the most commonly used language for programs written with the sole intent of performing a single task within a project. There are, however, several other languages available with similar properties, e.g. powerful string handling and ease of writing. Python and Java are good examples. For all these languages there are extensions available to simplify many tasks, like data input and output, and parsing of similarity search results, to mention a few. In the Perl case the module set is called BioPerl (Stajich et al. 2002), and for Python and Java there are BioPython (http://www.biopython.org) and BioJava (http://www.biojava.org/), respectively.
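As an illustration of the kind of small single-task script referred to above, the following sketch parses tabular similarity search output using only the Python standard library. It assumes the common 12-column, tab-separated layout produced by BLAST's tabular output option (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). The names Hit and parse_tabular are hypothetical, and the toolkits mentioned above provide more complete parsers.

```python
import csv
import io
from typing import NamedTuple


class Hit(NamedTuple):
    """One line of tabular search output (subset of the 12 columns)."""
    query: str
    subject: str
    identity: float
    length: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bitscore: float


def parse_tabular(handle):
    """Yield a Hit for every data line; skip blank and comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield Hit(row[0], row[1], float(row[2]), int(row[3]),
                  int(row[6]), int(row[7]), int(row[8]), int(row[9]),
                  float(row[10]), float(row[11]))


# Example with an in-memory file; a real script would open the result file.
example = "# comment line\nread1\tchr1\t98.50\t200\t3\t0\t1\t200\t5001\t5200\t1e-100\t350\n"
example_hits = list(parse_tabular(io.StringIO(example)))
```

Keeping the parser as a generator means that very large result files can be processed line by line without being read into memory first.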

Even though most parts of a comparative genomics analysis can be performed using ordinary desktop computers, there are still many tasks that are limited by the extreme computational times required. Typically these tasks are homology searches or alignments, which have to be run on supercomputers or large clusters of computers.


The non-coding genome

One of the many discoveries coming through the sequencing of the human genome was the small number of genes actually found. The initial predictions, based on the sequence, pointed towards somewhere around 30,000 protein coding genes, although the total number of gene products is considerably higher due to alternative splicing (International Human Genome Sequencing Consortium 2001). This figure reappeared at the sequencing of other vertebrate genomes, including the mouse and the chicken, both with a similar number of protein-coding genes (International Chicken Genome Sequencing Consortium 2004; Mouse Genome Sequencing Consortium 2002). It seems that the gene set is rather constant among most vertebrates, although the genome size varies significantly (Gregory 2005a). The set of protein-coding genes makes up only a few percent of the genome. The rest can be divided into a number of different categories, all of which I refer to as non-coding DNA, on the basis of them not coding for proteins (Table 1) (Filipski and Kumar 2005; Gregory 2005a; Lynch 2007). I will discuss some of these DNA classes in more detail below.

Table 1. The different major classes of DNA in the human genome.

Structure of DNA    DNA class                             Proportion of genome
Non-repetitive      Protein coding sequences              2%
                    "Conserved non-coding sequences"      3%
                    Other unique DNA                      46%
Repetitive          Interspersed repeats                  4%
                    Tandem repeats                        5%

Single copy DNA

Roughly half of the human genome can be classified as single copy non-coding DNA, i.e. unique sequence occurring only at one place in the genome (Filipski and Kumar 2005; International Human Genome Sequencing Consortium 2001). More than half of this sequence is introns (Lynch 2007). Since the discovery of introns, and even more so since it was realised that significant parts of the genome are intergenic sequences, there have been ongoing efforts to understand the role of the non-coding genome. Many of these efforts have focused on evolutionarily conserved regions, as sequence conservation is often regarded as an indicator of functional constraint. The most elaborate study of the non-coding genome to date is the recently published ENCODE project (The ENCODE Project Consortium 2007). The ENCODE project confirmed earlier results suggesting that evolutionarily conserved regions provide important functions in the genome (Drake et al. 2006; Margulies et al. 2003; Woolfe et al. 2005). However, a surprising result of ENCODE is that many seemingly unconstrained elements show evidence of being biochemically active (The ENCODE Project Consortium 2007).

Selection on the non-coding genome

Early this decade, when enough sequence data had accumulated through genome sequencing projects, evidence for widespread negative selection on the non-coding genome was presented from different interspecies comparisons (Bergman and Kreitman 2001; Loots et al. 2000; Shabalina et al. 2001). In pairwise or multispecies comparisons, regions under negative selection are expected to show a lower degree of sequence divergence, as fewer new mutations will reach fixation due to them being (slightly) deleterious. The simplicity of the method, to just scan genomic alignments for conservation, has probably contributed to its appeal.
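A scan of this kind can be sketched in a few lines. The example below slides a fixed-size window along a pairwise alignment and reports the fraction of identical columns in each window. It is a deliberately naive illustration and not the method of any of the cited studies: the function names, the window size, the step and the identity threshold are assumptions, and gap columns are simply counted as mismatches.

```python
def window_identity(seq_a, seq_b, window=50, step=10):
    """Return (window_start, identity) for windows along a pairwise alignment.

    `seq_a` and `seq_b` are the two aligned sequences (equal length, with
    '-' for gaps). Identity is the fraction of columns in the window where
    both sequences carry the same base; gap columns count as mismatches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        matches = sum(
            1
            for x, y in zip(seq_a[start:start + window],
                            seq_b[start:start + window])
            if x != "-" and x.upper() == y.upper()
        )
        results.append((start, matches / window))
    return results


def conserved_windows(seq_a, seq_b, window=50, step=10, min_identity=0.9):
    """Return the windows whose identity reaches `min_identity`."""
    return [(start, identity)
            for start, identity in window_identity(seq_a, seq_b, window, step)
            if identity >= min_identity]
```

A fixed identity threshold of this kind takes no account of local variation in mutation rate or of alignment uncertainty, which are among the limitations of conservation-based approaches.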

Based on alignments of human and mouse, the proportion of the genome conserved by negative selection was estimated to be about 5%. Excluding the 1.5% of protein coding sequence leaves 3.5% of the genome as non-coding sequences conserved through negative selection. Similar figures are obtained when more genomes are added to the alignments (e.g. Thomas et al. 2003).

The use of sequence conservation as a criterion for detection of functional regions is not without limitations. For example, the results are much affected by the alignment methods used. This becomes especially important with sequences where the divergence approaches saturation of substitution (Bergman and Kreitman 2001; Clark 2001). Furthermore, in a two-species comparison it is not possible to distinguish whether conserved regions are the result of constrained evolution or simply of a naturally low mutation rate. Adding a third species can to some extent resolve the issue, as it is less likely that regions in three divergent species are conserved by chance (Dubchak et al. 2000). However, this requires divergent species, as variation in mutation rate correlates between species (Smith et al. 2002). Interspecies comparisons tend to produce quite noisy results, and several methods have been proposed to make the analysis more robust (Elnitski et al. 2003; Siepel et al. 2005). One also has to be cautious when choosing species to ensure sufficient divergence to gain power, although there is a trade-off between power in detection and reliability of the alignments.

Later, when the chicken genome was added to the multispecies mammalian comparisons, a core set of ultra conserved regions among vertebrates could be defined. Excluding protein coding genes, this set covers about 1.25% of the human genome (International Chicken Genome Sequencing Consortium 2004).

Drake et al. (2006) used polymorphism information from the International HapMap Consortium (2005) to study polymorphism levels in conserved non-coding sequences. Derived alleles were significantly less common in these regions, indicating selective constraints rather than a reduced mutation rate (Drake et al. 2006). This and other studies indicating conservation by selection have spurred research into what the functions of these regions are.

Positive selection on non-coding sequences

With such clear evidence of negative selection in non-coding sequence, an obvious question is whether there are regions under positive selection as well. In recent years a number of studies have provided evidence for positive selection. For example, Andolfatto (2005) showed that positive selection is acting on non-coding regions in Drosophila. However, for hominids evidence for positive selection is still lacking, although it is possible that this could be due to lack of power in the tests for selection (Eyre-Walker 2006). Andolfatto (2005) contrasted polymorphism levels with divergence in the search for positive selection. Pollard et al. (2006), on the other hand, contrasted levels of divergence within a phylogeny to find regions conserved among mouse, rat and chimpanzee, but with accelerated divergence after the human-chimpanzee split. Their argument is that these human accelerated regions are regions of adaptive evolution in humans. However, adaptive evolution is not the only possible explanation of the human acceleration, as it has been suggested that the rapid divergence in humans might be due to biased gene conversion in recombination hot-spots (Galtier and Duret 2007).

Function of non-coding sequences

Already before the sequencing of the human genome, there was extensive knowledge of many kinds of RNA genes where the final product is not a protein. These include the many RNAs with enzymatic functions, and also RNAs involved in the translational machinery, like tRNA, ribosomal RNA and snRNA (small nuclear RNA, part of the spliceosome) (reviewed in Eddy 1999). However, most of these non-coding RNA genes have been discovered through experimental methods.

The newly discovered conserved non-coding sequences have spurred research into what their potential function may be. Early studies showed that some of the sequences had cis-regulatory functions (Bergman and Kreitman 2001; Loots et al. 2000). Other functions that have become apparent are transcription factor binding (Margulies et al. 2003) and developmental regulation (Woolfe et al. 2005). However, the ENCODE project (2007) found that 40% of the inter-species constrained bases were still unannotated, with function not yet determined.

Repetitive sequences

About half the human genome consists of non-unique DNA sequences that are repeated in one way or the other (International Human Genome Sequencing Consortium 2001). The amount of repetitive sequence varies between genomes; for example, the more compact chicken genome contains only about 15% repetitive sequence (International Chicken Genome Sequencing Consortium 2004). The repetitive sequences can be divided into different classes, which I will briefly describe below (Table 2). Particular emphasis will be devoted to microsatellites, as this specific class of sequences is one of the main topics of interest in this thesis.

Table 2. Different classes of repetitive sequences in the human genome

Sequence type                                    Proportion of genome
Interspersed repeats
  Long Interspersed Nuclear Elements (LINEs)     20%
  Short Interspersed Nuclear Elements (SINEs)    13%
  LTR retrotransposons                           8%
  DNA transposons                                3%
Tandem repeats
  Satellites                                     < 1%
  Minisatellites                                 < 1%
  Microsatellites                                4%

The major dichotomy when classifying repetitive sequences is between interspersed repeats and tandem repeats. Interspersed repeats are sequences repeated at different places within the genome. The majority of the repetitive sequences in the human genome (almost 50% of the genome) are interspersed repeats (International Human Genome Sequencing Consortium 2001). Mechanisms exist for most of these sequences that, in one way or the other, can replicate the sequences within the genome. Some of the motifs can be seen as selfish DNA elements, as they themselves contain all functions necessary for replication, e.g. LINEs. The transposable elements can be further divided into two classes based on whether the transposition includes transcription to RNA followed by reverse transcription to DNA before insertion, or whether it is purely DNA based (Finnegan 1989). The first class is the most abundant in human, but also in chicken (International Chicken Genome Sequencing Consortium 2004; International Human Genome Sequencing Consortium 2001). In the human genome, L1 LINEs (Long Interspersed Nuclear Elements) are the most common with regard to total sequence content. The L1 elements account for about 16% of the total genome sequence, whereas LINEs together account for roughly 20% of the genome. By numbers, SINEs (Short Interspersed Nuclear Elements) are the most abundant, with Alu elements being most common (International Human Genome Sequencing Consortium 2001). Endogenous retroviruses also belong to this first class of transposable elements. The second class, DNA elements, is much smaller and accounts for only about 3% of the genome (International Human Genome Sequencing Consortium 2001).

Interspersed repeats can pose great problems in homology searches in comparative analyses. Specifically, the high similarity between copies makes it impossible to assess orthology. The most common way of handling this is to mask the repetitive sequences before homology searches using programs such as RepeatMasker (Smit, A., unpublished, http://www.repeatmasker.org/). On the other hand, interspersed repeats can provide a good neutrally evolving marker for the study of regional variation in mutation patterns, where changes from the ancestral sequence of the repeat can be assessed (International Human Genome Sequencing Consortium 2001; Webster et al. 2006; Webster et al. 2005).

Tandem repeats, the other part of the dichotomy mentioned above, are sequences repeated in tandem head to tail, or head to head (inverted repeats). These repeats are mostly classified by their size, with the long (> 100 bp per unit) repeats referred to as satellite DNA (Corneo et al. 1967), a shorter group of repeats with units between 10 and 30 bp called minisatellites (Jeffreys et al. 1985), and the shortest repeats, with units of up to five or six bp, called microsatellites (Litt and Luty 1989; Weber and May 1989). Satellite DNA is mostly found as tandemly repeated monomers of about 170 bp, known as alpha-satellite repeats, in the centromeres. Other tandem repeats are spread throughout the genome, with the telomeres as another notable region rich in repeats (International Human Genome Sequencing Consortium 2001).

Microsatellites

It was found early on that the occurrence of microsatellites in the genome is much higher than would be expected by pure chance (Hamada and Kakunaga 1982; Hamada et al. 1982; Tautz and Renz 1984; Tautz et al. 1986). For example, the International Human Genome Sequencing Consortium (2001) estimated that 3% of the human genome comprises microsatellite sequence. However, the microsatellite content varies between species, with a positive correlation between genome size and microsatellite number (Dieringer and Schlötterer 2003; Ellegren 2004; Toth et al. 2000).

An intriguing aspect of microsatellite occurrence is the difference in prevalence of different motifs, including differences within repeat unit length classes. For example, (CA)n is the most common dinucleotide repeat motif in the human genome, accounting for roughly 50% of the dinucleotide repeats found, while GC repeats account for less than one percent (International Human Genome Sequencing Consortium 2001). Similar skews are seen for other repeat unit length classes (International Human Genome Sequencing Consortium 2001). The skew tends to differ among species (Dieringer and Schlötterer 2003; Toth et al. 2000).
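As an illustration of how such motif skews could be tallied, the following is a minimal sketch that counts perfect dinucleotide repeats in a sequence. It is not the procedure used in the cited studies. The function names, the minimum of six repeat units and the pooling of motifs with their rotations and reverse complements (so that CA, AC, TG and GT form one class) are assumptions made for the example.

```python
import re
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_motif(motif):
    """Return the alphabetically smallest form among all rotations of the
    motif and of its reverse complement (e.g. CA, TG and GT all map to AC)."""
    revcomp = motif.translate(COMPLEMENT)[::-1]
    candidates = []
    for m in (motif, revcomp):
        candidates.extend(m[i:] + m[:i] for i in range(len(m)))
    return min(candidates)

def count_dinucleotide_repeats(seq, min_units=6):
    """Count perfect dinucleotide repeats of at least min_units units.
    The threshold min_units is an assumed, illustrative value."""
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1))
    counts = Counter()
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[0] == unit[1]:
            # Runs such as AAAA... are mononucleotide repeats, not dinucleotide.
            continue
        counts[canonical_motif(unit)] += 1
    return counts
```

Applied to a whole genome, the relative sizes of the resulting motif classes would give the kind of skew described above. Imperfect repeats are ignored in this sketch.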

The distinction between microsatellites and minisatellites can seem arbitrary. Satellite DNA got its name from the observation that it formed a satellite band in caesium-chloride density gradient separation experiments (Corneo et al. 1967). When a smaller class of repeats was found, it was dubbed minisatellites (Jeffreys et al. 1985), and consequently the smallest repeats were called microsatellites (Litt and Luty 1989; Weber and May 1989). The mutation process is distinctly different between minisatellites and microsatellites, although for both types of repeats it involves the addition or loss of repeat units. Whereas minisatellites mutate through unequal crossing over or gene conversion at recombination (Berg et al. 2003), microsatellites mutate through replication slippage (Levinson and Gutman 1987; Schlötterer and Tautz 1992). Replication slippage involves the dissociation and shifted reassociation of the DNA strands during replication and leads to the addition or loss of repeat units.

As for many of the classes of non-coding DNA, the question of whether some function can be attributed to microsatellites has received considerable attention. For instance, microsatellites within genes have been well studied, including examples of microsatellite-associated disease involving trinucleotide expansions (Li et al. 2004; Lindblad and Schalling 1999). There are also examples from in vitro studies of microsatellites being involved in gene regulation (Contente et al. 2002; Struhl 1985). Furthermore, there is an example of a microsatellite affecting social behaviour in a vole species (Hammock and Young 2004). It has also been debated whether microsatellites are involved in recombination, either as recombination signals, or with recombination being mutagenic to microsatellites. There is some experimental evidence for the latter (Hile and Eckert 2004). However, most current evidence points towards this not being the case (Morral et al. 1991), with the strongest evidence coming from the non-recombining Y chromosome, which has microsatellite mutation rates similar to those of autosomes (Kayser et al. 2000).

Sequence polymorphism

Polymorphisms are generally considered to be naturally occurring variation between individuals within a population, or between populations within a species. Some polymorphisms are seen as phenotypic differences between individuals, but most are manifested only as differences in the primary DNA sequence, without a noticeable phenotypic effect, and may even be selectively neutral. Intraspecific polymorphism is also denoted diversity, as opposed to interspecific divergence. The differences in the DNA sequence between individuals can occur on many scales, ranging from large-scale segmental duplications down to single bp differences.

The amount of genetic diversity seen varies on many different levels. For instance, there are interspecies differences, where some species have high levels of genetic diversity while others have low levels; diversity is, for example, higher in chicken than in humans (International Chicken Polymorphism Map Consortium 2004). There is also within-genome variation between chromosomes (International Chicken Polymorphism Map Consortium 2004) or even at finer levels (Berlin et al. 2006; Gaffney and Keightley 2005; International Chicken Polymorphism Map Consortium 2004; Spencer et al. 2006). There are two basic forces governing the levels of genetic diversity seen: mutation and selection. Mutation governs the emergence of new polymorphisms, whereas selection governs the fate of mutations once they occur.

New mutations occur continuously, for instance as failed repair of errors at replication of the DNA, at recombination or as lesions due to extrinsic factors. If a new mutation is neutral, only genetic drift will govern its fate, and the fixation rate is determined by the effective population size. In this case, given that the effective population size is constant, the variation in the levels of genetic diversity will reflect the underlying mutation rate variation (Kimura 1968; Li 1997). On the other hand, if there is selection on a mutation, either positive or negative, the new mutation will be more rapidly lost or fixed than under neutrality. However, in this process, linked loci will also be affected, and as a consequence the levels of polymorphism around a selected site will be reduced. In the case of a locus with a deleterious allele, selection will reduce the frequency of the allele, and as a consequence the frequency of neutral alleles at linked loci will be reduced. This process is called background selection (Charlesworth et al. 1993). In the case of positive selection on an advantageous allele, the allele will rapidly increase in frequency. In this case neutral alleles at linked loci will be dragged along and also increase in frequency through genetic hitch-hiking (Kaplan et al. 1989; Maynard Smith and Haigh 1974). Thus, in both cases, the levels of recombination will also affect the levels of genetic diversity seen in a population, as recombination rate determines how far from the selected site effects of background selection and hitch-hiking will be seen.

What are the consequences of the above on the levels of genetic diversity within a genome? To mention a few: Firstly, if there is variation in mutation rate we can expect to see variation in the levels of genetic diversity. This has been shown in e.g. birds and rodents (Berlin et al. 2006; Gaffney and Keightley 2005). Secondly, we can expect to see a positive correlation between levels of polymorphism and recombination rate (e.g. Hudson 1994; Nachman 2001; Spencer et al. 2006). Thirdly, effects of selection and drift should affect different types of polymorphisms similarly, i.e. the levels are expected to correlate, as shown by Mills et al. (2006).

An obvious question in extension to this is how we can distinguish between selection and mutation rate in a region of low polymorphism. Probably the most common approaches are to use the Hudson-Kreitman-Aguade test of neutrality (Hudson et al. 1987) or the McDonald-Kreitman test (McDonald and Kreitman 1991). The latter was initially devised to contrast synonymous polymorphism and divergence with non-synonymous polymorphism and divergence. It was recently used in a more generalised framework, where a neutral region is compared to a region in which one would like to test whether selection is acting, to infer positive selection in non-coding regions of the Drosophila genome (Andolfatto 2005). The problem with this approach is that one needs a priori knowledge of what to use as a neutral reference. This is not always evident (e.g. Andolfatto 2005).
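The logic of the generalised McDonald-Kreitman comparison can be sketched as a test on a 2x2 table of polymorphism and divergence counts for a neutral reference and a test region. The code below is a minimal illustration, not the implementation used by any of the cited authors. The function names are hypothetical, the significance test chosen here is a two-sided Fisher's exact test, and all counts must be non-zero for the ratio to be defined.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def mk_test(poly_neutral, div_neutral, poly_test, div_test):
    """Return (p-value, ratio), where ratio is the divergence-to-polymorphism
    ratio of the test region relative to that of the neutral reference.
    A ratio above 1 indicates excess divergence in the test region."""
    p_value = fisher_exact_two_sided(poly_neutral, div_neutral, poly_test, div_test)
    ratio = (div_test / poly_test) / (div_neutral / poly_neutral)
    return p_value, ratio
```

With hypothetical counts such as `mk_test(40, 10, 20, 30)`, the test region shows six times more divergence per polymorphism than the reference, which under the assumptions of the test would be read as a signal of positive selection. The result depends entirely on the reference really being neutral.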

Small scale sequence length polymorphisms

Polymorphisms in sequence length, i.e. the insertion or removal of sequence, occur on various scales. The most commonly occurring variations are in the size range of one to fifty bp, with a negative exponential relationship between length and occurrence (Mills et al. 2006). However, longer sequences are also involved in length variation, with lengths up to several hundred thousand bp (Bailey et al. 2002). It is likely that the longer the insertions and deletions are, the greater the risk that they are lethal or strongly deleterious, simply because they affect more DNA. However, whether the observed size distribution is caused by selection or by long mutations arising less often is not known. Furthermore, the mutation rate, and thus to some extent the occurrence, of length polymorphisms is dependent on sequence context. The most obvious example of this is tandemly repeated regions, like microsatellites or minisatellites. Such regions have a higher probability of mutating than unique sequence, although the mechanism varies (Bois 2003; Ellegren 2004; Mills et al. 2006).

Short insertion and deletion polymorphisms

Short non-repetitive insertions and deletions (indels) have to date not been studied in the same detail as base substitutions. A number of studies have used comparative genomic approaches to study divergence between species through indels (Chen et al. 2007; Makova et al. 2004; Ogurtsov et al. 2004; Taylor et al. 2004; Yang et al. 2004). However, indels in alignments have mostly been regarded as a nuisance, as they cause sparse columns in alignments that cannot be used in divergence estimates with methods aimed at studying substitution patterns. This can probably be traced to the fact that the patterns of indels seen in alignments are to a large extent dependent on the alignment parameters (e.g. Holmes 2005). Thus, the ideal solution would be to study indels segregating in a population, where the alignments will be mostly unambiguous.

Most studies where polymorphism screening has been performed have been aimed at SNPs or microsatellites. Recently, however, two large polymorphism screenings have been performed where indels have been characterised: one in chicken (International Chicken Polymorphism Map Consortium 2004) and one in human (Mills et al. 2006). These studies have shed new light on the distribution of indels in the genome, with perhaps the most important findings being that indels are in the order of one tenth as common as SNPs and that the genomic densities of indels and of SNPs correlate strongly within the human genome.

The strong correlation between the occurrence of indels and SNPs suggests that the same evolutionary forces affect them. There is evidence from in vitro studies that indels, similarly to substitutional mutations, are caused by erroneous replication of DNA (Bierne et al. 1997; Bierne and Michel 1994). However, unless the fidelity of replication with regard to indels and SNPs is affected by the same factors to create such a correlation, a more likely scenario is that population genetic effects, other than mutation rate variation, cause the correlation. This could be variation in the effects of background selection or hitch-hiking due to variation in recombination rate (Charlesworth et al. 1993). Another possible explanation could be that recombination is mutagenic (Hellmann et al. 2003; Lercher and Hurst 2002).

A common result from many studies of indels is the presence of a deletion bias of up to five deletions per insertion event (Comeron and Kreitman 2000; Cooper et al. 2004; Mills et al. 2006; Neafsey and Palumbi 2003; Ophir and Graur 1997; Petrov et al. 2000; Vinogradov 2002; Zhang and Gerstein 2003). This bias has been argued to be an important aspect in the evolution of genome size, as it keeps the genome size down (Kozlowski et al. 2003; Vinogradov 1997; Vinogradov and Anatskaya 2006). However, others have argued that such a small bias would not be capable of counteracting large-scale events such as insertions of transposable elements or segmental duplication (e.g. Gregory 2003; Gregory 2005a). Moreover, it is unlikely that the fitness effects of individual small deletions would be sufficient for natural selection to favour a mechanism that affects the indel bias.

Microsatellite polymorphisms

Microsatellites are probably the most frequently used polymorphic markers today. This owes to a number of important properties that make them attractive to work with: as mentioned earlier, microsatellites are amply available in the genomes of most eukaryotes; it is comparatively easy to design new markers without the need for full genomic sequence, e.g. through enrichment libraries; microsatellites are highly informative, as a single locus can have many alleles, not only two as for SNPs; and typing of microsatellites can be done through simple fragment length separation (see e.g. Buschiazzo and Gemmell 2006; Ellegren 2004).

To use microsatellites in population genetic contexts to infer, for instance, population separation, mutational models describing mutational patterns are required. Many models for microsatellite mutation have been proposed to match the underlying mutational mechanism, including the commonly used stepwise mutation model and derivatives of it (reviewed in Buschiazzo and Gemmell 2006). However, studying microsatellite mutation using available microsatellite markers has been difficult. The major reason for this is that most markers available from population genetic studies have passed through a screening process where high heterozygosity has been a selection criterion. Thus an ascertainment bias is likely to have been introduced, and any description of variation within and between genomes using such microsatellites will be inherently biased. With this caveat in mind, these markers have shown evidence of very high mutation rates, in the order of one mutation per 1000 generations, although this varies between loci (Ellegren 2000).
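The stepwise mutation model mentioned above can be illustrated with a small simulation in which an allele gains or loses one repeat unit per mutation event. This is a minimal sketch of the simplest symmetric form of the model. The function name, the lower bound on allele length and the default rate of 0.001 per generation (chosen to match the order of magnitude quoted above) are assumptions of the example.

```python
import random

def simulate_smm(start_units, generations, mu=1e-3, min_units=1, rng=None):
    """Follow one microsatellite lineage under a symmetric stepwise mutation
    model: in each generation a mutation occurs with probability mu and
    changes the allele by +1 or -1 repeat unit with equal probability."""
    rng = rng or random.Random()
    units = start_units
    for _ in range(generations):
        if rng.random() < mu:
            units += rng.choice((-1, 1))
            # Assumed lower bound: an allele cannot shrink below min_units.
            units = max(units, min_units)
    return units
```

Derivatives of the model would replace the symmetric single-unit step with, for example, length-dependent rates or a directional bias. Those extensions are not implemented here.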

Another approach to studying microsatellite mutations is to analyse de novo mutations seen in pedigrees (Primmer et al. 1996; Weber and Wong 1993), which has shown that longer microsatellites are more mutable than shorter ones. This finding agrees with the general practice of choosing long microsatellites when screening for polymorphic loci to be used as markers. Furthermore, there seems to be a bias towards mutations causing increases in size rather than decreases (Brinkmann et al. 1998; Kayser et al. 2000; Wierdl et al. 1997). The consequences of such continuous growth are quite obvious. There seems, however, to be a limit on microsatellite length, as most microsatellites do not exceed a certain length. It is likely that this is due to the accumulation of point mutations, which stabilises the microsatellite by dividing it into two smaller microsatellites and thereby stops its growth (Calabrese et al. 2001; Kruglyak et al. 1998).

Research aims

The general aim of this thesis has been to study the evolution of the non-coding genome. More specifically, to use bioinformatic approaches to explore:

• the extent of negative selection on vertebrate non-coding sequences.
• properties and evolution of short insertion and deletion mutations in the chicken genome.
• effects of recombination on microsatellite polymorphisms.
• properties and evolution of microsatellites in the chicken genome.

Summaries of papers

Paper I: Evidence for turnover of functional noncoding DNA in mammalian genome evolution

One of the most important questions regarding the evolution of the non-coding genome is to what extent it is functional and constrained by negative selection. The most common approach to this has been to use multiple species comparisons and infer function from the conservation of sequences (Elnitski et al. 2003; Glazko et al. 2003; Ludwig 2002). However, there are aspects that can potentially render the proportion of conserved sequence (PC) an inaccurate estimate of the proportion under negative selection (PN). It is well known that there is strong regional variation in mutation rates within genomes (Ellegren et al. 2003; Silva and Kondrashov 2002; Smith et al. 2002; Wolfe et al. 1989). As a result, some regions found to be conserved merely reflect a low mutation rate (Ellegren et al. 2003; Hare and Palumbi 2003). Thus one can divide PC into two fractions: sequence conserved through negative selection (PCN) and sequence conserved through variation in mutation rate (PCM).

The use of PCN is problematic for other reasons too. Using PCN in comparisons of distantly related species to estimate PN will not estimate the current day PN. Rather, it will estimate the shared proportion of the genome that has been under negative selection among the studied species. There is evidence for a high rate of turnover of regulatory elements (Dermitzakis and Clark 2002; Ludwig 2002; Ludwig et al. 2000; Ohta 2002), and in light of this, it is unlikely that the patterns of negative selection on non-coding sequences have been constant over such time spans. Thus, PCN will underestimate PN if there is turnover. For instance, based on whole genome alignments of humans and rodents, the estimated proportion of the genome being conserved and non-coding is 3.5% (Gibbs et al. 2004; Mouse Genome Sequencing Consortium 2002). However, if there is an independent turnover of 50% of the constrained non-coding sequences on each of the two branches connecting rodents and humans to their common ancestor, then only a quarter of the constrained sequence would still be shared, and PN could be as much as 14%.
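The arithmetic behind the 14% figure can be made explicit. The sketch below assumes that turnover acts independently on the two branches, so that the shared fraction is the product of the fractions retained on each branch. Under that reading, 3.5% shared with 50% turnover per branch gives 0.035 / (0.5 × 0.5) = 0.14. The function name is hypothetical.

```python
def implied_pn(shared_conserved, turnover_per_branch):
    """Proportion under negative selection implied by the shared conserved
    proportion, assuming independent turnover on each of two branches."""
    retained_on_both = (1.0 - turnover_per_branch) ** 2
    return shared_conserved / retained_on_both

# 3.5% shared between humans and rodents, 50% turnover on each branch.
pn = implied_pn(0.035, 0.5)
```

With no turnover the function simply returns the shared proportion, which is the implicit assumption when conservation alone is used to estimate PN.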

In this paper we investigated the turnover of sequences under selective constraint, using a dataset of sequences from homologous regions of eight mammals.

Results and discussion

We used sequence data from eight mammalian species previously reported by Thomas et al. (2003). The sequences were aligned with either MAVID (Bray and Pachter 2003) or MLAGAN (Brudno et al. 2003), as different alignment methods affect the patterns of sequence divergence and conservation seen. In general both alignment programs produce well-behaved alignments, although differences in gap placement yield slightly different divergence estimates.

To estimate PCM we took a simulation approach similar to that of Hare and Palumbi (2003). Simulation of mutation rate variation is complicated by the fact that mutation rates vary at different levels, with most variation found on the bp and kb levels (Silva and Kondrashov 2002). To account for this we used a model with mutation rate variation at these two levels. We used the gamma distribution to model the different parameters, as a single shape parameter, commonly called alpha, allows the distribution to take different forms. The small scale rate variation was modelled as an among-site gamma distribution, while the regional variation was modelled with an among-region gamma distribution. We estimated the alpha parameter of the among-site variation from the genomic alignments, and the among-region alpha parameter from human-baboon alignments. By simulating multiple alignments, the alpha parameters were tuned to 6 and 25, for among-site and among-region variation, respectively. As this simulation is tuned to real sequences, it will overestimate rather than underestimate PCM, as conservation seen in the real sequences is the sum of PCM and PCN.
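The two-level rate model described above can be sketched as follows. This is an illustrative simplification and not the simulation used in the paper: each region draws a gamma-distributed rate multiplier, each site within it draws a second multiplier, and a site differs between two sequences with a probability proportional to the product. The function names, the region length and the linear mapping from rate to substitution probability are assumptions. The default shape parameters are the values 6 and 25 quoted in the text.

```python
import random

def simulate_site_rates(n_regions, region_len, alpha_site=6.0,
                        alpha_region=25.0, rng=None):
    """Draw relative mutation rates with among-region and among-site gamma
    variation. Both gamma distributions are scaled to have mean 1."""
    rng = rng or random.Random()
    rates = []
    for _ in range(n_regions):
        region_rate = rng.gammavariate(alpha_region, 1.0 / alpha_region)
        for _ in range(region_len):
            site_rate = rng.gammavariate(alpha_site, 1.0 / alpha_site)
            rates.append(region_rate * site_rate)
    return rates

def simulate_identity(rates, divergence, rng=None):
    """Return one boolean per site: True if the site is identical in a
    simulated pairwise alignment. The substitution probability is taken as
    divergence * relative rate, capped at 1 (a crude, assumed mapping)."""
    rng = rng or random.Random()
    return [rng.random() >= min(1.0, divergence * r) for r in rates]
```

Running a conserved-region detector over such simulated alignments would give an estimate of the conservation expected from rate variation alone, which is the role PCM plays in the text.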

A simple method to define conserved regions in a pairwise alignment is to identify regions with a given minimum size and level of conservation. This approach is, obviously, highly dependent on the parameters chosen. However, as our primary intent was to study the turnover of functional noncoding DNA, we chose to tune our detection to a given benchmark of PCN = 1% between mouse and human. Using a 50 bp long sliding window, where we varied the threshold of conservation required, and also simulations of PCM, we found 90% conservation to be the threshold where PCN was 1%.
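A sliding-window detector of the kind described above might look like the following sketch. It reports the proportion of alignment columns covered by at least one window that meets the identity threshold. The function name and the treatment of gap columns as mismatches are assumptions. The window size of 50 bp and the threshold of 90% are the values given in the text.

```python
def proportion_conserved(seq_a, seq_b, window=50, threshold=0.9):
    """Proportion of columns in a pairwise alignment that fall inside at
    least one window whose fraction of identical columns is >= threshold."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if n < window:
        return 0.0
    # Gap columns are counted as mismatches (an assumed convention).
    match = [a == b and a != "-" for a, b in zip(seq_a, seq_b)]
    covered = [False] * n
    matches_in_window = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            matches_in_window += match[start + window - 1] - match[start - 1]
        if matches_in_window / window >= threshold:
            for i in range(start, start + window):
                covered[i] = True
    return sum(covered) / n
```

Under the decomposition used in the text, PCN would then be estimated as the value obtained on real alignments minus the value obtained on alignments simulated with mutation rate variation only.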

The basic idea of the test for turnover of functional elements is that PCN will be negatively correlated with pairwise divergence in the presence of turnover. This can be tested using all the 21 pairwise comparisons among the 8 species in our alignment. We acknowledge that the pairs are not all fully independent, as they share some ancestry. This should not, however, be critical to the test of turnover. Coding sequences were removed from the alignments, and PCN was estimated as described, as was divergence, K, which was estimated using PAML (Yang 1997). If we assume that the turnover of functional noncoding elements is proportional to the divergence, then we expect a simple negative exponential model to describe the relationship of PCN and K,

$\mathrm{PCN} = \mathrm{PN}\, e^{-BK}$

where B is the proportionality constant. Taking the natural logarithm of both sides gives

$\ln(\mathrm{PCN}) = \ln(\mathrm{PN}) - BK,$

which can be used to fit a linear regression, as ln(PCN) and K are available from the pairwise comparisons (Figure 2). Based on this model ln(PN) is -2.3, corresponding to a PN of 10%. This figure should be treated with caution, though, as the y-axis intercept is uncertain. However, the qualitative finding that there is turnover is rather more certain. The finding of turnover is robust to the choice of alignment program, as indicated when we used MLAGAN instead of MAVID to align the sequences.
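The log-linear fit can be sketched with ordinary least squares. The intercept of the regression of ln(PCN) on K estimates ln(PN) and the negated slope estimates B. An intercept of -2.3 corresponds to e^(-2.3), roughly 0.10, matching the 10% quoted above. The function name is hypothetical and no data from the paper are reproduced here.

```python
from math import exp, log

def fit_turnover(divergences, pcn_values):
    """Least-squares fit of ln(PCN) = ln(PN) - B*K.
    Returns (PN, B) estimated from paired values of K and PCN."""
    ys = [log(p) for p in pcn_values]
    n = len(ys)
    mean_k = sum(divergences) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in divergences)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(divergences, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    # The intercept is ln(PN); the slope is -B.
    return exp(intercept), -slope
```

Because the intercept is an extrapolation to zero divergence, small changes in the slope shift the estimate of PN considerably, which is one way to see why the text treats the 10% figure with caution.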

The results presented here unveil a potential problem of current comparative methods of finding functionally constrained sites. Cooper et al. (2003) suggest that a broad mammalian sample would suffice to detect selection at the single bp level. Their method relies on the assumption of no turnover, and will never be capable of detecting individual bases under constraint, if there is turnover. However, sequences under extreme constraints will likely still be detectable. A more viable alternative might be to use many closely related species instead (Boffelli et al. 2003).

Figure 2. The relationship between the proportion of the noncoding genome estimated to be conserved by negative selection, PCN, and the pairwise divergence, K, plotted for 21 different pairwise mammalian comparisons. The regression line for ln(PCN) versus K is also shown, along with its equation and the R² value.

We emphasise the need for future work to confirm the results presented here, especially in terms of a wider sample of genomic regions. Furthermore, the addition of more species could provide the possibility of selecting fully independent species pairs, which would benefit the statistical analysis.

Paper II: The genomic landscape of short insertion and deletion polymorphisms in the chicken (Gallus gallus) genome: a high frequency of deletions in tandem duplicates

Despite the fact that indels contribute significantly to the divergence between species (Britten 2002; Britten et al. 2003; Chimpanzee Sequencing and Analysis Consortium 2005), they have received relatively little attention. One reason for this could potentially be that indels are a rather heterogeneous class of mutations, spanning from small events encompassing one or a few bp, up to large segmental duplications and deletions. Some of these types have been investigated to varying detail, such as retrotransposons (Price et al. 2004), minisatellites (Bois 2003), microsatellites (Ellegren 2004) and segmental duplications (Samonte and Eichler 2002).

Small mutation events, encompassing a few bp of non-repetitive nature, have been studied using comparative genomics approaches (Makova et al. 2004; Ogurtsov et al. 2004; Taylor et al. 2004; Yang et al. 2004). However, methodological problems of aligning divergent sequences might produce alignments with ambiguous indel placements and properties (Holmes 2005). A preferable solution to this is to use alignments of closely related species, or even better, intraspecific detection of polymorphism (e.g. Mills et al. 2006).

There are a number of reasons to gather more information on the evolution of indels: 1) Indels are likely to represent an important source of phenotypic variation (Chen et al. 2005a; Chen et al. 2005b). 2) Indels have been recognised as an important determinant of genome size (Gregory 2005b). 3) Analysis of indels might reveal constraints in e.g. regulatory regions with, at least in part, length dependence rather than sequence dependence (Ometto et al. 2005). 4) Indels might be used as unique event markers for phylogenetic reconstruction, thus avoiding some of the problems of homoplasy and convergence (Fain and Houde 2004; Hamilton et al. 2003; Kawakita et al. 2003; Müller 2006).

Results and discussion

We used a set of 274,000 indel polymorphisms provided by the International Chicken Polymorphism Map Consortium (2004). These length polymorphisms consist of length variants both in unique sequence and in repetitive sequence, i.e. microsatellites. The former group was the focus of the study, and thus we removed all indels in repetitive sequence, where the longer allele was repeated three or more times. The resulting dataset used in all analyses consisted of 140,484 indels.
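One way to read the filtering criterion is that an indel is treated as repetitive when the inserted or deleted word occurs as three or more tandem copies in the longer allele. The sketch below implements that reading only. The function names, the coordinate convention and the requirement of perfect copies in phase with the indel are assumptions and may differ from the procedure used in the paper.

```python
def tandem_copy_number(longer_allele, start, length):
    """Count perfect tandem copies of the indel word in the longer allele.
    The word is longer_allele[start:start + length]; adjacent copies are
    counted in both directions, including the word itself."""
    unit = longer_allele[start:start + length]
    copies = 1
    pos = start - length
    while pos >= 0 and longer_allele[pos:pos + length] == unit:
        copies += 1
        pos -= length
    pos = start + length
    while longer_allele[pos:pos + length] == unit:
        copies += 1
        pos += length
    return copies

def is_repetitive(longer_allele, start, length, min_copies=3):
    """True if the indel word is tandemly repeated min_copies or more times."""
    return tandem_copy_number(longer_allele, start, length) >= min_copies
```

Under this reading an indel inside a duplicated word (two copies) is retained in the dataset, while an indel inside a run of three or more copies is removed as microsatellite variation.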

The size distribution of indels follows a negative exponential distribution, with 1 bp events being by far the most common and with an average length of 3.6 bp. We found a total indel density of roughly one indel per 5 kb, which is 5% of the SNP rate (International Chicken Polymorphism Map Consortium 2004). In general the patterns of indels are similar to what has been seen in other species (Bhangale et al. 2005; Chimpanzee Sequencing and Analysis Consortium 2005; Mills et al. 2006; Petrov et al. 2000; Zhang and Gerstein 2003). For 2 bp to 5 bp indel words, we found significant deviations from the expected frequencies. For instance, AT and AG were overrepresented, while GC was underrepresented. In general, AT-rich motifs were overrepresented.

We made several observations indicating that replication slippage is an important mechanism for indel formation. Indels within duplicated words were significantly overrepresented among the polymorphic indels. Moreover, using turkey as outgroup, we were able to infer the direction of mutation for 569 indel events. Among these, we saw a strong overrepresentation of deletions in duplicate words, whereas this overrepresentation was not nearly as pronounced for insertions. Furthermore, the above mentioned overrepresentation of AT-rich motifs is compatible with the weaker association between the DNA strands, which possibly can promote temporary dissociation, i.e. the first step in replication slippage (Levinson and Gutman 1987).

It has been debated whether small indels provide a significant contribution to the evolution of genome size (reviewed in e.g. Gregory 2004; Petrov 2002a). Using the set of indels where we could infer the direction of the mutation (insertion or deletion), we found a bias of 1.4 towards deletions. This is at the lower end of what has been seen in other species (Comeron and Kreitman 2000; Cooper et al. 2004; Neafsey and Palumbi 2003; Ophir and Graur 1997; Petrov et al. 2000; Vinogradov 2002; Zhang and Gerstein 2003). It is unlikely, however, that there is selection on a modified insertion-deletion ratio, as each event involves a limited number of bp (Petrov 2002b).

We divided the chicken genome into 1 megabasepair (Mb) non-overlapping windows. The indel distribution within the chicken genome shows significant heterogeneity, with a trend of lower densities in the small microchromosomes. The density of indels was strongly correlated with the SNP density in the windows (Figure 3). Local genomic context has been shown to influence the rates of evolution and intraspecific polymorphism. This often results in correlations between several genomic parameters, including indels (Hardison et al. 2003). For instance, GC content has been shown to correlate with nucleotide substitution rate (reviewed in Ellegren et al. 2003), possibly due to a combination of biased gene conversion and CpG-mutability (Webster et al. 2006). We found that both indel and SNP density correlate with GC content. This could indicate that rates of both types of polymorphism are affected by a similar mechanism, manifested in GC content. However, after correcting both indel density and SNP density for GC level, we still see a significant correlation between indels and SNPs. This indicates that GC content alone cannot explain the correlation. Indeed, it could be the case that the same factors, for instance repair, replication or recombination, affect both indels and SNPs. Another possible explanation is that the correlation is due to varying levels of selection, governing the overall levels of variability. To fully disentangle the relationships, further studies are required.

Figure 3. Correlates of indel and SNP density in 1 Mb windows. A significant correlation between indel density and SNP density is seen (A). Both indel density (B) and SNP density (C) are correlated with GC level. However, correcting indel density and SNP density for GC does not remove the correlation (D).
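One way to read the GC correction is as a partial correlation: each density is regressed on GC content, and the residuals are then correlated. The following is a minimal sketch of that reading only. The thesis does not state the exact correction procedure, so the use of simple linear regression, the function names and the toy window values are all assumptions made for illustration.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals of the least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def gc_corrected_correlation(indel, snp, gc):
    """Correlate indel and SNP density after removing the linear GC effect."""
    return pearson(residuals(indel, gc), residuals(snp, gc))


# Hypothetical per-window values (NOT data from the thesis).
gc = [0.38, 0.40, 0.42, 0.45, 0.47, 0.50]
indel = [0.21, 0.19, 0.20, 0.16, 0.17, 0.13]
snp = [4.3, 3.9, 4.0, 3.3, 3.4, 2.8]

raw_r = pearson(indel, snp)
corrected_r = gc_corrected_correlation(indel, snp, gc)
```

If `corrected_r` remains clearly non-zero, GC content alone does not account for the covariation, which is the argument made for panel D.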


Paper III: Microsatellite polymorphism is elevated in human recombination hotspots

It has been a long-standing issue whether microsatellites simply represent “junk” DNA or if some function can be attributed to them (Kashi et al. 1997; Kashi and King 2006; Li et al. 2002; Li et al. 2004). Although most researchers have regarded them as neutrally evolving markers, there have recently been frequent reports of associations between microsatellite alleles and various phenotypic traits (Contente et al. 2002; Curi et al. 2005; Hammock and Young 2005; Uhlemann et al. 2004), supporting the idea that microsatellites are involved in gene regulation (Hamada et al. 1984; Li et al. 2002; Naylor and Clark 1990; Santoro et al. 1984; Struhl 1985).

Another aspect of the potential function of microsatellites is evidence for an association of microsatellites and recombination. For instance, in mammals, an association between microsatellite occurrence and regions of high recombination rate has been seen (Isobe et al. 2002; Jensen-Seaman et al. 2004; Kong et al. 2002; Myers et al. 2005; Schultes and Szostak 1991; Templeton et al. 2000). However, microsatellite variation has not been found to be associated with recombination rates in humans (Huang et al. 2002; Payseur and Nachman 2000).

Recombination is concentrated in narrow hotspots of a few kb in length (Jeffreys et al. 2001; Jeffreys et al. 1998; McVean et al. 2004; Myers et al. 2005). If there is an association between microsatellites and recombination, we might expect to see the strongest signals of such an association within the hotspots. In this study we investigate the relationship between recombination and microsatellites by using data on the location of recombination hotspots and genome-wide information on microsatellite occurrence and polymorphism.

Results and discussion

If recombination promotes microsatellite growth, or if microsatellites stimulate recombination, we should expect to see an increased microsatellite density in recombination hotspots. We mined the human genome for perfect microsatellites using a modified version of sputnik (Morgante et al. 2002) and found, indeed, that microsatellites are both longer and more common in recombination hotspots. However, the biological relevance of these differences is not obvious, as the differences are very small. In contrast to this result, microsatellites are more than twice as frequent in Saccharomyces cerevisiae recombination hotspots compared to non-hot regions (Bagshaw and Gemmel, unpublished). Moreover, on broad scales, the association between simple sequences and recombination is quite marked (Jensen-Seaman et al. 2004; Kong et al. 2002). A possible explanation for the contrasting results on different scales is the ephemeral nature of human hotspots (Ptak et al. 2005; Winckler et al. 2005), i.e. that they do not last long enough to exert a strong effect on local microsatellite occurrence. On a broad scale, recombination is more constant and an association is more likely to be seen.
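To make "perfect microsatellite" concrete, here is a small regular-expression sketch that reports uninterrupted tandem repeats. It is not the sputnik algorithm and not the modified version used in the study. The function name and the default thresholds (repeat units of 2 to 5 bp, at least 3 units) are assumptions chosen for illustration.

```python
import re


def find_perfect_microsatellites(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return (start, end, unit, n_repeats) tuples; end is exclusive."""
    seq = seq.upper()
    # A lazily matched unit followed by at least (min_repeats - 1) exact copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        # Skip units that are repeats of a shorter motif (e.g. "AA"),
        # so that mononucleotide runs are not reported as dinucleotides.
        reducible = any(
            unit == unit[:k] * (len(unit) // k)
            for k in range(1, len(unit))
            if len(unit) % k == 0
        )
        if reducible:
            continue
        n_repeats = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, n_repeats))
    return hits


# Toy sequence: an (AT)4 repeat, a run of eight A, and a (CAG)3 repeat.
example = "GGC" + "AT" * 4 + "GGCC" + "A" * 8 + "GT" + "CAG" * 3 + "T"
hits = find_perfect_microsatellites(example)
```

Because the search is leftmost-first, the reported repeat phase (e.g. AT versus TA) depends on where the scan first finds a match. A real analysis would also need to handle both strands and ambiguous bases.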

To analyse the relationship between recombination and microsatellite mutation we turned to the Allele FREquency Database (ALFRED) (Rajeevan et al. 2003). We obtained three mutability estimates for the 282 microsatellites we found in the database: allele span, as the difference between longest and shortest allele; number of alleles; and heterozygosity. None of these estimates deviated significantly between microsatellites in hotspots and the genome-wide expectation. However, this result could be a methodological artefact. Most markers found in allele frequency databases are likely to have been developed for population screening or linkage mapping, and thus initially chosen on the basis of known high heterozygosity (Selkoe and Toonen 2006). Such an ascertainment bias (see Ellegren et al. 1997; Ellegren et al. 1995) could mask an underlying difference between regions of the genome.
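The three mutability estimates can be computed from a table of allele lengths and frequencies for one locus, as in the sketch below. The input layout and function name are hypothetical, and heterozygosity is taken here as expected heterozygosity (one minus the sum of squared allele frequencies), which is an assumption since the text does not say which definition was used.

```python
def mutability_estimates(allele_freqs):
    """allele_freqs maps allele length to its population frequency."""
    lengths = [length for length, freq in allele_freqs.items() if freq > 0]
    total = sum(allele_freqs[length] for length in lengths)
    # Normalise in case the frequencies do not sum exactly to one.
    freqs = [allele_freqs[length] / total for length in lengths]
    return {
        "allele_span": max(lengths) - min(lengths),
        "n_alleles": len(lengths),
        "heterozygosity": 1.0 - sum(p * p for p in freqs),
    }


# Hypothetical locus with three alleles of 10, 12 and 15 repeat units.
estimates = mutability_estimates({10: 0.5, 12: 0.25, 15: 0.25})
```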

To overcome the problem of possibly biased markers, we turned to a set of about 44,000 length polymorphisms within tandem repeats (Mills et al. 2006). These were identified through the reanalysis of shotgun sequencing initiatives. Using this data set we saw a significant 14% increase in microsatellite polymorphism in hotspots (Table 3). There are at least three possible explanations for this increase in microsatellite polymorphism: polymorphic microsatellites promote recombination; recombination is mutagenic to microsatellites; and/or recombination reduces the effect of selection at linked sites.

Table 3. Polymorphism data for microsatellites, SNPs and indels in recombination hotspots compared to genomic expectations. Ninety-five percent confidence intervals were estimated using a resampling test with 1,000 replicates.

                                        Hotspots   Genomic expectation                                 p-value
                                                   Lower 95% CI   Resampling median   Upper 95% CI
Number of polymorphic microsatellites   15,059     12,172         12,288              12,363         p < 0.001
Proportion polymorphic microsatellites  0.0136     0.0116         0.0119              0.0122         p < 0.001
Number of SNPs                          442,015    346,640        355,125             384,772        p < 0.001
Number of indels                        22,218     18,653         19,051              19,477         p < 0.001
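A resampling test of the kind summarised in Table 3 could look roughly like the sketch below: draw as many background regions as there are hotspots, sum their counts, repeat, and read the interval off the sorted totals. The sampling unit, the percentile indexing and the one-sided p-value formula are all assumptions, since the thesis summary gives only the number of replicates.

```python
import random


def resampling_interval(background_counts, n_regions, observed,
                        replicates=1000, seed=0):
    """Return (lower, median, upper, p_value) for a resampled total.

    background_counts: per-region counts across the genome (hypothetical input)
    n_regions: number of regions drawn per replicate, e.g. number of hotspots
    observed: the total actually seen in hotspots
    """
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.sample(background_counts, n_regions))
        for _ in range(replicates)
    )
    lower = totals[int(0.025 * replicates)]
    median = totals[replicates // 2]
    upper = totals[int(0.975 * replicates) - 1]
    # One-sided empirical p-value with the usual +1 correction.
    n_extreme = sum(1 for t in totals if t >= observed)
    p_value = (n_extreme + 1) / (replicates + 1)
    return lower, median, upper, p_value


# Hypothetical background: 500 regions with counts 0..4.
background = [i % 5 for i in range(500)]
lower, median, upper, p_value = resampling_interval(background, 50, observed=200)
```

With 1,000 replicates, an observed value exceeding every resampled total gives p = 1/1001, which would be reported as p < 0.001.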

There are several examples of long microsatellites that have been identified within recombination hotspots (Cullen et al. 2002; Majewski and Ott 2000; Rana et al. 2004), and also examples of microsatellites affecting hotspot activity (Schultes and Szostak 1991). However, only a handful of hotspots have yet been thoroughly examined with regard to mechanism, and thus the involvement of microsatellites is still an open question. An interesting, yet speculative, possibility is that length variation as such promotes recombination through the decreased stability between sister chromatids (Kayser et al. 2006). On the other hand, there are several other forms of length differences which ought to have greater impact in this respect, e.g. minisatellites and segmental duplications (Jeffreys et al. 1999; Sharp et al. 2005).

There is some experimental evidence that recombination is mutagenic to microsatellites in Escherichia coli and S. cerevisiae (Gendrel et al. 2000; Hashem et al. 2004). However, there is conflicting evidence showing that recombination is not important for microsatellite mutation (Henderson and Petes 1992; Levinson and Gutman 1987). Most notable are the studies showing similar mutation rates on the non-recombining Y-chromosome as on autosomes (Ellegren 2004; Gusmao et al. 2005; Kayser et al. 2000).

The third explanation for increased levels of microsatellite polymorphism within recombination hotspots is the reduced effect of selection on linked sites. Correlations between genetic diversity and recombination rate have been seen in many species (Andolfatto and Przeworski 2001; Aquadro et al. 1994; Begun and Aquadro 1992; Nachman 1997; Nachman et al. 1998). Such correlations could be due to background selection (Charlesworth 1994; Charlesworth et al. 1995; Hudson 1994; Hudson and Kaplan 1995; Nordborg et al. 1996) and/or genetic hitchhiking (Kaplan et al. 1989; Maynard Smith and Haigh 1974). However, previous studies have not found any association between local recombination rate and microsatellite diversity (Huang et al. 2002; Payseur and Nachman 2000; Thomas et al. 2005; Yu et al. 2001). In contrast to this, we found a rather strong association between microsatellite diversity and recombination.

If we assume that this selectionist explanation is correct, we would expect other kinds of polymorphism to follow the same patterns. To explore this, the same data set was used to look at non-microsatellite indels as well as SNPs. For these kinds of polymorphism, too, we saw significant increases, of similar magnitude to that for microsatellites (Table 3). This result supports an explanation where the increased levels of recombination in hotspots decouple selection, thus allowing increased levels of polymorphism. However, it does not rule out the alternative explanation of recombination being mutagenic to all three kinds of polymorphism: SNPs, indels and microsatellites.

Paper IV: A genome-wide analysis of microsatellite polymorphism circumventing the ascertainment bias

A number of models have been suggested to describe the mutational patterns of microsatellites. These range from the application of the simple stepwise mutation model to much more complex models involving various parameters, including point mutations and length-dependent mutation patterns (Bell 1996; Bell and Jurka 1997; Cornuet et al. 2006; Di Rienzo et al. 1994; Dieringer and Schlötterer 2003; Kruglyak et al. 1998). To test the validity of these models a number of empirical approaches have been taken. For instance, de novo mutations have been observed directly in pedigrees (Ellegren 2000; Huang et al. 2002; Weber and Wong 1993). However, any approach based on already characterised microsatellite markers will suffer from ascertainment bias. This is due to the way researchers have initially selected the markers, a process that involves several screening steps aimed at finding markers with high levels of heterozygosity and many alleles. Such markers are likely biased towards long microsatellites with high mutation rates (Pardi et al. 2005). Whole genome sequence surveys provide a means of testing microsatellite distributions against theoretical models (Dieringer and Schlötterer 2003). This approach will, however, provide a snapshot of the current microsatellite distribution, rather than capturing the ongoing processes of microsatellite evolution. This can to some extent be remedied by comparing two or more genomes (Sainudiin et al. 2004). However, for an accurate description of the ongoing processes, unbiased mutation data are needed.

Some of the genome sequencing projects have been complemented with separate screenings for polymorphisms, through re-sequencing or light shotgun sequencing of a number of unrelated individuals (International HapMap Consortium 2005; Lindblad-Toh et al. 2005). One of the largest efforts of this kind is the light shotgun sequencing of three unrelated domestic chicken strains (0.25 X coverage each), done in conjunction with the sequencing of the chicken genome (International Chicken Genome Sequencing Consortium 2004). The traces generated from these projects provide an opportunity to study microsatellite polymorphism without ascertainment bias.

In this study we use the traces from the chicken polymorphism project to obtain an unbiased set of microsatellites. An important aspect of the data is that in addition to polymorphism information, we also get information on monomorphic loci.

Results and discussion

We identified 390,000 perfect di- through pentanucleotide repeats of at least 3 repeat units, where two chromosomes were sampled. Among these, 1.8% were polymorphic. Microsatellite mutation has previously been shown to be length dependent, with higher heterozygosity seen for longer microsatellites (Brinkmann et al. 1998; Ellegren 2000). We found a clear pattern of longer microsatellites being more polymorphic (Figure 4). To examine in more detail what factors affect the mutability of microsatellites, we fitted a logistic regression model with presence of polymorphism as a function of microsatellite length, length of repeat unit, repeat unit motif and repeat unit GC content. We found a significant effect of all parameters. In general, longer microsatellites are more polymorphic, and so are AT-rich microsatellites.
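As an illustration of the modelling idea, the sketch below fits a logistic regression with a single predictor (microsatellite length in repeat units) by Newton-Raphson. The model in the study used several predictors and presumably a statistics package. The function names and the toy data are hypothetical.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x); returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Gradient (g) and information matrix (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: 10 loci per length class, k of them polymorphic.
polymorphic_per_length = {3: 1, 4: 2, 5: 3, 6: 5, 7: 7, 8: 8}
xs, ys = [], []
for length, k in polymorphic_per_length.items():
    xs += [length] * 10
    ys += [1] * k + [0] * (10 - k)

b0, b1 = fit_logistic(xs, ys)
```

A positive `b1` corresponds to the reported pattern that longer microsatellites are more likely to be polymorphic.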

Figure 4. Microsatellite heterozygosity for different repeat lengths. A) all motifs and B) dinucleotide microsatellites.

When microsatellites are typed using fragment length separation methods, there is a risk of size homoplasy, i.e. two alleles have the same length, but different sequences. We noted that 6% of the microsatellites showed no length difference, but a sequence difference. Furthermore, adding presence of imperfections to the logistic regression model showed that microsatellites with imperfections need to be longer to be polymorphic. This fits well with models of microsatellite mutation suggesting that point mutations in microsatellites will have a stabilising effect (Bell 1996; Bell and Jurka 1997; Kruglyak et al. 1998).

The patterns of polymorphism are heterogeneous among chromosomes, and also among microsatellite motifs. As we have previously shown a strong correlation between indel polymorphisms and SNPs (Paper II), we wanted to see if microsatellites follow the same patterns. We studied the variation in microsatellite variability across the chicken genome in 1 Mb windows, in relation to a number of contextual features of the windows. Microsatellite polymorphism was found to correlate strongly with both SNP and indel density. There are several possible reasons for such a correlation. Either similar mechanisms affect the mutation rates for all three kinds of polymorphism, or the correlation may be a result of varying levels of selection. The latter would be mediated by background selection (Charlesworth 1994; Charlesworth et al. 1995) or selective sweeps (Maynard Smith and Haigh 1974).

We studied the patterns of microsatellite occurrence in 1 Mb windows in the genome. We found significant variation in microsatellite occurrence among chromosomes. This is expected as G+C-content varies among the chromosomes (International Chicken Genome Sequencing Consortium 2004) and base composition is important in determining microsatellite occurrence (Dieringer and Schlötterer 2003; Toth et al. 2000). A more interesting result was that microsatellite occurrence is negatively correlated with SNP density. Although the explanatory value was rather low, this gives an important clue to the mutational mechanisms of microsatellites. Within-genome variation in mutation rates has been seen in a number of organisms (Berlin et al. 2006; Gaffney and Keightley 2005). Furthermore, point mutations have been suggested to be important in determining the fate of microsatellites (Bell 1996; Bell and Jurka 1997; Kruglyak et al. 1998). Thus, our observation supports the idea that point mutations are important in stopping microsatellite growth, provided that some of the variation in SNP density can be explained by mutation rate variation.

Paper V: ILAPlot: An Interactive Local Alignment Plotting Tool

There are a number of computer programs and web services available for plotting and exploration of results from homology search programs. A few examples are NIH BLAST viewer (www.ncbi.nlm.nih.gov/blast), PipMaker (Schwartz et al. 2000), VISTA (Dubchak et al. 2000; Mayor et al. 2000), Theatre (Edwards et al. 2003), SimiTri (Parkinson and Blaxter 2003), DiagHunter/GenoPix2D (Cannon et al. 2003), Dotter (Sonnhammer and Durbin 1995), Cross (http://comb5-156.umbi.umd.edu/poster/Cross/) and Laj (Wilson et al. 2001).

In the initial steps of comparative studies when, for instance, a fully sequenced BAC is to be aligned to a genome of another organism, BLAST (Altschul et al. 1997) can be used to find the correct placement. When the BLAST result is plotted like a dot-plot, a diagonal will be formed by hits spanning the region where the BAC aligns. However, few programs allow the user to export data aided by the plot. Laj (Wilson et al. 2001) provides a means of exporting sequence data, but only the alignments of the individual hits.

Results and discussion

To overcome the limitations of the available programs, ILAPlot was written. ILAPlot parses BLAST results and plots them similarly to a dot-plot (Figure 5). The program then lets the user explore the hits by zooming and selecting hits of interest. The most important functions of the program are provided by the different ways of exporting data aided by the selection of hits. There are four ways of exporting data provided by the program: 1) Export a list of start and end coordinates of all selected hits. 2) Export FASTA-formatted alignments of all selected hits. 3) Export coordinates of the region spanned by two selected hits. 4) Export FASTA-formatted sequence files with the subselections of the two sequences spanned by the selected hits. The fourth method allows for an easy way to select the part of a long sequence (e.g. a fully sequenced BAC) that matches a specific part of another long sequence (e.g. a chromosome) by just selecting the first and last hit of a diagonal in the plot. The sequences can later be rigorously aligned using a suitable alignment program.
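The idea behind export methods 3 and 4 can be sketched in a few lines: take the two selected hits, find the smallest coordinate range on each sequence that covers both, and cut out the corresponding subsequences. This is not ILAPlot's own code. The `Hit` record, the function names and the 1-based inclusive coordinate convention are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical record holding one hit's coordinates on the two sequences.
Hit = namedtuple("Hit", "q_start q_end s_start s_end")


def spanned_region(hit_a, hit_b):
    """Return ((q_lo, q_hi), (s_lo, s_hi)) covering both hits."""
    q = [hit_a.q_start, hit_a.q_end, hit_b.q_start, hit_b.q_end]
    s = [hit_a.s_start, hit_a.s_end, hit_b.s_start, hit_b.s_end]
    # min/max also copes with reverse-strand hits, where start > end.
    return (min(q), max(q)), (min(s), max(s))


def extract(seq, region):
    """Cut out a 1-based inclusive region from a sequence string."""
    lo, hi = region
    return seq[lo - 1:hi]


first = Hit(10, 50, 1000, 1040)
last = Hit(400, 450, 1390, 1440)
q_region, s_region = spanned_region(first, last)
```

The two extracted subsequences would then be passed to a separate alignment program, as the text describes.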

ILAPlot was used successfully for this purpose in Paper II in this thesis, where it reduced the time required for preparation of data drastically compared to manual parsing of BLAST results to find regions of orthology.

Figure 5. The plotting window of ILAPlot showing the diagonal formed by BLAST hits between two long sequences.


Summary in Swedish

When we see all the biological diversity that exists, it is hard to avoid the question of which processes lie behind it. There are differences between species, but also obvious differences within species. Ultimately it is the genome, and the interactions between it and the environment, that determine what each individual will look like. The fact that the genome has such a large influence has directed interest towards understanding how evolution proceeds at the DNA level. What does the genome look like? Which parts are important and which are unimportant? How does it mutate? Today we understand more and more of the processes that are part of the evolution of the genomes of all organisms. Many of the fundamental processes are universal among the organisms studied so far, from bacteria to vertebrates and plants.

The ability to read the DNA sequence of an organism was an important step towards a better understanding of how the genome evolves. The technology for this has advanced rapidly. Six years ago the human genome was sequenced. This gave a wealth of new insights into how the information in the genome is organised. One of the most important discoveries was that only about 2.5% of the genome consists of protein-coding genes, and that these numbered only just over 20,000. Previously it had been assumed that large parts of the genome consisted of protein-coding genes and that there might be up to 100,000 genes. Today the genomes of several species have been sequenced. We can therefore compare their DNA sequences and, from the differences and similarities, draw conclusions about which forces have, through evolutionary history, shaped the genomes we see today.

As the amount of data grows, new and efficient tools and methods are needed to handle it. These methods and tools are found in new and powerful computers and programs, technologies that are themselves developing at a rapid pace. The application of computers to manage and analyse biological data is usually called bioinformatics, and it is the technique used in the papers included in this thesis.

When only 2.5% of the human genome consists of protein-coding genes, what then is the rest, the non-coding part of the genome? As more DNA sequence was read, it turned out that the greater part of the genome consists of various forms of repeated sequences. The largest group is repeated but interspersed DNA sequences that in one way or another can copy themselves into new locations in the genome. One example of such sequences is the so-called LINE elements. They consist of a set of genes with everything needed to copy themselves to a new location.

Another category of the non-coding genome is the microsatellites. They are tandemly repeated sequences of motifs 1-5 bases long, for example ATATATATAT. What is interesting about this group of sequences is their mutation process. The most common mutation is the gain or loss of one or more repeats, and this happens perhaps a thousand times more often than mutations causing single base pair differences. Microsatellites make up on the order of a few percent of the genomes of most higher vertebrates.

To study evolution at the DNA sequence level, one can compare the genomes of different species. Since mutations occur all the time, two sequences with a common origin can be expected to have accumulated differences after some time. If, on the other hand, a region has an important function, any mutations will be harmful to the organism. This form of selection is usually called negative selection. The result is that there will be considerably fewer differences between species at those sites in the genome that have important functions. This has been used to estimate how large a proportion of the non-coding genome has important functions. Comparisons between mammals and, for example, chicken and pufferfish have shown that about 5% of the genome is conserved between the species. This implies that 3.5% of the genome is non-coding and conserved. We are beginning to gain some insight into what function this part has. Above all, it appears to contain regulatory systems for the transcription of genes.

There is a risk that the figure of 3.5% underestimates the proportion of the genome that is subject to negative selection. If there is turnover of the regions subject to negative selection, the proportion may in fact be considerably larger. Let us assume that 50% of the sequences subject to negative selection in mouse have been replaced since the common ancestor with human. Assume also that the same has happened in the lineage leading to human. This would mean that as much as 14% of the genome is subject to negative selection. In that case, 3.5% is an underestimate.
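The arithmetic behind the 14% figure can be written out as follows. The symbols are introduced here for illustration and are not from the thesis: f is the fraction of the genome under negative selection in each species (assumed equal in both), r is the fraction of constrained sequence replaced along each lineage (0.5 in the example), and turnover is assumed to be independent in the two lineages. Only sequence retained in both lineages is seen as conserved.

```latex
f_{\text{shared}} = f\,(1-r)^{2}
\quad\Longrightarrow\quad
f = \frac{f_{\text{shared}}}{(1-r)^{2}}
  = \frac{0.035}{(1-0.5)^{2}}
  = 0.14
```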

In Paper I of this thesis we studied the DNA sequence in an alignment of almost 2 megabases of sequence from eight mammals. If there is turnover of sequences under negative selection, it could be described by a model in which the proportion of conserved non-coding DNA declines exponentially over time. We took all pairwise comparisons in the alignment and plotted the proportion of conserved non-coding DNA against the time since the species diverged. We found a significant relationship, and extrapolating to the present suggests that just over 10% of the genome is subject to negative selection. Since this is an extrapolation on a logarithmic scale, the estimate of the proportion of DNA under negative selection is uncertain. However, with this result it is fairly clear that there is turnover in which sequences are subject to negative selection.
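An exponential decline of this kind can be fitted as a straight line on a logarithmic scale, with the intercept giving the extrapolated proportion at zero divergence time. The sketch below shows that calculation under the assumption of ordinary least squares on log-transformed proportions. The fitting procedure actually used in Paper I is not described here, and the function name and numbers are hypothetical.

```python
from math import exp, log


def fit_exponential_decay(times, proportions):
    """Fit ln(p) = ln(p0) - k * t by least squares; returns (p0, k)."""
    logs = [log(p) for p in proportions]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    slope = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    intercept = mean_l - slope * mean_t
    # p0 is the extrapolated proportion at divergence time zero.
    return exp(intercept), -slope


# Hypothetical divergence times and conserved proportions (not thesis data),
# generated from p0 = 0.1 and k = 0.005 so the fit can be checked.
times = [20.0, 40.0, 80.0, 160.0]
proportions = [0.1 * exp(-0.005 * t) for t in times]
p0, k = fit_exponential_decay(times, proportions)
```

Because the intercept is exponentiated, small errors in the fitted line translate into large errors in `p0`, which is the uncertainty the text points to.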


Another important aspect of evolution is the variation we see within species at the DNA sequence level. Many types of variation occur: everything from single bases that differ, to structural variations where parts of chromosomes are inverted or even missing. The most classical form of variation studied is single nucleotide polymorphisms (SNPs), sequence differences where single bases differ. Another important form of variation is gains or losses of one or more bases, which give variation in the length of the sequence. A clear example is microsatellite variation, which is of exactly this type. Such variation that is not tied to microsatellites is also common.

Two main factors affect the occurrence of variation in the genome: the mutation rate and the presence of selection. Genetic drift alone determines the fate of neutral mutations. This means that it is the random sampling in the transmission from generation to generation that determines the frequency of a given allele, and which allele eventually remains. Consequently, only the mutation rate determines how much variation is seen in a neutral region. If, on the other hand, a new mutation is harmful to the individual, selection prevents it from reaching a high frequency in a population. In connection with this so-called negative selection, linked neutral variants will also disappear. The result is that the overall variation is lower around a site in the genome that is subject to negative selection. This phenomenon is called background selection. The other form of selection, positive selection, means that a new allele is advantageous to the individual. If such a variant arises, it will rapidly increase in frequency in the population. At the same time it drags the linked variation along with it. This phenomenon, too, gives rise to reduced variation around the selected site in the genome. A central question thus becomes whether the variation we see is due to selection or mutation. There are clear signs that both factors play a role.

Papers II-IV in this thesis focus on length variation in DNA sequence. In Paper II we studied short length variants that are not tandem repeats in the genome of the domestic chicken. The work is based on a light shotgun sequencing of three chicken breeds that was carried out at the university in Beijing. The aim was to map polymorphism in the genome of the domestic chicken. The advantage of studies based on light shotgun sequencing is that one can study several different kinds of variation and is not limited to SNPs. The non-repetitive length variants occur about one twentieth as often as SNPs. A large proportion of them turned out to be exact copies of the adjacent bases. This suggests that an important mutational mechanism for them is that the DNA polymerase "slips" forwards or backwards along the sequence, thereby removing or adding a repeat. That is, the same mechanism that causes mutations in microsatellites.

Microsatellites are the focus of Papers III and IV. In Paper III we studied links between microsatellite mutation and recombination. Although the prevailing view today is that recombination is not mutagenic to microsatellites, such a possibility exists. It is mainly in in vitro experiments that signs of this have been seen. On the other hand, other experiments have shown that microsatellites can act as a recombination signal. In humans, most recombination events take place in short regions, often a few kilobases long, separated by long regions with low recombination. We compared these active regions with the rest of the genome and found some increase in microsatellite density, but also a 14% increase in the number of microsatellite polymorphisms. We also noted a corresponding increase in the level of polymorphism for length variants that are not part of microsatellites and for SNPs. The latter suggests that effects of selection in combination with recombination account for the variation we see, rather than recombination being mutagenic. Perhaps the main reason is that the increased recombination reduces the effect of genetic linkage, so that increased variation is seen. Another possibility is that some common factor affects the mutation rate for all three types of polymorphism: microsatellites, length variants and SNPs.

In Paper IV we examined the occurrence and level of variation of microsatellites in the chicken genome. For this study, too, we used raw data from the same light shotgun sequencing as for the non-repetitive length variants. The most common method so far for studying microsatellite evolution has been to look at mutation patterns in already known microsatellite markers. These were often originally characterised for use in linkage analyses or population genetic studies. An important criterion for these applications is that markers have a high information content, which they have if they have a high mutation rate and hence many variants. This entails an obvious risk that the conclusions drawn about microsatellite mutation and evolution describe only a small subset of microsatellites with high mutation rates, and not microsatellites in general. Our approach minimises the risk of misrepresentation in the data set. We found that the length of a microsatellite strongly affects its level of polymorphism. Long microsatellites are more polymorphic than short ones. It appears to be the number of repeats in a microsatellite, rather than the length in base pairs, that is decisive for mutability. In addition, AT-rich microsatellites are more polymorphic than GC-rich ones. These results agree well with the theory that microsatellites mutate through the polymerase slipping forwards or backwards along the DNA strand being replicated, thereby adding or removing a unit. In chicken we also found a strong relationship between the levels of polymorphism for microsatellites, length variants and SNPs. We saw that the polymorphism types covaried in megabase-sized segments of the genome. This strengthens the result from Paper III, that a large part of the variation in polymorphism level is due to the combined effect of selection and recombination.

The last paper in this thesis (Paper V) describes a program, ILAPlot, that visualises the results from the homology search program BLAST. BLAST is often used to find regions that are similar owing to shared ancestry. ILAPlot marks all hits between two sequences on a two-dimensional canvas where the x and y axes are coordinates within the two compared sequences. If the two sequences are highly similar, this shows up as a diagonal on the canvas. The program has functions for cutting out subsequences from the original sequences based on which hits have been selected on the canvas. This facilitates studies in which subsequences are to be cut out of, for example, whole chromosomes.
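The idea of treating each hit as a segment in a coordinate plane, and cutting out a subsequence from a selection of hits, can be sketched as follows. This is not ILAPlot's code or its input format; it assumes hits supplied in the 12-column tab-separated layout that BLAST+ writes with `-outfmt 6`, and the names `Hit`, `parse_tabular`, `hits_in_box` and `cut_query` are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One hit as a segment from (qstart, sstart) to (qend, send)."""
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_tabular(text):
    """Read hits from 12-column tab-separated BLAST output.

    Columns 7-10 (1-based) are taken as qstart, qend, sstart, send.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hits.append(Hit(*(int(v) for v in fields[6:10])))
    return hits


def hits_in_box(hits, xmin, xmax, ymin, ymax):
    """Select hits lying entirely inside a rectangle on the plot,
    with the query on the x axis and the subject on the y axis."""
    return [
        h for h in hits
        if xmin <= min(h.qstart, h.qend) and max(h.qstart, h.qend) <= xmax
        and ymin <= min(h.sstart, h.send) and max(h.sstart, h.send) <= ymax
    ]


def cut_query(seq, selected):
    """Cut out the part of the query sequence spanned by the selected hits."""
    lo = min(min(h.qstart, h.qend) for h in selected)
    hi = max(max(h.qstart, h.qend) for h in selected)
    return seq[lo - 1:hi]  # BLAST coordinates are 1-based and inclusive
```

A hit on the reverse strand has `sstart` greater than `send`, which is why the selection compares minima and maxima rather than raw start and end values.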

The results presented in this thesis contribute to an increased understanding of the processes that shape the genome.


Acknowledgement

I know most of you will read this part of my thesis among the first things you read… Feel free to do so. There are a number of people to whom I would like to express my gratitude:

Hans Ellegren, for accepting me as a PhD student and letting me be part of your research group. It is a very inspiring environment with ambitious projects. I have learnt a lot during these years.

Erik Axelsson, for many interesting discussions on research (and other things). The best way to learn is to discuss and test new ideas.

Sofia Berlin, for good discussions on various topics, and also for good comments on parts of this thesis.

Nick Smith, who left us too early, for asking me what my null hypothesis is. I always keep that in mind now.

Niclas Backström: no, I have not caught any pike this weekend.

Matthew Webster, Judith Mank, Niclas Backström, Erik Axelsson, Sofia Berlin, Lina Hultin-Rosenberg, and other (former) members of the molecular evolution research group. You have provided an inspiring environment for me to be a part of with many exciting discussions of new and ongoing projects. It is a shame we couldn’t clone ourselves to pursue some more of the new ideas that always pop up.

My former room-mates, Cia Anderung, Erik Axelsson and Frank Hailer. Odd that the box for bad jokes never was filled. It must imply the jokes were good…

All other people that are part of the department of Evolutionary Biology. It has been a great environment to work in.

Louise Durling, for good comments on my texts, but also for telling me that I have to do other things than work in my spare time.

My parents, for always believing in me. I have a feeling that you were fairly sure I would end up like this. Hopefully I will have some more spare time now to fix some of the stuff I have intended to.

The Sven and Lilly Lawskis fond and Stiftelsen B. von Beskows fond are gratefully acknowledged for financial support.

Now that you have read this, there are a lot more interesting things to read in this book…


References

Altschul, S.F., T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.

Andolfatto, P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149-1152.

Andolfatto, P. and M. Przeworski. 2001. Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 158: 657-665.

Aquadro, C.F., D.J. Begun, and E.C. Kindahl. 1994. Selection, recombination, and DNA polymorphism in Drosophila. In Non-Neutral Evolution: Theories and Molecular Data (ed. B. Golding), pp. 46-56. Chapman and Hall, London.

Bailey, J.A., Z. Gu, R.A. Clark, K. Reinert, R.V. Samonte, S. Schwartz, M.D. Adams, E.W. Myers, P.W. Li, and E.E. Eichler. 2002. Recent Segmental Duplications in the Human Genome. Science 297: 1003-1007.

Begun, D.J. and C.F. Aquadro. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356: 519-520.

Bell, G.I. 1996. Evolution of simple sequence repeats. Comput Chem 20: 41-48.

Bell, G.I. and J. Jurka. 1997. The length distribution of perfect dimer repetitive DNA is consistent with its evolution by an unbiased single-step mutation process. J Mol Evol 44: 414-421.

Berg, I., R. Neumann, H. Cederberg, U. Rannug, and A.J. Jeffreys. 2003. Two modes of germline instability at human minisatellite MS1 (locus D1S7): complex rearrangements and paradoxical hyperdeletion. Am J Hum Genet 72: 1436-1447.

Bergman, C.M. and M. Kreitman. 2001. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 11: 1335-1345.

Berlin, S., M. Brandstrom, N. Backstrom, E. Axelsson, N.G. Smith, and H. Ellegren. 2006. Substitution rate heterogeneity and the male mutation bias. J Mol Evol 62: 226-233.

Bhangale, T.R., M.J. Rieder, R.J. Livingston, and D.A. Nickerson. 2005. Comprehensive identification and characterization of diallelic insertion-deletion polymorphisms in 330 human candidate genes. Hum Mol Genet 14: 59-69.

Bierne, H., S.D. Ehrlich, and B. Michel. 1997. Deletions at stalled replication forks occur by two different pathways. Embo J 16: 3332-3340.

Bierne, H. and B. Michel. 1994. When replication forks stop. Mol Microbiol 13: 17-23.

Boffelli, D., J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.

Bois, P.R. 2003. Hypermutable minisatellites, a human affair? Genomics 81: 349-355.

Bray, N. and L. Pachter. 2003. MAVID multiple alignment server. Nucleic Acids Res 31: 3525-3526.

Bray, N. and L. Pachter. 2004. MAVID: constrained ancestral alignment of multiple sequences. Genome Res 14: 693-699.

Brinkmann, B., M. Klintschar, F. Neuhuber, J. Huhne, and B. Rolf. 1998. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am J Hum Genet 62: 1408-1415.

Britten, R.J. 2002. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc Natl Acad Sci U S A 99: 13633-13635.

Britten, R.J., L. Rowen, J. Williams, and R.A. Cameron. 2003. Majority of divergence between closely related DNA samples is due to indels. Proc Natl Acad Sci U S A 100: 4661-4665.

Brudno, M., C.B. Do, G.M. Cooper, M.F. Kim, E. Davydov, E.D. Green, A. Sidow, and S. Batzoglou. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13: 721-731.

Buschiazzo, E. and N.J. Gemmell. 2006. The rise, fall and renaissance of microsatellites in eukaryotic genomes. Bioessays 28: 1040-1050.

Calabrese, P.P., R.T. Durrett, and C.F. Aquadro. 2001. Dynamics of microsatellite divergence under stepwise mutation and proportional slippage/point mutation models. Genetics 159: 839-852.

Cannon, S.B., A. Kozik, B. Chan, R. Michelmore, and N.D. Young. 2003. DiagHunter and GenoPix2D: programs for genomic comparisons, large-scale homology discovery and visualization. Genome Biol 4: R68.

Charlebois, R.L. and A. St Jean. 1995. Supercoiling and map stability in the bacterial chromosome. J Mol Evol 41: 15-23.

Charlesworth, B. 1994. The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet Res 63: 213-227.

Charlesworth, B., M.T. Morgan, and D. Charlesworth. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134: 1289-1303.

Charlesworth, D., B. Charlesworth, and M.T. Morgan. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141: 1619-1632.

Chen, F.C., C.J. Chen, W.H. Li, and T.J. Chuang. 2007. Human-specific insertions and deletions inferred from mammalian genome sequences. Genome Res 17: 16-22.

Chen, J.M., N. Chuzhanova, P.D. Stenson, C. Ferec, and D.N. Cooper. 2005a. Meta-analysis of gross insertions causing human genetic disease: novel mutational mechanisms and the role of replication slippage. Hum Mutat 25: 207-221.

Chen, J.M., P.D. Stenson, D.N. Cooper, and C. Ferec. 2005b. A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet 117: 411-427.

Chimpanzee Sequencing and Analysis Consortium. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69-87.

Clark, A.G. 2001. The search for meaning in noncoding DNA. Genome Res 11: 1319-1320.

Comeron, J.M. and M. Kreitman. 2000. The correlation between intron length and recombination in Drosophila. Dynamic equilibrium between mutational and selective forces. Genetics 156: 1175-1190.

Contente, A., A. Dittmer, M.C. Koch, J. Roth, and M. Dobbelstein. 2002. A polymorphic microsatellite that mediates induction of PIG3 by p53. Nat Genet 30: 315-320.

Cooper, G.M., M. Brudno, E.D. Green, S. Batzoglou, and A. Sidow. 2003. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res 13: 813-820.

Cooper, G.M., M. Brudno, E.A. Stone, I. Dubchak, S. Batzoglou, and A. Sidow. 2004. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res 14: 539-548.

Corneo, G., E. Ginelli, and E. Polli. 1967. A Satellite DNA isolated from human tissue. J Mol Biol 23: 619-622.

Cornuet, J.M., M.A. Beaumont, A. Estoup, and M. Solignac. 2006. Inference on microsatellite mutation processes in the invasive mite, Varroa destructor, using reversible jump Markov chain Monte Carlo. Theor Popul Biol 69: 129-144.

Cullen, M., S.P. Perfetto, W. Klitz, G. Nelson, and M. Carrington. 2002. High-resolution patterns of meiotic recombination across the human major histocompatibility complex. Am J Hum Genet 71: 759-776.

Curi, R.A., H.N. Oliveira, A.C. Silveira, and C.R. Lopes. 2005. Effects of polymorphic microsatellites in the regulatory region of IGF1 and GHR on growth and carcass traits in beef cattle. Anim Genet 36: 58-62.


Dermitzakis, E.T. and A.G. Clark. 2002. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol 19: 1114-1121.

Di Rienzo, A., A.C. Peterson, J.C. Garza, A.M. Valdes, M. Slatkin, and N.B. Freimer. 1994. Mutational processes of simple-sequence repeat loci in human populations. Proc Natl Acad Sci U S A 91: 3166-3170.

Dieringer, D. and C. Schlötterer. 2003. Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res 13: 2242-2251.

Drake, J.A., C. Bird, J. Nemesh, D.J. Thomas, C. Newton-Cheh, A. Reymond, L. Excoffier, H. Attar, S.E. Antonarakis, E.T. Dermitzakis, and J.N. Hirschhorn. 2006. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat Genet 38: 223-227.

Dubchak, I., M. Brudno, G.G. Loots, L. Pachter, C. Mayor, E.M. Rubin, and K.A. Frazer. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res 10: 1304-1306.

Eddy, S.R. 1999. Noncoding RNA genes. Curr Opin Genet Dev 9: 695-699.

Edwards, Y.J., T.J. Carver, T. Vavouri, M. Frith, M.J. Bishop, and G. Elgar. 2003. Theatre: A software tool for detailed comparative analysis and visualization of genomic sequence. Nucleic Acids Res 31: 3510-3517.

Ellegren, H. 2000. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat Genet 24: 400-402.

Ellegren, H. 2004. Microsatellites: simple sequences with complex evolution. Nat Rev Genet 5: 435-445.

Ellegren, H., S. Moore, N. Robinson, K. Byrne, W. Ward, and B.C. Sheldon. 1997. Microsatellite evolution--a reciprocal study of repeat lengths at homologous loci in cattle and sheep. Mol Biol Evol 14: 854-860.

Ellegren, H., C.R. Primmer, and B.C. Sheldon. 1995. Microsatellite 'evolution': directionality or bias? Nat Genet 11: 360-362.

Ellegren, H., N.G. Smith, and M.T. Webster. 2003. Mutation rate variation in the mammalian genome. Curr Opin Genet Dev 13: 562-568.

Elnitski, L., R.C. Hardison, J. Li, S. Yang, D. Kolbe, P. Eswara, M.J. O'Connor, S. Schwartz, W. Miller, and F. Chiaromonte. 2003. Distinguishing regulatory DNA from neutral sites. Genome Res 13: 64-72.

Eyre-Walker, A. 2006. The genomic rate of adaptive evolution. Trends Ecol Evol 21: 569-575.

Fain, M.G. and P. Houde. 2004. Parallel radiations in the primary clades of birds. Evolution Int J Org Evolution 58: 2558-2573.

Filipski, A. and S. Kumar. 2005. Comparative Genomics in Eukaryotes. In The Evolution of the Genome (ed. T.R. Gregory), pp. 521-583. Elsevier Academic Press, Burlington, MA.

Finnegan, D.J. 1989. Eukaryotic transposable elements and genome evolution. Trends Genet 5: 103-107.

Gaffney, D.J. and P.D. Keightley. 2005. The scale of mutational variation in the murid genome. Genome Res 15: 1086-1094.

Galtier, N. and L. Duret. 2007. Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends Genet 23: 273-277.

Gendrel, C.G., A. Boulet, and M. Dutreix. 2000. (CA/GT)(n) microsatellites affect homologous recombination during yeast meiosis. Genes Dev 14: 1261-1268.

Gibbs, R.A. G.M. Weinstock M.L. Metzker D.M. Muzny E.J. Sodergren S. Scherer G. Scott D. Steffen K.C. Worley P.E. Burch G. Okwuonu S. Hines L. Lewis C. DeRamo O. Delgado S. Dugan-Rocha G. Miner M. Morgan A. Hawes R. Gill Celera R.A. Holt M.D. Adams P.G. Amanatides H. Baden-Tillson M. Barnstead S. Chin C.A. Evans S. Ferriera C. Fosler A. Glodek Z. Gu D. Jennings C.L. Kraft T. Nguyen C.M. Pfannkoch C. Sitter G.G. Sutton J.C. Venter T. Woodage D. Smith H.M. Lee E. Gustafson P. Cahill A. Kana L. Doucette-Stamm K. Weinstock K. Fechtel R.B. Weiss D.M. Dunn E.D. Green R.W. Blakesley G.G. Bouffard P.J. De Jong K. Osoegawa B. Zhu M. Marra J. Schein I. Bosdet C. Fjell S. Jones M. Krzywinski C. Mathewson A. Siddiqui N. Wye J. McPherson S. Zhao C.M. Fraser J. Shetty S. Shatsman K. Geer Y. Chen S. Abramzon W.C. Nierman P.H. Havlak R. Chen K.J. Durbin A. Egan Y. Ren X.Z. Song B. Li Y. Liu X. Qin S. Cawley K.C. Worley A.J. Cooney L.M. D'Souza K. Martin J.Q. Wu M.L. Gonzalez-Garay A.R. Jackson K.J. Kalafus M.P. McLeod A. Milosavljevic D. Virk A. Volkov D.A. Wheeler Z. Zhang J.A. Bailey E.E. Eichler E. Tuzun E. Birney E. Mongin A. Ureta-Vidal C. Woodwark E. Zdobnov P. Bork M. Suyama D. Torrents M. Alexandersson B.J. Trask J.M. Young H. Huang H. Wang H. Xing S. Daniels D. Gietzen J. Schmidt K. Stevens U. Vitt J. Wingrove F. Camara M. Mar Alba J.F. Abril R. Guigo A. Smit I. Dubchak E.M. Rubin O. Couronne A. Poliakov N. Hubner D. Ganten C. Goesele O. Hummel T. Kreitler Y.A. Lee J. Monti H. Schulz H. Zimdahl H. Himmelbauer H. Lehrach H.J. Jacob S. Bromberg J. Gullings-Handley M.I. Jensen-Seaman A.E. Kwitek J. Lazar D. Pasko P.J. Tonellato S. Twigger C.P. Ponting J.M. Duarte S. Rice L. Goodstadt S.A. Beatson R.D. Emes E.E. Winter C. Webber P. Brandt G. Nyakatura M. Adetobi F. Chiaromonte L. Elnitski P. Eswara R.C. Hardison M. Hou D. Kolbe K. Makova W. Miller A. Nekrutenko C. Riemer S. Schwartz J. Taylor S. Yang Y. Zhang K. Lindpaintner T.D. Andrews M. Caccamo M. Clamp L. Clarke V. Curwen R. Durbin E. Eyras S.M. Searle G.M. Cooper S. Batzoglou M. Brudno A. Sidow E.A. Stone J.C. Venter B.A. Payseur G. Bourque C. Lopez-Otin X.S. Puente K. Chakrabarti S. Chatterji C. Dewey L. Pachter N. Bray V.B. Yap A. Caspi G. Tesler P.A. Pevzner D. Haussler K.M. Roskin R. Baertsch H. Clawson T.S. Furey A.S. Hinrichs D. Karolchik W.J. Kent K.R. Rosenbloom H. Trumbower M. Weirauch D.N. Cooper P.D. Stenson B. Ma M. Brent M. Arumugam D. Shteynberg R.R. Copley M.S. Taylor H. Riethman U. Mudunuri J. Peterson M. Guyer A. Felsenfeld S. Old S. Mockrin and F. Collins. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493-521.

Glazko, G.V., E.V. Koonin, I.B. Rogozin, and S.A. Shabalina. 2003. A significant fraction of conserved noncoding DNA in human and mouse consists of predicted matrix attachment regions. Trends Genet 19: 119-124.

Gregory, T.R. 2003. Is small indel bias a determinant of genome size? Trends Genet 19: 485-488.

Gregory, T.R. 2004. Insertion-deletion biases and the evolution of genome size. Gene 324: 15-34.

Gregory, T.R. 2005a. Genome Size Evolution in Animals. In The Evolution of the Genome (ed. T.R. Gregory), pp. 3-87. Elsevier Academic Press, Burlington, MA.

Gregory, T.R. 2005b. Synergy between sequence and size in large-scale genomics. Nat Rev Genet 6: 699-708.

Gusmao, L., P. Sanchez-Diz, F. Calafell, P. Martin, C.A. Alonso, F. Alvarez-Fernandez, C. Alves, L. Borjas-Fajardo, W.R. Bozzo, M.L. Bravo, J.J. Builes, J. Capilla, M. Carvalho, C. Castillo, C.I. Catanesi, D. Corach, A.M. Di Lonardo, R. Espinheira, E. Fagundes de Carvalho, M.J. Farfan, H.P. Figueiredo, I. Gomes, M.M. Lojo, M. Marino, M.F. Pinheiro, M.L. Pontes, V. Prieto, E. Ramos-Luis, J.A. Riancho, A.C. Souza Goes, O.A. Santapa, D.R. Sumita, G. Vallejo, L. Vidal Rioja, M.C. Vide, C.I. Vieira da Silva, M.R. Whittle, W. Zabala, M.T. Zarrabeitia, A. Alonso, A. Carracedo, and A. Amorim. 2005. Mutation rates at Y chromosome specific microsatellites. Hum Mutat 26: 520-528.

Hamada, H. and T. Kakunaga. 1982. Potential Z-DNA forming sequences are highly dispersed in the human genome. Nature 298: 396-398.

Hamada, H., M.G. Petrino, and T. Kakunaga. 1982. A novel repeated element with Z-DNA-forming potential is widely found in evolutionarily diverse eukaryotic genomes. Proc Natl Acad Sci U S A 79: 6465-6469.

Hamada, H., M. Seidman, B.H. Howard, and C.M. Gorman. 1984. Enhanced gene expression by the poly(dT-dG).poly(dC-dA) sequence. Mol Cell Biol 4: 2622-2630.

Hamilton, M.B., J.M. Braverman, and D.F. Soria-Hernanz. 2003. Patterns and relative rates of nucleotide and insertion/deletion evolution at six chloroplast intergenic regions in new world species of the Lecythidaceae. Mol Biol Evol 20: 1710-1721.

Hammock, E.A. and L.J. Young. 2004. Functional microsatellite polymorphism associated with divergent social structure in vole species. Mol Biol Evol 21: 1057-1063.

Hammock, E.A. and L.J. Young. 2005. Microsatellite instability generates diversity in brain and sociobehavioral traits. Science 308: 1630-1634.

Hardison, R.C. 2003. Comparative genomics. PLoS Biol 1: E58.

Hardison, R.C., K.M. Roskin, S. Yang, M. Diekhans, W.J. Kent, R. Weber, L. Elnitski, J. Li, M. O'Connor, D. Kolbe, S. Schwartz, T.S. Furey, S. Whelan, N. Goldman, A. Smit, W. Miller, F. Chiaromonte, and D. Haussler. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res 13: 13-26.

Hare, M.P. and S.R. Palumbi. 2003. High intron sequence conservation across three mammalian orders suggests functional constraints. Mol Biol Evol 20: 969-978.

Hashem, V.I., W.A. Rosche, and R.R. Sinden. 2004. Genetic recombination destabilizes (CTG)n.(CAG)n repeats in E. coli. Mutat Res 554: 95-109.

Hellmann, I., I. Ebersberger, S.E. Ptak, S. Paabo, and M. Przeworski. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am J Hum Genet 72: 1527-1535.

Henderson, S.T. and T.D. Petes. 1992. Instability of simple sequence DNA in Saccharomyces cerevisiae. Mol Cell Biol 12: 2749-2757.

Hile, S.E. and K.A. Eckert. 2004. Positive correlation between DNA polymerase pausing and mutagenesis within polypyrimidine/polypurine microsatellite sequences. J Mol Biol 335: 745-759.

Holmes, I. 2005. Using evolutionary Expectation Maximization to estimate indel rates. Bioinformatics 21: 2294-2300.

Huang, Q.Y., F.H. Xu, H. Shen, H.Y. Deng, Y.J. Liu, Y.Z. Liu, J.L. Li, R.R. Recker, and H.W. Deng. 2002. Mutation patterns at dinucleotide microsatellite loci in humans. Am J Hum Genet 70: 625-634.

Hudson, R.R. 1994. How can the low levels of DNA sequence variation in regions of the Drosophila genome with low recombination rates be explained? Proc Natl Acad Sci U S A 91: 6815-6818.

Hudson, R.R. and N.L. Kaplan. 1995. Deleterious background selection with recombination. Genetics 141: 1605-1617.

Hudson, R.R., M. Kreitman, and M. Aguade. 1987. A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.

International Chicken Genome Sequencing Consortium. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695-716.

International Chicken Polymorphism Map Consortium. 2004. A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature 432: 717-722.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299-1320.

International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.

Isobe, T., M. Yoshino, K. Mizuno, K. Lindahl, T. Koide, S. Gaudieri, T. Gojobori, and T. Shiroishi. 2002. Molecular Characterization of the Pb Recombination Hotspot in the Mouse Major Histocompatibility Complex Class II Region. Genomics 80: 229.

Jeffreys, A.J., R. Barber, P. Bois, J. Buard, Y.E. Dubrova, G. Grant, C.R. Hollies, C.A. May, R. Neumann, M. Panayi, A.E. Ritchie, A.C. Shone, E. Signer, J.D. Stead, and K. Tamaki. 1999. Human minisatellites, repeat DNA instability and meiotic recombination. Electrophoresis 20: 1665-1675.

Jeffreys, A.J., L. Kauppi, and R. Neumann. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 29: 217-222.

Jeffreys, A.J., J. Murray, and R. Neumann. 1998. High-resolution mapping of crossovers in human sperm defines a minisatellite-associated re-combination hotspot. Mol Cell 2: 267-273.

Jeffreys, A.J., V. Wilson, and S.L. Thein. 1985. Hypervariable 'minisatellite' regions in human DNA. Nature 314: 67-73.

Jensen-Seaman, M.I., T.S. Furey, B.A. Payseur, Y. Lu, K.M. Roskin, C.F. Chen, M.A. Thomas, D. Haussler, H.J. Jacob, I. Hellmann, K. Prufer, H. Ji, M.C. Zody, S. Paabo, S.E. Ptak, A. Yu, C. Zhao, Y. Fan, W. Jang, A.J. Mungall, P. Deloukas, A. Olsen, N.A. Doggett, N. Ghebranious, K.W. Broman, J.L. Weber, A. Kong, D.F. Gudbjartsson, J. Sainz, G.M. Jonsdottir, S.A. Gudjonsson, B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S.T. Palsson, M.L. Frigge, T.E. Thorgeirsson, J.R. Gulcher, and K. Stefansson. 2004. Comparative recombination rates in the rat, mouse, and human genomes. Genome Res 14: 528-538.

Kaplan, N.L., R.R. Hudson, and C.H. Langley. 1989. The "hitchhiking effect" revisited. Genetics 123: 887-899.

Kashi, Y., D. King, and M. Soller. 1997. Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 13: 74-78.

Kashi, Y. and D.G. King. 2006. Simple sequence repeats as advantageous mutators in evolution. Trends Genet 22: 253-259.

Kawakita, A., T. Sota, J.S. Ascher, M. Ito, H. Tanaka, and M. Kato. 2003. Evolution and phylogenetic utility of alignment gaps within intron sequences of three nuclear genes in bumble bees (Bombus). Mol Biol Evol 20: 87-92.

Kayser, M., L. Roewer, M. Hedman, L. Henke, J. Henke, S. Brauer, C. Kruger, M. Krawczak, M. Nagy, T. Dobosz, R. Szibor, P. de Knijff, M. Stoneking, and A. Sajantila. 2000. Characteristics and frequency of germline mutations at microsatellite loci from the human Y chromosome, as revealed by direct observation in father/son pairs. Am J Hum Genet 66: 1580-1588.

Kayser, M., E.J. Vowles, D. Kappei, and W. Amos. 2006. Microsatellite length differences between humans and chimpanzees at autosomal Loci are not found at equivalent haploid Y chromosomal Loci. Genetics 173: 2179-2186.

Kent, W.J., R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100: 11484-11489.

Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626.

Kofler, R., C. Schlotterer, and T. Lelley. 2007. SciRoKo: A new tool for whole genome microsatellite search and investigation. Bioinformatics.

Kong, A., D.F. Gudbjartsson, J. Sainz, G.M. Jonsdottir, S.A. Gudjonsson, B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S.T. Palsson, M.L. Frigge, T.E. Thorgeirsson, J.R. Gulcher, and K. Stefansson. 2002. A high-resolution recombination map of the human genome. Nat Genet 31: 241-247.

Kozlowski, J., M. Konarzewski, and A.T. Gawelczyk. 2003. Cell size as a link between noncoding DNA and metabolic rate scaling. Proc Natl Acad Sci U S A 100: 14080-14085.

Kruglyak, S., R.T. Durrett, M.D. Schug, and C.F. Aquadro. 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc Natl Acad Sci U S A 95: 10774-10778.

Lercher, M.J. and L.D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18: 337-340.

Levinson, G. and G.A. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 4: 203-221.

Li, W.H. 1997. Molecular Evolution. Sinauer Associates Inc, Sunderland, MA.

Li, Y.C., A.B. Korol, T. Fahima, A. Beiles, and E. Nevo. 2002. Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol Ecol 11: 2453-2465.

Li, Y.C., A.B. Korol, T. Fahima, and E. Nevo. 2004. Microsatellites within genes: structure, function, and evolution. Mol Biol Evol 21: 991-1007.

Lindblad, K. and M. Schalling. 1999. Expanded repeat sequences and disease. Semin Neurol 19: 289-299.

Lindblad-Toh, K. C.M. Wade T.S. Mikkelsen E.K. Karlsson D.B. Jaffe M. Kamal M. Clamp J.L. Chang E.J. Kulbokas, 3rd M.C. Zody E. Mauceli X. Xie M. Breen R.K. Wayne E.A. Ostrander C.P. Ponting F. Galibert D.R. Smith P.J. DeJong E. Kirkness P. Alvarez T. Biagi W. Brockman J. Butler C.W. Chin A. Cook J. Cuff M.J. Daly D. DeCaprio S. Gnerre M. Grabherr M. Kellis M. Kleber C. Bardeleben L. Goodstadt A. Heger C. Hitte L. Kim K.P. Koepfli H.G. Parker J.P. Pollinger S.M. Searle N.B. Sutter R. Thomas C. Webber J. Baldwin A. Abebe A. Abouelleil L. Aftuck M. Ait-Zahra T. Aldredge N. Allen P. An S. Anderson C. Antoine H. Arachchi A. Aslam L. Ayotte P. Bachantsang A. Barry T. Bayul M. Benamara A. Berlin D. Bessette B. Blitshteyn T. Bloom J. Blye L. Boguslavskiy C. Bonnet B. Boukhgalter A. Brown P. Cahill N. Calixte J. Camarata Y. Cheshatsang J. Chu M. Citroen A. Collymore P. Cooke T. Dawoe R. Daza K. Decktor S. DeGray N. Dhargay K. Dooley K. Dooley P. Dorje K. Dorjee L. Dorris N. Duffey A. Dupes O. Egbiremolen R. Elong J. Falk A. Farina S. Faro D. Ferguson P. Ferreira S. Fisher M. FitzGerald K. Foley C. Foley A. Franke D. Friedrich D. Gage M. Garber G. Gearin G. Giannoukos T. Goode A. Goyette J. Graham E. Grandbois K. Gyaltsen N. Hafez D. Hagopian B. Hagos J. Hall C. Healy R. Hegarty T. Honan A. Horn N. Houde L. Hughes L. Hunnicutt M. Husby B. Jester C. Jones A. Kamat B. Kanga C. Kells D. Khazanovich A.C. Kieu P. Kisner M. Kumar K. Lance T. Landers M. Lara W. Lee J.P. Leger N. Lennon L. Leuper S. LeVine J. Liu X. Liu Y. Lokyitsang T. Lokyitsang A. Lui J. Macdonald J. Major R. Marabella K. Maru C. Matthews S. McDonough T. Mehta J. Meldrim A. Melnikov L. Meneus A. Mihalev T. Mihova K. Miller R. Mittelman V. Mlenga L. Mulrain G. Munson A. Navidi J. Naylor T. Nguyen N. Nguyen C. Nguyen T. Nguyen R. Nicol N. Norbu C. Norbu N. Novod T. Nyima P. Olandt B. O'Neill K. O'Neill S. Osman L. Oyono C. Patti D. Perrin P. Phunkhang F. Pierre M. Priest A. Rachupka S. Raghuraman R. Rameau V. Ray C. Raymond F. Rege C. Rise J. Rogers P. Rogov J. Sahalie S. Settipalli T. Sharpe T. Shea M. Sheehan N. Sherpa J. Shi D. Shih J. Sloan C. Smith T. Sparrow J. Stalker N. Stange-Thomann S. Stavropoulos C. Stone S. Stone S. Sykes P. Tchuinga P. Tenzing S. Tesfaye D. Thoulutsang Y. Thoulutsang K. Topham I. Topping T. Tsamla H. Vassiliev V. Venkataraman A. Vo T. Wangchuk T. Wangdi M. Weiand J. Wilkinson A. Wilson S. Yadav S. Yang X. Yang G. Young Q. Yu J. Zainoun L. Zembek A. Zimmer and E.S. Lander. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803-819.

Litt, M. and J.A. Luty. 1989. A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet 44: 397-401.

Loots, G.G., R.M. Locksley, C.M. Blankespoor, Z.E. Wang, W. Miller, E.M. Rubin, and K.A. Frazer. 2000. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136-140.

Ludwig, M.Z. 2002. Functional evolution of noncoding DNA. Curr Opin Genet Dev 12: 634-639.

Ludwig, M.Z., C. Bergman, N.H. Patel, and M. Kreitman. 2000. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403: 564-567.

Lynch, M. 2007. The Origins of Genome Architecture. Sinauer, Sunderland, MA.

Lyons, L.-A., M.-M. Raymond, and S.-J. O'Brien. 1994. Comparative genomics: The next generation. Animal Biotechnology 5: 103-111.

Ma, J., L. Zhang, B.B. Suh, B.J. Raney, R.C. Burhans, W.J. Kent, M. Blanchette, D. Haussler, and W. Miller. 2006. Reconstructing contiguous regions of an ancestral genome. Genome Res 16: 1557-1565.

Majewski, J. and J. Ott. 2000. GT repeats are associated with recombination on human chromosome 22. Genome Res 10: 1108-1114.

Makova, K.D., S. Yang, and F. Chiaromonte. 2004. Insertions and deletions are male biased too: a whole-genome analysis in rodents. Genome Res 14: 567-573.

Margulies, E.H., M. Blanchette, D. Haussler, and E.D. Green. 2003. Identification and characterization of multi-species conserved sequences. Genome Res 13: 2507-2518.

Maynard Smith, J. and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet Res 23: 23-35.

Mayor, C., M. Brudno, J.R. Schwartz, A. Poliakov, E.M. Rubin, K.A. Frazer, L.S. Pachter, and I. Dubchak. 2000. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16: 1046-1047.

McDonald, J.H. and M. Kreitman. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652-654.

McVean, G.A., S.R. Myers, S. Hunt, P. Deloukas, D.R. Bentley, and P. Donnelly. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304: 581-584.

Mills, R.E., C.T. Luttig, C.E. Larkins, A. Beauchamp, C. Tsui, W.S. Pittard, and S.E. Devine. 2006. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182-1190.

Moore, G.E. 1965. Cramming more components onto integrated circuits. Electronics 38.

Morgante, M., M. Hanafey, and W. Powell. 2002. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat Genet 30: 194-200.

Morral, N., V. Nunes, T. Casals, and X. Estivill. 1991. CA/GT microsatellite alleles within the cystic fibrosis transmembrane conductance regulator (CFTR) gene are not generated by unequal crossingover. Genomics 10: 692-698.

Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.

Myers, S., L. Bottolo, C. Freeman, G. McVean, and P. Donnelly. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321-324.

Müller, K. 2006. Incorporating information from length-mutational events into phylogenetic analysis. Mol Phylogenet Evol 38: 667-676.

Nachman, M.W. 1997. Patterns of DNA variability at X-linked loci in Mus domesticus. Genetics 147: 1303-1316.

Nachman, M.W. 2001. Single nucleotide polymorphisms and recombination rate in humans. Trends Genet 17: 481-485.

Nachman, M.W., V.L. Bauer, S.L. Crowell, and C.F. Aquadro. 1998. DNA variability and recombination rates at X-linked loci in humans. Genetics 150: 1133-1141.

Naylor, L.H. and E.M. Clark. 1990. d(TG)n.d(CA)n sequences upstream of the rat prolactin gene form Z-DNA and inhibit gene transcription. Nucleic Acids Res 18: 1595-1601.

Neafsey, D.E. and S.R. Palumbi. 2003. Genome size evolution in pufferfish: a comparative analysis of diodontid and tetraodontid pufferfish genomes. Genome Res 13: 821-830.

Ning, Z., A.J. Cox, and J.C. Mullikin. 2001. SSAHA: a fast search method for large DNA databases. Genome Res 11: 1725-1729.

Nordborg, M., B. Charlesworth, and D. Charlesworth. 1996. The effect of recombination on background selection. Genet Res 67: 159-174.

Ogurtsov, A.Y., S. Sunyaev, and A.S. Kondrashov. 2004. Indel-based evolutionary distance and mouse-human divergence. Genome Res 14: 1610-1616.

Ohta, T. 2002. Near-neutrality in evolution of genes and gene regulation. Proc Natl Acad Sci U S A 99: 16134-16137.

Ometto, L., W. Stephan, and D. De Lorenzo. 2005. Insertion/deletion and nucleotide polymorphism data reveal constraints in Drosophila melanogaster introns and intergenic regions. Genetics 169: 1521-1527.

Ophir, R. and D. Graur. 1997. Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205: 191-202.

Pardi, F., R.M. Sibly, M.J. Wilkinson, and J.C. Whittaker. 2005. On the structural differences between markers and genomic AC microsatellites. J Mol Evol 60: 688-693.

Parkinson, J. and M. Blaxter. 2003. SimiTri--visualizing similarity relationships for groups of sequences. Bioinformatics 19: 390-395.

Payseur, B.A. and M.W. Nachman. 2000. Microsatellite variation and recombination rate in the human genome. Genetics 156: 1285-1298.

Petrov, D.A. 2002a. DNA loss and evolution of genome size in Drosophila. Genetica 115: 81-91.

Petrov, D.A. 2002b. Mutational equilibrium model of genome size evolution. Theor Popul Biol 61: 531-544.

Petrov, D.A., T.A. Sangster, J.S. Johnston, D.L. Hartl, and K.L. Shaw. 2000. Evidence for DNA loss as a determinant of genome size. Science 287: 1060-1062.

Pollard, K.S., S.R. Salama, N. Lambert, M.A. Lambot, S. Coppens, J.S. Pedersen, S. Katzman, B. King, C. Onodera, A. Siepel, A.D. Kern, C. Dehay, H. Igel, M. Ares, Jr., P. Vanderhaeghen, and D. Haussler. 2006. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.

Price, A.L., E. Eskin, and P.A. Pevzner. 2004. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14: 2245-2252.

Primmer, C.R., N. Saino, A.P. Moller, and H. Ellegren. 1996. Directional evolution in germline microsatellite mutations. Nat Genet 13: 391-393.

Ptak, S.E., D.A. Hinds, K. Koehler, B. Nickel, N. Patil, D.G. Ballinger, M. Przeworski, K.A. Frazer, and S. Paabo. 2005. Fine-scale recombination patterns differ between chimpanzees and humans. Nat Genet 37: 429-434.

Rajeevan, H., M.V. Osier, K.H. Cheung, H. Deng, L. Druskin, R. Heinzen, J.R. Kidd, S. Stein, A.J. Pakstis, N.P. Tosches, C.C. Yeh, P.L. Miller, and K.K. Kidd. 2003. ALFRED: the ALelle FREquency Database. Update. Nucleic Acids Res 31: 270-271.

Rana, N.A., N.D. Ebenezer, A.R. Webster, A.R. Linares, D.B. Whitehouse, S. Povey, and A.J. Hardcastle. 2004. Recombination hotspots and block structure of linkage disequilibrium in the human genome exemplified by detailed analysis of PGM1 on 1p31. Hum Mol Genet 13: 3089-3102.

Rozas, J., J.C. Sanchez-DelBarrio, X. Messeguer, and R. Rozas. 2003. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19: 2496-2497.

Sainudiin, R., R.T. Durrett, C.F. Aquadro, and R. Nielsen. 2004. Microsatellite Mutation Models: Insights From a Comparison of Humans and Chimpanzees. Genetics 168: 383-395.

Samonte, R.V. and E.E. Eichler. 2002. Segmental duplications and the evolution of the primate genome. Nat Rev Genet 3: 65-72.

Santoro, C., F. Costanzo, and G. Ciliberto. 1984. Inhibition of eukaryotic tRNA transcription by potential Z-DNA sequences. Embo J 3: 1553-1559.

Schlötterer, C. and D. Tautz. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res 20: 211-215.

Schultes, N.P. and J.W. Szostak. 1991. A poly(dA.dT) tract is a component of the recombination initiation site at the ARG4 locus in Saccharomyces cerevisiae. Mol Cell Biol 11: 322-328.

Schwartz, S., W.J. Kent, A. Smit, Z. Zhang, R. Baertsch, R.C. Hardison, D. Haussler, and W. Miller. 2003. Human-mouse alignments with BLASTZ. Genome Res 13: 103-107.

Schwartz, S., Z. Zhang, K.A. Frazer, A. Smit, C. Riemer, J. Bouck, R. Gibbs, R. Hardison, and W. Miller. 2000. PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 10: 577-586.

Selkoe, K.A. and R.J. Toonen. 2006. Microsatellites for ecologists: a practical guide to using and evaluating microsatellite markers. Ecol Lett 9: 615-629.

Shabalina, S.A., A.Y. Ogurtsov, V.A. Kondrashov, and A.S. Kondrashov. 2001. Selective constraint in intergenic regions of human and mouse genomes. Trends Genet 17: 373-376.

Sharp, A.J., D.P. Locke, S.D. McGrath, Z. Cheng, J.A. Bailey, R.U. Vallente, L.M. Pertz, R.A. Clark, S. Schwartz, R. Segraves, V.V. Oseroff, D.G. Albertson, D. Pinkel, and E.E. Eichler. 2005. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77: 78-88.

Siepel, A., G. Bejerano, J.S. Pedersen, A.S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L.W. Hillier, S. Richards, G.M. Weinstock, R.K. Wilson, R.A. Gibbs, W.J. Kent, W. Miller, and D. Haussler. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034-1050.

Silva, J.C. and A.S. Kondrashov. 2002. Patterns in spontaneous mutation revealed by human-baboon sequence comparison. Trends Genet 18: 544-547.

Smith, N.G., M.T. Webster, and H. Ellegren. 2002. Deterministic mutation rate variation in the human genome. Genome Res 12: 1350-1356.

Sonnhammer, E.L. and R. Durbin. 1995. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167: GC1-10.

Spencer, C.C., P. Deloukas, S. Hunt, J. Mullikin, S. Myers, B. Silverman, P. Donnelly, D. Bentley, and G. McVean. 2006. The influence of recombination on human genetic diversity. PLoS Genet 2: e148.

Stajich, J.E., D. Block, K. Boulez, S.E. Brenner, S.A. Chervitz, C. Dagdigian, G. Fuellen, J.G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, C.J. Mungall, B.I. Osborne, M.R. Pocock, P. Schattner, M. Senger, L.D. Stein, E. Stupka, M.D. Wilkinson, and E. Birney. 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12: 1611-1618.

Struhl, K. 1985. Naturally occurring poly(dA-dT) sequences are upstream promoter elements for constitutive transcription in yeast. Proc Natl Acad Sci U S A 82: 8419-8423.

Tamura, K., J. Dudley, M. Nei, and S. Kumar. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) Software Version 4.0. Mol Biol Evol.

Tautz, D. and M. Renz. 1984. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res 12: 4127-4138.

Tautz, D., M. Trick, and G.A. Dover. 1986. Cryptic simplicity in DNA is a major source of genetic variation. Nature 322: 652-656.

Taylor, M.S., C.P. Ponting, and R.R. Copley. 2004. Occurrence and consequences of coding sequence insertions and deletions in Mammalian genomes. Genome Res 14: 555-566.

Templeton, A.R., A.G. Clark, K.M. Weiss, D.A. Nickerson, E. Boerwinkle, and C.F. Sing. 2000. Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am J Hum Genet 66: 69-83.

The ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816.

Thomas, J.W., J.W. Touchman, R.W. Blakesley, G.G. Bouffard, S.M. Beckstrom-Sternberg, E.H. Margulies, M. Blanchette, A.C. Siepel, P.J. Thomas, J.C. McDowell, B. Maskeri, N.F. Hansen, M.S. Schwartz, R.J. Weber, W.J. Kent, D. Karolchik, T.C. Bruen, R. Bevan, D.J. Cutler, S. Schwartz, L. Elnitski, J.R. Idol, A.B. Prasad, S.Q. Lee-Lin, V.V. Maduro, T.J. Summers, M.E. Portnoy, N.L. Dietrich, N. Akhter, K. Ayele, B. Benjamin, K. Cariaga, C.P. Brinkley, S.Y. Brooks, S. Granite, X. Guan, J. Gupta, P. Haghighi, S.L. Ho, M.C. Huang, E. Karlins, P.L. Laric, R. Legaspi, M.J. Lim, Q.L. Maduro, C.A. Masiello, S.D. Mastrian, J.C. McCloskey, R. Pearson, S. Stantripop, E.E. Tiongson, J.T. Tran, C. Tsurgeon, J.L. Vogt, M.A. Walker, K.D. Wetherby, L.S. Wiggins, A.C. Young, L.H. Zhang, K. Osoegawa, B. Zhu, B. Zhao, C.L. Shu, P.J. De Jong, C.E. Lawrence, A.F. Smit, A. Chakravarti, D. Haussler, P. Green, W. Miller, and E.D. Green. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793.

Thomas, M., S. Ihle, I. Ravagarimanana, S. Kraechter, T. Wiehe, and D. Tautz. 2005. Microsatellite variability in wild populations of the house mouse is not influenced by differences in chromosomal recombination rates. Biol J Linn Soc 84: 629-635.

Toth, G., Z. Gaspari, and J. Jurka. 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10: 967-981.

Uhlemann, A.C., N.A. Szlezak, R. Vonthein, J. Tomiuk, S.A. Emmer, B. Lell, P.G. Kremsner, and J.F. Kun. 2004. DNA phasing by TA dinucleotide microsatellite length determines in vitro and in vivo expression of the gp91phox subunit of NADPH oxidase and mediates protection against severe malaria. J Infect Dis 189: 2227-2234.

Weber, J.L. and P.E. May. 1989. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 44: 388-396.

Weber, J.L. and C. Wong. 1993. Mutation of human short tandem repeats. Hum Mol Genet 2: 1123-1128.

Webster, M.T., E. Axelsson, and H. Ellegren. 2006. Strong regional biases in nucleotide substitution in the chicken genome. Mol Biol Evol 23: 1203-1216.

Webster, M.T., N.G. Smith, L. Hultin-Rosenberg, P.F. Arndt, and H. Ellegren. 2005. Male-driven biased gene conversion governs the evolution of base composition in human alu repeats. Mol Biol Evol 22: 1468-1474.

Wienberg, J. and R. Stanyon. 1995. Chromosome painting in mammals as an approach to comparative genomics. Curr Opin Genet Dev 5: 792-797.

Wierdl, M., M. Dominska, and T.D. Petes. 1997. Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146: 769-779.

Wilson, M.D., C. Riemer, D.W. Martindale, P. Schnupf, A.P. Boright, T.L. Cheung, D.M. Hardy, S. Schwartz, S.W. Scherer, L.C. Tsui, W. Miller, and B.F. Koop. 2001. Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res 29: 1352-1365.

Winckler, W., S.R. Myers, D.J. Richter, R.C. Onofrio, G.J. McDonald, R.E. Bontrop, G.A. McVean, S.B. Gabriel, D. Reich, P. Donnelly, and D. Altshuler. 2005. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308: 107-111.

Vinogradov, A.E. 1997. Nucleotypic effect in homeotherms: body-mass independent resting metabolic rate of passerine birds is related to genome size. Evolution 51: 220-225.

Vinogradov, A.E. 2002. Growth and decline of introns. Trends Genet 18: 232-236.

Vinogradov, A.E. and O.V. Anatskaya. 2006. Genome size and metabolic intensity in tetrapods: a tale of two lines. Proc Biol Sci 273: 27-32.

Wolfe, K.H., P.M. Sharp, and W.H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337: 283-285.

Womack, J.E. and S.R. Kata. 1995. Bovine genome mapping: evolutionary inference and the power of comparative genomics. Curr Opin Genet Dev 5: 725-733.

Woolfe, A., M. Goodson, D.K. Goode, P. Snell, G.K. McEwen, T. Vavouri, S.F. Smith, P. North, H. Callaway, K. Kelly, K. Walter, I. Abnizova, W. Gilks, Y.J. Edwards, J.E. Cooke, and G. Elgar. 2005. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol 3: e7.

Yang, S., A.F. Smit, S. Schwartz, F. Chiaromonte, K.M. Roskin, D. Haussler, W. Miller, and R.C. Hardison. 2004. Patterns of insertions and their covariation with substitutions in the rat, mouse, and human genomes. Genome Res 14: 517-527.

Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13: 555-556.

Yu, A., C. Zhao, Y. Fan, W. Jang, A.J. Mungall, P. Deloukas, A. Olsen, N.A. Doggett, N. Ghebranious, K.W. Broman, and J.L. Weber. 2001. Comparison of human genetic and sequence-based physical maps. Nature 409: 951-953.

Zhang, Z. and M. Gerstein. 2003. Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res 31: 5338-5348.

Zhang, Z., S. Schwartz, L. Wagner, and W. Miller. 2000. A greedy algorithm for aligning DNA sequences. J Comput Biol 7: 203-214.

Acta Universitatis Upsaliensis
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 348

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

Distribution: publications.uu.se
urn:nbn:se:uu:diva-8240
