Introduction to Genomics using yeasts
as model organisms
Strasbourg team
Bleykasten Claudine
Despons Laurence
de Montigny Jacky
Friedrich Anne
Ivanov Samy (Sofia)
Jung Paul
Kugler Valérie
Leh Véronique
Potier Serge
Schacherer Joseph
Souciet Jean-Luc
Straub Marie-laure
Spehner Catherine
Uzunov Zlatyo (Sofia)
Génolevures teams
J.-L. Souciet, Université de Strasbourg, CNRS (Génolevures Project Coordinator)
28 rue Goethe, F-67000 Strasbourg, France
B. Dujon, Institut Pasteur, CNRS, Université Pierre et Marie Curie
UFR927, 25 rue du Docteur Roux, F-75724 Paris, France
C. Gaillardin, AgroParisTech, INRA, CNRS, F-78850 Thiverval-Grignon, France
D. Sherman, LaBRI, CNRS, 351 cours de la Libération, 33405 Talence Cedex, France
associated with the sequencing center
J. Weissenbach, Génoscope (CEA) 2 rue Gaston Crémieux, BP 191, F-91057 Evry Cedex, France
http://www.genolevures.org/
Preliminary introduction
Darwin
2009: bicentenary celebration,
now over
2010 is the Year of Biodiversity
However, we have to remember
Darwin’s contribution to
the study of genome evolution
Darwinian evolution in the
light of genomics
Koonin E.V. Nucleic Acids Research 2009, 37, 1011-1034
In brief, from Darwin:
1/ Undirected, random variation is the main driving force of evolution
2/ Evolution proceeds by fixation of the rare beneficial variations and
elimination of deleterious variations = natural selection
3/ Beneficial changes that are fixed by natural selection are
infinitesimally small = evolution is the accumulation of these tiny modifications
4/ Evolutionary processes have remained the same throughout
the history of life
5/ Evolution can be represented by a single tree, the TOL (Tree Of Life)
(much later after Darwin: LUCA = Last Universal Common Ancestor)
Conclusions from genome analysis (genomics sensu lato) studies, April 2009:
1/ Infinitesimal changes = point mutations
2/ All kinds of duplications (WGD, segmental, single gene) or deletions
(single gene, chromosomal part, gene erasure or relics) = gene loss
3/ Horizontal Gene Transfer (HGT) of single genes or blocks of genes
4/ Various types of genome rearrangements
5/ Diverse types of selfish genetic elements
6/ No unidirectional fate for genes (e.g. gained and lost several times)
7/ The relative contributions of the different evolutionary forces vary greatly
from lineage to lineage
8/ Each genome is a PALIMPSEST = a diverse collection of genes with
different evolutionary fates
A palimpsest is a manuscript page from a scroll or book that has been
scraped off and used again. The word "palimpsest" comes through
Latin from Greek παλιν + ψαω = ("again" + "I scrape"), and meant
"scraped (clean and used) again." Romans wrote on wax-coated
tablets that could be smoothed and reused, and a passing use of the
rather bookish term "palimpsest" by Cicero seems to refer to this
practice.
The term has come to be used in similar context in a variety of
disciplines, notably architectural archaeology….
and Genomics!
Definition extracted from Wikipedia
Genomics
Basic questions:
How do genes arise?
How do genes disappear?
How do genes arise?
First, by duplication.
By which mechanisms?
- segmental duplications
- tandem duplications
- duplications by retroposition
- duplication by aneuploidy
- duplication by polyploidy
- etc….
How do genes disappear?
- deletions
- accumulation of numerous point mutations
- translocations
The Génolevures strategy
- phylogenetically related species
- choice of species or clades (Kurtzman)
- criteria?
- complete sequencing? telomere to telomere
(link to new emerging sequencing technologies)
- just working by similarity, no functional analysis
- annotation by experts?
(high-quality proteome, and most of the detectable ncRNAs are identified)
- database devoted to comparative genomic analysis
(http://www.genolevures.org/)
Fungal biodiversity
(Euascomycetes)
(Hemiascomycetes)
(Archiascomycetes)
At least 3 major groups within Saccharomycotina
(previously Hemiascomycetes):
- Saccharomyces complex or Saccharomycetaceae:
14 clades (Kurtzman 2003), point centromeres
- reassignment of the genetic code: CTG codons are translated
as serine rather than leucine, e.g. Candida albicans or
Debaryomyces hansenii
- GC-rich genome, e.g. Yarrowia lipolytica
WGD
Kurtzman 2003
14 clades for the
Saccharomyces complex or Saccharomycetaceae (compact centromere)
- Wolfe 1997 - Kellis 2004
(P.Philippsen)
Protoploids
- goals, strains, genome sizes: eukaryotes/prokaryotes
- chromosomal mapping, genomic DNA banks,
and genome coverage, redundancy, overlapping
fragments
- DNA sequencing (and new methods: pyrosequencing,
evolving very fast = new approaches)
- chromosomal assembly
- identification of the genetic elements (ORFs/CDSs, TR,
tRNA, repetitive sequences)
- definitions: orthologs, paralogs, duplication,
common and specific genes, species-specific genes,
synteny
- ncRNA,…
- comparisons using BLAST
- mirror effects
- small ORFs/CDSs
- intron detection (a possible strategy)
- mechanisms of genome evolution
- comparative genomics and phylogenetics
- impact of new sequencing technologies: 454, Solexa
- other developments: transcriptomics
Tree of life (TOL): three domains of life (cf. LUCA)
- Eubacteria
- Archaea
- Eukaryotes
Model organisms:
- Saccharomyces cerevisiae
- Caenorhabditis elegans
- Drosophila melanogaster
- Mus musculus
- Arabidopsis thaliana
Goals of genomics
Systematic sequencing of each chromosome
(eukaryotes: from telomere to telomere)
(prokaryotes: often circular, but not always (!); mind the number of
DNA molecules per strain)
Both DNA strands, with perfect complementarity
Towards 0% error (difficult)
Once the previous steps are finished, it becomes
possible (theoretically) to detect all the genetic objects
within the sequenced species
General strategy:
- step 1: from chromosome to the
DNA sequence
- step 2: identification within the
DNA sequence of all the genetic
objects for production of a very
precise chromosomal map
The G+C content:
Highly variable among prokaryotes
- 28.6% Borrelia burgdorferi
- 69.4% Thermus thermophilus
Less variable among eukaryotes
- 41% Mus musculus
- 38% S. cerevisiae
- 49% Yarrowia lipolytica
- 35% Arabidopsis thaliana
One example
Link = genetic code, codon usage
The G+C value is important, for example:
Nature, March 2010, J.A. Chapman et al.
Hydra magnipapillata: 29% G+C
Associated with Hydra is a bacterium,
a novel Curvibacter (the reference name will be published later),
with a G+C content of up to 60%
If contigs (see later) differ markedly in their respective
G+C content, this is an indication that they probably come from
different genomes
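The G+C check on contigs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three contigs are made-up toy sequences, the function names are hypothetical, and the 15-point deviation threshold is arbitrary rather than a published value. The sketch also assumes that most contigs belong to the host genome, so that the median G+C value represents the host.

```python
from statistics import median


def gc_content(seq: str) -> float:
    """Return the G+C content of a DNA sequence, in percent."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:  # empty sequence, or only ambiguous bases such as N
        return 0.0
    return 100.0 * gc / (gc + at)


def flag_outlier_contigs(contigs: dict, max_deviation: float = 15.0) -> list:
    """Return the names of contigs whose G+C content differs from the
    median of all contigs by more than max_deviation percentage points.

    max_deviation is an arbitrary illustrative threshold (assumption).
    """
    values = {name: gc_content(seq) for name, seq in contigs.items()}
    centre = median(values.values())
    return [name for name, gc in values.items() if abs(gc - centre) > max_deviation]


# Toy contigs, invented for this example only
contigs = {
    "contig_1": "ATATATATATATATGCGCGC",  # 30% G+C
    "contig_2": "TTAATTAATTAAGGCCTTAA",  # 20% G+C
    "contig_3": "GCGCGCGCGCGCATATATAT",  # 60% G+C
}

for name, seq in contigs.items():
    print(name, f"{gc_content(seq):.1f}%")
print("contigs possibly from another genome:", flag_outlier_contigs(contigs))
```

On real data the contigs would be much longer, and a flagged contig is only a hint that it may come from an associated organism, as with the Curvibacter contigs in the Hydra project.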
Classes of interspersed repeat in the human genome
(class: length; copy number; fraction of genome)
- LINEs (autonomous): 6-8 kb; 850,000 copies; 21%
- SINEs (non-autonomous): 100-300 bp; 1,500,000 copies; 13%
- Retrovirus-like elements (autonomous: 6-11 kb; non-autonomous: 1.5-3 kb); 450,000 copies; 8%
- DNA transposon fossils (autonomous: 2-3 kb; non-autonomous: 80-3,000 bp); 300,000 copies; 3%
Diagram labels: LINEs: AAA; SINEs: A, B, AAA; retrovirus-like elements: gag pol (env), (gag);
DNA transposon fossils: transposase
Retrovirus
LTR LTR gag pol env
7 kb
Ty1/copia retrotransposon
LINE (Long Interspersed Nuclear Element): gag ? pol poly(A), 6 kb
SINE (Short Interspersed Nuclear Element): 0.3 kb
JA Chapman et al. Nature 000, 1-5 (2010) doi:10.1038/nature08830
Dynamics of transposable element expansion in
Hydra reveals several periods of transposon activity.
Sequencing strategy :
(if using only Sanger technology, up to year 2007)
Genomic DNA bank 1: 3 to 5 kb long DNA fragments; high copy number E. coli vector
Genomic DNA bank 2: 5 to 7/10 kb long DNA fragments; low copy number E. coli vector
Genomic DNA bank:
- cosmids (35-40 kb)
- BACs (70-150 kb)
Sequencing of both ends: required for contig assembly
Checking the supercontig assembly:
- by fingerprint
- by assignment of marker genes to chromosomes
(hybridization or chromosome painting, depending on the species)
(However, this depends on the genome size and on the % of repetitive elements)
Genomics using yeasts as model organisms:
- goals, strains, genome size: eukaryotes/prokaryotes
- chromosomal mapping, genomic DNA banks,
and genome coverage, redundancy, overlapping
fragments
- DNA sequencing (and new methods: pyrosequencing)
- chromosomal assembly
- identification of the genetic elements (ORFs/CDSs, TR,
tRNA, repetitive sequences)
- definitions: orthologs, paralogs, duplication,
common and specific genes, speciation genes?, synteny
- comparison with BLAST
- mirror effects
- small ORFs/CDSs
- intron detection
- mechanisms of genome evolution
- comparative genomics and phylogenetics
- impact of new sequencing technologies: 454
- other developments: transcriptomics
Number of amino acids of the different enzymes of the glycolytic pathway in S. cerevisiae
Glycolysis
- GLK1: aldohexose specific glucokinase, 500 aa
- HXK2: hexokinase II, 486 aa
- HXK1: hexokinase I, 485 aa
- PGI1: glucose-6-phosphate isomerase, 554 aa
- FBA1: fructose-bisphosphate aldolase, 359 aa
- GPD1: glyceraldehyde-3-phosphate dehydrogenase 1, 332 aa
- GPD2: glyceraldehyde-3-phosphate dehydrogenase 2, 332 aa
- GPD3: glyceraldehyde-3-phosphate dehydrogenase 3, 332 aa
- PGK1: phosphoglycerate kinase, 416 aa
- GPM1: phosphoglycerate mutase, 247 aa
- GPM3: phosphoglycerate mutase, 303 aa
- GPM2: phosphoglycerate mutase, 311 aa
- ENO2: enolase II (2-phosphoglycerate dehydratase), 437 aa
- ENO1: enolase I (2-phosphoglycerate dehydratase), 437 aa
- PYK2: pyruvate kinase, 506 aa
- PYK1: pyruvate kinase, 500 aa
Alcoholic fermentation
- PDC1: pyruvate decarboxylase, isozyme 1, 563 aa
- PDC5: pyruvate decarboxylase, isozyme 2, 563 aa
- PDC6: pyruvate decarboxylase, isozyme 3, 563 aa
Part of a chromosome area:
CDS or gene 1 - intergenic area - CDS or gene 2 - intergenic area - CDS or gene 3
The cover shows the devastating results of potato blight
(Phytophthora infestans) infestation - the pathogen that triggered
the Irish potato famine in the nineteenth century. The genome
of this still dangerous pathogen has now been sequenced,
revealing fast evolving effector genes that may contribute to
the rapid adaptability to host plants that has made potato blight
so difficult to control. Nature 461, pp315-438, September 2009
75% repeated sequences,
mainly RT elements
- goals, strains, genome
sizes: eukaryotes/prokaryotes
Standardisation at the haploid level
C value represents the haploid genome
size of an organism
(C for « constant » or « characteristic »)
(1000 bp = 1 kb, 1000 kb = 1 Mb, 1000 Mb = 1 Gb)
Completely sequenced genomes of some Archaea
Completely sequenced genomes of some Eubacteria
Completely sequenced genomes of some Eukaryotes
Conclusion:
In prokaryotes, genome size is relatively
small and correlates with the number
of genes encoding proteins
In eukaryotes, there is no correlation between
the genome size and the number of
genes encoding proteins
- species
- strains and polymorphism (HETEROZYGOTE!)
- genetic analysis = deletions,
duplications, translocations,… in addition to basic
point mutations
- consequence = the gene number differs
from strain to strain (very often « gene number »
refers only to the genes encoding proteins, but this
is wrong: there are also tDNA, rDNA, snRNA genes…)
- « the gene number of Saccharomyces
cerevisiae is 5813 »
is a wrong expression
- « the gene number of Saccharomyces
cerevisiae
strain S288C is 5813 » is a good expression,
+ the date, because annotation is a never-ending
process…
The same holds for Homo sapiens sapiens and all
other living organisms
Genomic DNA banks (10 X coverage)
- redundancy (including overlaps at fragment boundaries)
- representativeness
- rich (only clones with the expected DNA inserts (99%))
The BAC cloning scheme
Average Fragment Sizes of Mammalian DNA Produced by
Cleavage with Rare Cutting Restriction Enzymes
Probability of Having One or More Clones/Locus
within a Library as a Function of Library Size

Library size (genome coverage)    P (%)
0.5     39.3
1       63.2
2       86.4
3       95.0
4       98.2
5       99.3
6       99.75
7       99.91
8       99.97
9       99.99
10      99.995

The probability of finding x clones from a library of c coverage is calculated as
$P(x) = \frac{c^{x}}{x!} e^{-c}$
The probabilities shown are those of finding one or more clones (1 - P(0)) for libraries with genome
coverage from 0.5 to 10.
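The table can be recomputed from the Poisson formula given on the slide, P(x) = c^x e^(-c) / x!, where c is the genome coverage of the library and x the number of clones covering a given locus. The short Python sketch below does this; the function names are illustrative and not part of the original course material.

```python
import math


def prob_x_clones(x: int, c: float) -> float:
    """Poisson probability of finding exactly x clones for a locus
    in a library of genome coverage c."""
    return (c ** x) * math.exp(-c) / math.factorial(x)


def prob_at_least_one(c: float) -> float:
    """Probability of finding one or more clones for a locus: 1 - P(0)."""
    return 1.0 - prob_x_clones(0, c)


# Recompute the probabilities for the coverages listed in the table
for c in (0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    print(f"coverage {c:>4}: {100 * prob_at_least_one(c):.3f} %")
```

Because 1 - P(0) reduces to 1 - e^(-c), each additional unit of coverage divides the fraction of missing loci by e, which is why the gain becomes marginal beyond about 7 to 10x.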
Number of Clones Required for 7.5x Genomic Coverage

Haploid genome size (Mb)   Average insert size (kb)   No. of clones   No. of 384-well plates
20      50     3000       8
100     100    7500       20
3000    50     450 000    1172
3000    150    150 000    391
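The clone numbers for a 7.5x library follow from simple arithmetic: number of clones = coverage x haploid genome size / average insert size, and the number of plates is the clone count divided by 384, rounded up. The Python sketch below reproduces this calculation; the function names are hypothetical and the conversion 1 Mb = 1000 kb is the one used in this course.

```python
import math


def clones_required(genome_mb: float, insert_kb: float, coverage: float = 7.5) -> int:
    """Number of clones needed to reach the given genome coverage.

    genome_mb: haploid genome size in Mb; insert_kb: average insert size in kb.
    """
    genome_kb = genome_mb * 1000  # 1 Mb = 1000 kb
    return math.ceil(coverage * genome_kb / insert_kb)


def plates_required(n_clones: int, wells: int = 384) -> int:
    """Number of multi-well plates needed to store n_clones (one clone per well)."""
    return math.ceil(n_clones / wells)


for genome_mb, insert_kb in ((20, 50), (3000, 50), (3000, 150)):
    n = clones_required(genome_mb, insert_kb)
    print(f"{genome_mb} Mb genome, {insert_kb} kb inserts: "
          f"{n} clones, {plates_required(n)} plates")
```

The comparison of the two 3000 Mb rows shows the practical benefit of large inserts: tripling the insert size divides the number of clones and plates to handle by three.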
The importance of the physical mapping of large DNA
fragments:
duplicated DNA sequences and possible assembly
mistakes
- contigs
- ordering large DNA fragments or
chromosome walking
- correspondence between the contig map and the physical map
Ordering clones by chromosome walking
(figure labels: X, A, B; chromosome V; YEP06, FCY21, FCY22)
Contig map check by fingerprint:
an example with NotI digestion
HindIII is also often used (target size and resolution)
Note: for large DNA fragments (more than 10 kb),
the size is determined by PFGE
Digestion of BACs with NotI
HindIII fingerprints of human BACs and PACs
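A fingerprint is the list of fragment sizes produced by a restriction digest, so it can be simulated in silico. The Python sketch below cuts a linear sequence with NotI (GC^GGCCGC) or HindIII (A^AGCTT) and returns the fragment lengths. The test sequence is a made-up toy, the function names are hypothetical, and the expected fragment size assumes a random sequence with equal base frequencies, which real genomes only approximate.

```python
# Recognition site and cut position within the site (top strand)
ENZYMES = {
    "NotI": ("GCGGCCGC", 2),   # rare cutter, 8-bp site
    "HindIII": ("AAGCTT", 1),  # 6-bp site
}


def digest(seq: str, enzyme: str) -> list:
    """Return the fragment lengths after complete digestion of a linear sequence."""
    site, cut_offset = ENZYMES[enzyme]
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def expected_fragment_size(enzyme: str) -> int:
    """Average spacing of the site in a random sequence with 25% of each base."""
    site, _ = ENZYMES[enzyme]
    return 4 ** len(site)


# Toy insert (invented): two HindIII sites and one NotI site
toy = "TTTT" + "AAGCTT" + "TTTT" + "GCGGCCGC" + "TTTT" + "AAGCTT" + "TTTT"
for name in ENZYMES:
    print(name, digest(toy, name), "expected spacing:", expected_fragment_size(name), "bp")
```

The expected spacing illustrates the remark on target size and resolution: an 8-bp site occurs far less often than a 6-bp site, so NotI yields a few large fragments (sized by PFGE), whereas HindIII yields many smaller ones and a more detailed fingerprint.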
Raw sequences:
5’ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC
5’AACTTAACTTCCGGCCACTTGAATGCTGGT
5’CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT
5’CCGGCCACTTGAATGCTGGTAGA
5’CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG
5’AACTTAACTTCCGGCCACTT
5’CTGATAGATAACTTAACTTCCGGCCACTT
5’GATAACTTAACTTCCGGCCACTTGAATGCT
5’ATGTCTGCTGCTGCTGATAGATAACTT
5’GCCACTTGAATGCTGGTAGA
5’ACTTAACTTCCGGCCACTTGAATGCTGG
5’TGCTGCTGATAGATAACTTAACTTCCGG

After assembly…
5’ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC
                      5’AACTTAACTTCCGGCCACTTGAATGCTGGT
          5’CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT
                                5’CCGGCCACTTGAATGCTGGTAGA
             5’CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG
                      5’AACTTAACTTCCGGCCACTT
             5’CTGATAGATAACTTAACTTCCGGCCACTT
                   5’GATAACTTAACTTCCGGCCACTTGAATGCT
5’ATGTCTGCTGCTGCTGATAGATAACTT
                                   5’GCCACTTGAATGCTGGTAGA
                       5’ACTTAACTTCCGGCCACTTGAATGCTGG
        5’TGCTGCTGATAGATAACTTAACTTCCGG
consensus sequence
5’ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGGTAGA
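The relationship between the raw reads and the consensus can be checked with a small Python sketch. It is not an assembler: it takes the consensus from the slide as given, places each read by exact string matching, prints the layout and computes the sequencing depth at each position. Real assemblers must in addition find the overlaps themselves and tolerate sequencing errors; the function names here are hypothetical.

```python
CONSENSUS = "ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGGTAGA"

READS = [
    "ATGTCTGCTGCTGCTGATAGATAACTTAACTTCCGGCCAC",
    "AACTTAACTTCCGGCCACTTGAATGCTGGT",
    "CTGCTGATAGATAACTTAACTTCCGGCCACTTGAAT",
    "CCGGCCACTTGAATGCTGGTAGA",
    "CTGATAGATAACTTAACTTCCGGCCACTTGAATGCTGG",
    "AACTTAACTTCCGGCCACTT",
    "CTGATAGATAACTTAACTTCCGGCCACTT",
    "GATAACTTAACTTCCGGCCACTTGAATGCT",
    "ATGTCTGCTGCTGCTGATAGATAACTT",
    "GCCACTTGAATGCTGGTAGA",
    "ACTTAACTTCCGGCCACTTGAATGCTGG",
    "TGCTGCTGATAGATAACTTAACTTCCGG",
]


def layout(reads, consensus):
    """Return (offset, read) pairs; offset is -1 if the read has no exact match."""
    return [(consensus.find(read), read) for read in reads]


def depth(placed, length):
    """Number of reads covering each position of the consensus."""
    cov = [0] * length
    for offset, read in placed:
        if offset < 0:
            continue  # unplaced read, ignored
        for i in range(offset, offset + len(read)):
            cov[i] += 1
    return cov


placed = layout(READS, CONSENSUS)
for offset, read in sorted(placed):
    print(" " * offset + read)  # reads indented to their position
print(CONSENSUS)

cov = depth(placed, len(CONSENSUS))
print("min depth:", min(cov), "max depth:", max(cov))
```

The depth profile makes the notion of redundancy concrete: the ends of a contig are covered by few reads, while the middle is covered many times, which is what allows errors in individual reads to be outvoted in the consensus.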
C-J Rubin et al. Nature 000, 1-7 (2010) doi:10.1038/nature08832
Chicken lines resequenced.
Resequencing