introduction to comparative genomics

Center for Biological Sequence Analysis, The Technical University of Denmark, DTU. Introduction to Comparative Genomics. Dave Ussery, Biological Sequence Analysis, DTU course # 27803, 15 April, 2005.

Page 1: Introduction to Comparative Genomics

Center for Biological Sequence Analysis, The Technical University of Denmark, DTU

Introduction to Comparative Genomics

Dave Ussery
Biological Sequence Analysis
DTU course # 27803
15 April, 2005

Page 2: Introduction to Comparative Genomics

Page 3: Introduction to Comparative Genomics

Outline

• Introduction
• Comparison of Genome Sizes
• The Human Genome Project
• DNA repeats
• A few words about speed of sequencing

Page 4: Introduction to Comparative Genomics

What is a genome?

Page 5: Introduction to Comparative Genomics

Organism                     # bp               # genes

Φ-X174                       5386               9
Escherichia coli             4,600,000          4288
Saccharomyces cerevisiae     13,000,000         5885
Caenorhabditis elegans       ~100,000,000       ~14,000
Arabidopsis thaliana         ~120,000,000       ~10,000
Drosophila melanogaster      ~180,000,000       ~12,000
Homo sapiens                 ~3,400,000,000     ~25,000
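One way to compare the rows is to divide each genome size by its gene count, which gives an average spacing in bp per gene. The sketch below does only that; the figures are copied from the slide (approximate values entered without the "~"), and the names `GENOMES` and `bp_per_gene` are illustrative, not part of the lecture.

```python
# Genome sizes (bp) and gene counts as listed on the slide.
# Values marked "~" on the slide are approximate.
GENOMES = {
    "Phi-X174": (5_386, 9),
    "Escherichia coli": (4_600_000, 4_288),
    "Saccharomyces cerevisiae": (13_000_000, 5_885),
    "Caenorhabditis elegans": (100_000_000, 14_000),
    "Arabidopsis thaliana": (120_000_000, 10_000),
    "Drosophila melanogaster": (180_000_000, 12_000),
    "Homo sapiens": (3_400_000_000, 25_000),
}


def bp_per_gene(size_bp: int, n_genes: int) -> float:
    """Average number of base pairs per gene (genome size / gene count)."""
    return size_bp / n_genes


if __name__ == "__main__":
    for organism, (size_bp, n_genes) in GENOMES.items():
        spacing = bp_per_gene(size_bp, n_genes)
        print(f"{organism:26s} {spacing:>12,.0f} bp per gene")
```

With the slide's figures the spacing rises from roughly 600 bp per gene in Φ-X174 to 136,000 bp per gene in humans, which anticipates the later slides on non-coding and repeated DNA.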

Page 6: Introduction to Comparative Genomics

ΦX174 gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt gcgacctttc gccatcaact

aacgattctg tcaaaaactg acgcgttgga tgaggagaag tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg tgaaatttct atgaaggatg ttttccgttc 
tggtgattcg tctaagaagt ttaagattgc tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt 
attgataaag ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca
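The sequence above is printed in blocks of ten bases. As a small illustration of a first computational look at such a listing, the sketch below strips the spacing and counts the four bases. Only the first 60 bases of the slide are embedded here to keep the example short; the function names are hypothetical, and the same calls should work on the full listing if it is pasted in.

```python
from collections import Counter


def clean_sequence(raw: str) -> str:
    """Remove spaces and line breaks from a blocked listing and lowercase it."""
    return "".join(raw.split()).lower()


def base_composition(sequence: str) -> Counter:
    """Count how often each base occurs in the sequence."""
    return Counter(sequence)


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C."""
    counts = Counter(sequence)
    return (counts["g"] + counts["c"]) / len(sequence)


# First six blocks (60 bases) of the sequence as printed on the slide.
PHIX_PREFIX = (
    "gagttttatc gcttccatga cgcagaagtt "
    "aacactttcg gatatttctg atgagtcgaa"
)

if __name__ == "__main__":
    seq = clean_sequence(PHIX_PREFIX)
    print(len(seq), dict(base_composition(seq)), gc_fraction(seq))
```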

Page 7: Introduction to Comparative Genomics

Page 8: Introduction to Comparative Genomics

The Human Genome Project

Started ~1985

The U.S. government agreed to invest $200,000,000 U.S. per year for 20 years.

~3,400,000,000 bp per haploid genome
~6,800,000,000 bp per diploid genome

Reading one base per second, the ~6,800,000,000 bp of a diploid genome would take about 216 years!
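The "216 years" figure can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and a constant rate of one base per second; under those assumptions the slide's number corresponds to 6,800,000,000 bases (two haploid sets), while a single haploid set would take about half as long. The constant and function names are illustrative.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000 s; leap days ignored

HAPLOID_BP = 3_400_000_000       # haploid genome size from the slide
DIPLOID_BP = 2 * HAPLOID_BP      # two haploid sets


def years_to_read(n_bases: int, bases_per_second: float = 1.0) -> float:
    """Years needed to read n_bases at a constant rate."""
    return n_bases / bases_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"haploid: {years_to_read(HAPLOID_BP):.1f} years")
    print(f"diploid: {years_to_read(DIPLOID_BP):.1f} years")
```

The diploid value comes out just under 216 years, and the haploid value just under 108 years.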

Page 9: Introduction to Comparative Genomics

Year    # human genes mapped    # years to sequence human genome

1970    none                    not possible
1980    3                       ~4,000,000 years
1990    12                      ~1000 years
2000    ~25,000                 draft
2005    ~30,000                 new draft + chimp, chicken, dog, mouse, pig, rat...

Page 10: Introduction to Comparative Genomics

Page 11: Introduction to Comparative Genomics

Discussion questions:

Why is the human genome so large?

~3,400,000,000 bp per haploid genome
~6,800,000,000 bp per diploid genome
One base per second = 216 years!
~2% or less codes for proteins.

Why is the human genome not finished yet?

Page 12: Introduction to Comparative Genomics

[Figure: Aristotle's ladder of complexity. Labels: minerals, lower plants, higher plants, jellyfish, insects, fish, reptiles, birds, mammals, HUMANS.]

Page 13: Introduction to Comparative Genomics

Page 14: Introduction to Comparative Genomics

The “C-value paradox”

The genome size of an organism is defined as the amount of haploid DNA in a genomic set (e.g., an egg or sperm nucleus). This is also referred to as the "C-value"; the "C" means "constant" or "characteristic", since the size of a genome is usually constant for a given species.

The large variation in genome sizes, without any apparent relation to an organism's complexity, is called the C-value paradox.

Page 15: Introduction to Comparative Genomics

What does all this DNA do?

[Figure; labels: 90%, 50%, 2%, <1%]

Page 16: Introduction to Comparative Genomics

DNA repeats

The approximate size and characteristics of genomes were characterised in the 1960s, in a classic study of the kinetics of DNA reassociation by Britten and Kohne (1968).

They found that the DNA could be divided into four fractions:

1. foldback DNA
2. highly repetitive DNA
3. middle-repetitive DNA
4. single-copy DNA

The repetitive DNA can either be localised to discrete regions, or dispersed.

Britten, R.J., Kohne, D.E., "Repeated sequences in DNA", Science, 161:529-540 (1968).
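The fractions on this slide are separated by how quickly they reassociate. As a rough illustration that is not taken from the slides, the sketch below uses the ideal second-order rate law commonly written for reassociation, C/C0 = 1 / (1 + k * C0t), where C0t is the product of initial DNA concentration and time. The rate constants are made-up values chosen only to show that a faster-reassociating (more repetitive) fraction reaches the half-way point at a lower C0t; they are not measurements.

```python
def fraction_single_stranded(cot: float, k: float) -> float:
    """Ideal second-order reassociation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k: float) -> float:
    """C0t value at which half of the DNA has reassociated (1 / k)."""
    return 1.0 / k


# Hypothetical rate constants, for illustration only.
HYPOTHETICAL_K = {
    "highly repetitive": 1e3,
    "middle-repetitive": 1e0,
    "single-copy": 1e-3,
}

if __name__ == "__main__":
    for name, k in HYPOTHETICAL_K.items():
        print(f"{name:18s} Cot1/2 = {cot_half(k):g}")
```

Plotting `fraction_single_stranded` against log(C0t) for each constant gives the familiar set of S-shaped curves, shifted along the axis according to how repetitive the fraction is.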

Page 17: Introduction to Comparative Genomics

Highly repetitive DNA

Dispersed - e.g., Alu family
• about 300 bp long
• 500,000 copies in humans
• (about 5% of the human genome)
• dispersed throughout the chromosomes

Localised highly repetitive sequences
• about 2-10 bp long
• present in millions of copies, often in large blocks
• (about 6% of the human genome)
• associated with heterochromatin
• usually very high A+T content
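The "about 5%" figure for the Alu family follows from the other two numbers on the slide. A minimal check, assuming the haploid genome size of ~3,400,000,000 bp quoted earlier in the lecture (the function name is illustrative):

```python
HAPLOID_GENOME_BP = 3_400_000_000  # haploid human genome size quoted earlier


def genome_fraction(unit_length_bp: int, copies: int,
                    genome_bp: int = HAPLOID_GENOME_BP) -> float:
    """Fraction of the genome occupied by `copies` repeats of a given length."""
    return unit_length_bp * copies / genome_bp


if __name__ == "__main__":
    alu = genome_fraction(300, 500_000)
    print(f"Alu family: {alu:.1%} of the haploid genome")
```

300 bp times 500,000 copies is 150,000,000 bp, or about 4.4% of the haploid genome, consistent with the slide's rounded value.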

Page 18: Introduction to Comparative Genomics

Localised repetitive DNA

Often, satellite DNA consists of long tandem arrays of repeated sequences, all localised to one or a few discrete regions in the chromosomes. For example, in the kangaroo rat (Dipodomys ordii), more than 50% of the genome consists of three families of repeated sequences:

(AAG)n, where n = ~2.24 x 10^9
(TTAGGG)n, where n = ~2.2 x 10^9
(ACACAGCGGG)n, where n = ~1.2 x 10^9
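A tandem array such as (TTAGGG)n is simply one motif repeated head to tail, so it can be located with a regular expression. The sketch below finds the longest uninterrupted run of a given motif in a sequence. The toy sequence is invented for the example and the function name is hypothetical; real satellite arrays contain mismatches that an exact-match search like this would split into shorter runs.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> tuple[int, int]:
    """Return (start, copies) of the longest exact tandem run of `motif`.

    Returns (-1, 0) if the motif does not occur at all.
    """
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    best_start, best_copies = -1, 0
    for match in pattern.finditer(sequence):
        copies = (match.end() - match.start()) // len(motif)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Invented toy sequence containing two of the motifs named on the slide.
TOY = "ccgt" + "ttaggg" * 5 + "acgt" + "aag" * 3 + "ttaggg" * 2

if __name__ == "__main__":
    for motif in ("aag", "ttaggg", "acacagcggg"):
        print(motif, longest_tandem_run(TOY, motif))
```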

Page 19: Introduction to Comparative Genomics

Middle repetitive DNA

• makes up more than 40% of the human genome

• position varies due to transposable elements

• Includes the following types of sequences:

- Dinucleotide repeats
- microsatellite DNA
- TRInucleotide repeats
  - associated with many diseases
  - (e.g., Fragile X, muscular dystrophy)

Page 20: Introduction to Comparative Genomics

What about bacteria?

Page 21: Introduction to Comparative Genomics

[Figure: rRNA tree, with HUMANS and maize (corn) marked.]

Page 22: Introduction to Comparative Genomics

[Figure: timeline of life on Earth. Axis: Time (Billions of years ago, Bya), marked from 4.5 Bya to today in steps of 0.5 Bya. Labels: Earth formed (~4.6 Bya); Oldest rocks; Oldest bacteria fossils; Bacterial ecosystems; Methanogenic Archaebacteria; Sulfate reducing bacteria; Oxygen levels rise; Algae; Single-celled eukaryotes; Multi-cellular eukaryotes; Ediacaran fauna; Cambrian; Dinosaurs.]

Page 23: Introduction to Comparative Genomics

[Figure: timeline with tree of life. Axis: Time (Billions of years ago, Bya), marked from 4.5 Bya to today in steps of 0.5 Bya. Labels: Earth formed (~4.6 Bya); Oldest rocks; Oxygen levels rise; Bacteria; Archaea; 4.0 Bya; Eukaryotic microbes; ~2.0 Bya; maize (corn); HUMANS.]

Page 24: Introduction to Comparative Genomics

[Figure: rRNA tree, with HUMANS and maize (corn) marked.]

Page 25: Introduction to Comparative Genomics

Page 26: Introduction to Comparative Genomics

Page 27: Introduction to Comparative Genomics

[Figure: Sequenced Prokaryote Genomes Published. Chart of # genomes published (axis 0 to 200) by year, 1995 to 2004. Series: Bact, Archaea, total.]

Page 28: Introduction to Comparative Genomics

Microbiology, 151:632-636 (March, 2005)

Page 29: Introduction to Comparative Genomics

Organism                                                     Accession              Reference

Bacillus cereus, strain ZK                                   CP000001               GenBank, unpublished
Bacillus licheniformis, strain DSM13                         AE017333               J Mol Microbiol Biotechnol, 7:204–211, 2004
Bacteroides fragilis YCH46                                   AP006841, AP006842     PNAS, 101:14919-14924, (2004).
Borrelia burgdorferi, strains JD1 and N40 (incl. plasmids)   AY696304–AY696571      PNAS, 101:15151-14155, (2004).
Burkholderia mallei, strain ATCC23344                        CP000010, CP000011     PNAS, 101:14246-14251, (2004).
Burkholderia pseudomallei                                    BX571965, BX571966     PNAS, 101:14240-14245, (2004).
Legionella pneumophila, Philadelphia1                        AE017354               Science, 24 September, 2004, 305:1966-1968
Legionella pneumophila Lens                                  CR628337               Nature Genetics (2004), In the press.
Mannheimia succiniciproducens                                AE016827               Nat. Biotechnol. (2004) In the press.
Methanococcus maripaludis S2                                 BX950229               J. Bact. 186:6956-6969, 2004.
Methylococcus capsulatus (Bath)                              AE017282               PLoS Biology, 2:1616-1628, 2004.
Nocardia farcinica IFM10152                                  AP006618–AP006620      PNAS, 101:14925-14930, (2004).
Thalassiosira pseudonana                                     AAFD01000000           Science, 306:79-86, (2004).

Help!

Page 30: Introduction to Comparative Genomics

Kronborg Castle

Page 31: Introduction to Comparative Genomics

Page 32: Introduction to Comparative Genomics

Page 33: Introduction to Comparative Genomics

Page 34: Introduction to Comparative Genomics

What is Bioinformatics?

Bioinformatics is the application of machine learning processes to biological information.

Page 35: Introduction to Comparative Genomics

bioinformatics, n.

The science of information and information flow in biological systems, esp. the use of computational methods in genetics and genomics.

1978 P. HOGEWEG in Simulation 31 90/1 Since 1970 she has been a staff member at the Subfaculty of Biology of the University of Utrecht, with her main field of research in bioinformatics.

1985 Jrnl. Theoret. Biol. 113 719 (heading) Tumor escape from immune elimination... R. J. De Beer, Bioinformatics Group, University of Utrecht.

1986 Philos. Trans. Royal Soc. A. 317 324 The area of modelling mutants from a known structure has been revolutionized by the latest tools of molecular graphics... This is a key element in the whole technology and has attracted much interest (for example, the recent E.E.C. ‘Bioinformatics’ programme).

1987 Science 4 Sept. 1108/3 One of the latest developments [at the European Molecular Biology Laboratory] has been the creation of a new research program in bioinformatics. This is intended to bring together research in computing science, structural biology, and molecular genetics.

1996 Fast Company Aug.-Sept. 32/3 A lot of breakthroughs in medicine will come out of the efforts of bio-informatics.

2001 N.Y. Times 4 Jan. B6/2 The hope..is to make New York a leader in cutting-edge fields like bioinformatics, in which computers are used to decipher genes and proteins.

Page 36: Introduction to Comparative Genomics

Page 37: Introduction to Comparative Genomics

What is Biological Information?

Genome -> Transcriptome -> Proteome
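The flow from genome to transcriptome to proteome can be sketched for a single short stretch of DNA: transcribe it to mRNA, then translate the mRNA with the standard genetic code. The 21-base stretch below is copied from the ΦX174 sequence printed in this transcript; treating it as an in-frame coding region starting at its first base is an assumption made for illustration, and the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: uppercase and replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first base, stopping at a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "atgtctaaaggtaaaaaacgt"   # short stretch from the PhiX174 listing
    mrna = transcribe(dna)
    print(mrna)             # AUGUCUAAAGGUAAAAAACGU
    print(translate(mrna))  # MSKGKKR
```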

Page 38: Introduction to Comparative Genomics

Page 39: Introduction to Comparative Genomics

Page 40: Introduction to Comparative Genomics

The DNA sequence contains information.

But what kind of information?

Page 41: Introduction to Comparative Genomics

ΦX174: the complete genome sequence is shown again on this slide, identical to the sequence printed on Page 6.

Page 42: Introduction to Comparative Genomics

Page 43: Introduction to Comparative Genomics

Page 44: Introduction to Comparative Genomics

Page 45: Introduction to Comparative Genomics

Summary

1. A “genome” is the sum of all the DNA sequences (chromosomes) in an organism.

2. The sizes of genomes range about 100,000,000 fold - from small viruses to very large amoebas.

3. Many eukaryotic genomes contain large fractions of repeated DNA sequences, and there is no apparent correlation between the size of a genome and the organism's biological complexity (the “C-value paradox”).

4. Sequence databases are growing faster than computational power!