sequencing - kth · genome assembly assembly quality genes gene finding estscan modern assembly...


Upload: duongdan

Post on 11-Aug-2019


Genome assembly Assembly quality Genes Gene finding ESTScan

Today

• Genome assembly
• Ab initio gene finding

Genome sequencing

• Full overview
• Localizing genes: order, proximity, ...
• Reveal gene regulation
• Understand pseudogenes
• Foundation for further studies

Technologies for whole genome sequencing

• Whole Genome Shotgun (WGS): Dominates today. Read random parts of the genome, then assemble.
  Advantage: Easy to automate, only one type of data.
  Disadvantage: Hard to assemble the pieces!

Technologies for whole genome sequencing

• Compartmental shotgun: Break the genome into chunks, put each chunk in a "BAC" (Bacterial Artificial Chromosome), then do WGS on each BAC.
  Advantage: Easier to assemble. Manages duplications better.
  Disadvantage: More steps, more data: need a physical map.

Mapping a genome

• Physical map: a localization and ordering of markers on a genome
• What markers are on your BACs?
• In what order are they?

Genome assembly

• Create overlap graph for all reads from genome/chromosome/BAC.
• Find path through the graph, as for EST data.
• HUGE computational problem (Dog: 31.5 × 10^6 reads)
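
The two steps (overlap graph, then a path through it) can be sketched on toy data. This is only an illustration under simplifying assumptions, not how a production assembler works: the reads are error-free, all on the same strand, and form a simple chain. The function names `overlap`, `overlap_graph`, and `greedy_path` are made up for this sketch.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len):
    """Edges (a, b) -> overlap length, for every ordered pair of reads that overlaps."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph

def greedy_path(reads, graph):
    """Start at a read with no incoming edge and always follow the largest overlap."""
    targets = {b for (_, b) in graph}
    current = next(r for r in reads if r not in targets)
    contig, seen = current, {current}
    while True:
        edges = [(k, b) for (a, b), k in graph.items() if a == current and b not in seen]
        if not edges:
            return contig
        k, current = max(edges)
        contig += current[k:]   # append only the part that is not already in the contig
        seen.add(current)

reads = ["GCGTACC", "ATGGCGT", "TACCTTA"]
graph = overlap_graph(reads, min_len=3)
print(greedy_path(reads, graph))
```

The all-pairs comparison in `overlap_graph` is quadratic in the number of reads, which is one way to see why tens of millions of reads make this a huge computational problem.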

Assembly problems

• Compute the overlaps (solved)
• What overlap is important?
  • How long?
  • How many changes allowed?
• Repeats are hard
• There are pieces missing...
• There are differences between individuals!

Assembly problem: duplications

Compressions such as this can easily total 1% or more of the genome, and the "orphan" regions can be quite long, 5 000–10 000 bp or more. (Salzberg and Yorke, 2005)

Find duplicated regions by looking at the number of reads covering them!
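
A rough sketch of the read-counting idea: if two copies of a duplicated region were compressed into one, that stretch of the assembly collects roughly twice the usual number of reads. The code below assumes read placements are already known as `(start, length)` pairs; the function names and the threshold `factor=1.8` are arbitrary choices for illustration, not values from the lecture.

```python
def depth_per_position(assembly_len, reads):
    """Count how many reads cover each position; reads are (start, length), 0-based."""
    depth = [0] * assembly_len
    for start, length in reads:
        for i in range(start, min(start + length, assembly_len)):
            depth[i] += 1
    return depth

def suspicious_regions(depth, factor=1.8):
    """Return (start, end) intervals, end-exclusive, with depth above factor * mean depth."""
    mean = sum(depth) / len(depth)
    regions, start = [], None
    for i, d in enumerate(depth + [0]):      # the trailing 0 closes an open region
        if d > factor * mean and start is None:
            start = i
        elif d <= factor * mean and start is not None:
            regions.append((start, i))
            start = None
    return regions

# Two reads tile the assembly once; three extra reads pile up on positions 5..9.
reads = [(0, 10), (10, 10), (5, 5), (5, 5), (5, 5)]
depth = depth_per_position(20, reads)
print(suspicious_regions(depth))
```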

Modern assembly

• Need constraints
• Physical map gives constraints
• Paired-end reads, or mate pairs, give constraints for "real" WGS
• Create clones of a given length (2 kb, 10 kb, and 50 kb), with standard reads from both ends.
• Paired ends give scaffolds, "super-contigs".

Scaffolds from paired-ends

[Figure: with regular reads, the reads form Contig 1 and Contig 2 separately; with paired-ends, read pairs connect Contig 1 and Contig 2.]

A scaffold

CCCAGTCCATTCTCCCACTGATGTCTGTAACTATATTCATCAATCTCTTTGATTTCCAAAGCGCATACCATGGCCTCTGAATGTACTTTGCAGCTTGCCCTTCACACACCCTATACCAATAGTTTTCTGGCTCCTGACCATCAAACTGCCTCCATATGACTGTGCTCTTGTCTTTCCTTAGTTGCATGGGTGTCATCTTATGGGTCACGACCTTCTTAACTGGAACTTTCTTCTTATGGGAGCGATCCCATTTCTTCCAACCTCTCAAAAATTCACCCTCTTCAACAATTGACGCCTCCTCCTTAAGATGCTCAAAATCAAAATAAAACCTAAAATCCTTCCCCTCCTGATCCTTCCCATCCGGATTATAATTACCTGCCAAGCATATACTCAAGTCCATGACAAATGTCCTGTTCAAATACATAGCCTCCCCCAACCCACACAAGAAACTCCACATGTAATGATTCATTCCCTTGCAATAATCCCCTCCTCTTGAATAATACAAGTACTTCCCTTTTCTAAAGTTCGTTTCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCTCTCAAACTATCGTGAGTACGGGATTATATTTCTAAAGTAAGTAAATTATCCAAGATGGGCCGATCCATTACGAAAAGGAAAGATAATCCACTATTATTTCTTAATATAAGGGCAAATTTAAATATATAAAGATGTAATTGTTGTTGGCGAGTGCCTCTCTTGGTTGTTGAGAGTGAAATTGACAGCAAAGTTGTAGATTGTGACAGCCAATGTAACTTATTCACAAATTGGCCTGCCAATGGTACATCATGAATCGCTATGCCACATACTTGATTATACCTTCTAAAGTACCTGTGAATTTTATTTATTTTTTCCATTTTAAAAGTATGTTTTATTTGGAAAAAATATCAAATTATTATTTTACTATTATATTTTAATATATTCTTAAATAAAAAACACTATTAAAAAATATTATGCACCGCAATAATAAACACATATTAAACAAACAGATAATTTTTATATGGCATTTCACTATTGTTGTGGAATAATATCTTT

Assembly quality: C

• Popular measure of quality: coverage
• C = average number of reads of each nucleotide in genome
• "Coverage is 10x": every position in ≈ 10 PCR reads.

Genome                     C
D. melanogaster            ∼ 14x
Human, HGP                 7x
Human, Celera              ∼ 5.5x
Hedgehog, cat, and more    1.8x – 2x
Dog                        8x
Horse                      7.8x

Assembly quality: N50

• "The latest", N50:
  1. Sort all contigs by size
  2. Count from largest contig and downward, stop when half the genome is covered.
  3. Length of last contig is N50.
• Given a 30 Mbp genome with contigs (sizes in Mbp):
  5, 4.5, 4.5, 4, 3.5, 2.6, 2.5, 2.5, 1.7, 0.9, 0.8

5 + 4.5 + 4.5 + 4 = 18

N50 is 4 Mbp, because 18 > 30/2
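
The three steps translate directly into a few lines of code. This is a minimal sketch following the slide's definition (half of the genome size, not half of the total contig length, which is another common convention). The contig sizes are my reading of the numbers in the slide's figure, and the function name `n50` is just a label for this example.

```python
def n50(contig_lengths, genome_size):
    """Length of the contig at which the running total first reaches half the genome."""
    covered = 0.0
    for length in sorted(contig_lengths, reverse=True):   # step 1: sort by size
        covered += length                                  # step 2: count downward
        if covered >= genome_size / 2:
            return length                                  # step 3: last contig counted
    return None  # the contigs cover less than half the genome

contigs_mbp = [4.5, 4.5, 4, 3.5, 5, 2.5, 2.6, 2.5, 0.8, 0.9, 1.7]
print(n50(contigs_mbp, 30))  # 5 + 4.5 + 4.5 + 4 = 18 >= 15, so this prints 4
```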

N50 characteristics

• High N50 ⇒ Good contigs ⇒ Good assembly
• Low N50 ⇒ Many small contigs to cover genome ⇒ Bad assembly
• Note: a completely wrong assembly could have high N50!

"The standard of judging assembly quality by size of contigs is questionable. Large contigs may simply reflect overly aggressive joining of contigs, thereby creating larger contigs with mis-assemblies. As a consequence, genome scientists who are not experts at assembly can be completely misled by statistics about contig sizes, and as a result might prefer the 'larger' but incorrect assembly when given a choice."

Salzberg and Yorke, 2005

Lander-Waterman model

• How many times will a position be sequenced?
• Key assumption: Reads are uniformly distributed
• Coverage is C
• "Let X_i be the number of times position i is sequenced; X_i is Poisson(C) distributed."

  $\Pr(X_i = k) = C^k e^{-C} / k!$

• Special case: Fraction of genome not sequenced:

  $\Pr(X_i = 0) = e^{-C}$

Example on Lander-Waterman

Coverage    Sequenced fraction
2           0.865
3           0.950
4           0.982
5           0.993
6           0.998
7           0.999
10          0.99995
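
The sequenced fractions follow from the Poisson model as 1 − Pr(X_i = 0) = 1 − e^(−C). A small sketch that recomputes them (the function names are chosen for this example only):

```python
import math

def poisson_pmf(k, c):
    """Pr(X_i = k) when X_i is Poisson(C) distributed."""
    return c ** k * math.exp(-c) / math.factorial(k)

def sequenced_fraction(c):
    """Fraction of positions covered by at least one read: 1 - Pr(X_i = 0)."""
    return 1.0 - poisson_pmf(0, c)

for c in (2, 3, 4, 5, 6, 7, 10):
    print(c, round(sequenced_fraction(c), 5))
```

Going from 2x to 5x closes most of the gap; beyond that each extra unit of coverage gains very little sequenced fraction.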

More Lander-Waterman results

Require 0 < θ < 1 overlap to join reads into a contig.

• Expected number of contigs if N reads:

  $N e^{-C(1-\theta)}$

  Dog: 8x, require e.g. 10% overlap, 32 × 10^6 reads: 24 000 contigs

• Expected contig size:

  $L \left( \frac{e^{C(1-\theta)} - 1}{C} + \theta \right)$

  Dog, assume L = 500: contigs are ∼ 83 700 bp
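
The dog numbers can be checked with a short calculation. The sketch below reads the expected contig size formula as L · ((e^(C(1−θ)) − 1)/C + θ), with L the read length; that reading is an interpretation of the slide, but it reproduces the quoted ∼ 83 700 bp. Function names are illustrative.

```python
import math

def expected_contigs(n_reads, c, theta):
    """Expected number of contigs: N * exp(-C * (1 - theta))."""
    return n_reads * math.exp(-c * (1 - theta))

def expected_contig_size(read_len, c, theta):
    """Expected contig length: L * ((exp(C * (1 - theta)) - 1) / C + theta)."""
    return read_len * ((math.exp(c * (1 - theta)) - 1) / c + theta)

# Dog: 8x coverage, 10% required overlap, 32e6 reads, reads assumed to be 500 bp.
print(round(expected_contigs(32e6, 8, 0.1)))      # close to the slide's 24 000
print(round(expected_contig_size(500, 8, 0.1)))   # close to the slide's 83 700
```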

Lander-Waterman and reality

• "For both a simulated unassisted 2x mouse genome assembly (Margulies et al. 2005) and the assisted 1.9x cat genome assembly of Pontius et al. (2007) euchromatic genome coverage by assembled contigs was only 65%, significantly less than the theoretical Poisson expectation (Lander and Waterman 1988) of 85%."

  Green, 2007

• Why the discrepancy?

Gene finding/prediction

• Goal: Find all genes
• Which organism?
• What is a gene?
  • With or without introns?
  • With or without UTR?
  • Count RNA genes?
  • Only the protein coding part?
  • Regulatory elements (TFBS)?
• Here: Find the coding parts, the CDS

History of gene definitions

• 1866: Mendel and heredity
• 1909: The word "gene"
• 1910: Morgan, beads-on-a-string model
• 1941: Gene a blueprint for a protein
• 1950s: A gene is a molecule
• 1960s: A gene is transcribed
• 1970s: A gene is an ORF
• 1990s: One or several regions on the genome

A modern gene definition

"The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products."

Prokaryote gene structure

• First 35 bp promoter
• Then one or more genes, an operon
• No introns
• Densely packed, 80
  "Almost everything is a gene"
• Genes may overlap

Find prokaryote genes

Method 1: Look for ORFs ≥ 300 bp (or so).
• ORF = Open Reading Frame, i.e. a sequence with no stop codon (some demand starting with ATG).

Method 2: Compare with a gene model.
• Larsen & Krogh: "It is a common misconception that identification of genes in prokaryotes is almost trivial."
• The ORF method has many false predictions
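
Method 1 is simple enough to sketch. The code below scans the three forward reading frames for stop-free stretches; it assumes the standard stop codons TAA, TAG, and TGA and ignores the reverse strand, and the name `find_orfs` and its parameters are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300, require_atg=False):
    """Return (start, end, frame) for stop-free stretches of at least min_len bp.

    Coordinates are 0-based and end-exclusive, forward strand only.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        stretch = []  # codon start positions seen since the last stop codon
        for pos in list(range(frame, len(seq) - 2, 3)) + [None]:  # None flushes the end
            if pos is None or seq[pos:pos + 3] in STOP_CODONS:
                if require_atg:
                    # Trim the stretch so that it begins at its first in-frame ATG.
                    atg = [p for p in stretch if seq[p:p + 3] == "ATG"]
                    stretch = [p for p in stretch if atg and p >= atg[0]]
                if stretch and 3 * len(stretch) >= min_len:
                    orfs.append((stretch[0], stretch[-1] + 3, frame))
                stretch = []
            else:
                stretch.append(pos)
    return orfs

toy = "ATGAAACCCGGGTAA"
print(find_orfs(toy, min_len=9))                    # stop-free stretches in all three frames
print(find_orfs(toy, min_len=9, require_atg=True))  # only the stretch starting with ATG
```

On the toy sequence, every frame contains a stop-free stretch, but only one starts with ATG. This is a small-scale view of why the plain ORF method produces many false predictions.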

Larsen & Krogh: EasyGene

• HMM for prokaryote gene finding
• Every box is several model states
• Coding part has several possible paths to handle length variation

Codon model for EasyGene

[Figure: One of three submodels for coding region]

• Finds basically everything. ≈ 5 % false predictions
• Extra: States are of "order 4", i.e. they consider the four previous emissions for the current emission. A standard HMM is of order 0.

Eukaryote gene structure

From Anders Krogh

Eukaryote gene structure: Signals

• Kozak sequence (for vertebrates?) prior to ATG
• Acceptor-donor sites for exon/intron boundaries.
• Poly-A signal, consensus: "AAATAAAA"
• Length distributions
• Codon frequencies

Signal: Kozak sequence

• Kozak-logo for cow

Signal: donor site

Munch & Krogh: Agene

Signal: acceptor site

Munch & Krogh: Agene

Signal: exon lengths

Munch & Krogh: Agene

• Coding-length distribution in fly

Signal: intron lengths

Munch & Krogh: Agene

• Quite strong limitations on intron length in fly

Signal: Genetic code

• Codon frequencies correspond to tRNA access
• Deviate strongly from random nucleotide triples
• What determines access to tRNA?
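
One way to see the deviation from random triples is to compare observed codon frequencies in a coding sequence with what independent base draws would give. The sketch below is an illustration with hypothetical function names and a toy sequence; real comparisons would use many annotated coding sequences.

```python
from collections import Counter

def codon_frequencies(cds):
    """Observed relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}

def random_triple_frequency(cds, triple):
    """Frequency expected for a triple if bases were drawn independently."""
    cds = cds.upper()
    base_counts = Counter(cds)
    p = 1.0
    for base in triple:
        p *= base_counts[base] / len(cds)
    return p

cds = "ATGATGTAA"
observed = codon_frequencies(cds)
print(observed["ATG"], random_triple_frequency(cds, "ATG"))
```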

Robustness with genetic code

A gene model: Agene

Munch & Krogh: Agene

Quality of gene predictions

• Genscan (test fav) suggests 68101 genes in human!
• Error frequency varies by species
• Exons easily missed
• Pseudogenes often misclassified. We have about 5000 pseudogenes.
• Expect to miss almost 50 % of true genes, and find almost 50 % junk.

At best: 90 % sensitivity, 90 % specificity
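
The two statements can be put in the same terms. The sketch below assumes the convention common in gene-finding evaluations, where specificity means the fraction of predictions that are real (TP / (TP + FP)), which is not the textbook TN-based definition. Whether the slide uses exactly this convention is an assumption, and the counts are invented for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true genes that were found."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Gene-finding usage: fraction of predictions that are true genes."""
    return tp / (tp + fp)

# "Miss almost 50 % of true genes, and find almost 50 % junk":
# 100 true genes, 50 found (tp), 50 missed (fn), 50 junk predictions (fp).
print(sensitivity(tp=50, fn=50), specificity(tp=50, fp=50))

# "At best": 90 found, 10 missed, 10 junk predictions.
print(sensitivity(tp=90, fn=10), specificity(tp=90, fp=10))
```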

Species specific models

• Must find training data for new species. Hard!
• Want to use info from close species. Maybe surprisingly bad.
• Most interesting development: Gene finding in two species simultaneously, in orthologous regions.

Alternative transcripts

• How are genes spliced?
• Example: DSCAM in fly, 38 016 possible transcripts (Florea, 2006)

Alternative transcripts

• Hard to do ab initio? Augustus is one initiative
• Use cDNA, EST, other experiments.
• Example: Sequence/hybridize mRNA from possible exon/exon boundaries.

Initiatives in gene finding

ENCODE: Find which parts of the human genome are transcribed

HAVANA: "Human And Vertebrate Analysis aNd Annotation Background"

Both projects are delivering data.

ESTScan

• Gene finding in EST data
• Simplified variant of GenScan: no introns etc
• Tries finding coding region in EST/contig. With error correction, dangerous

My experience of ESTScan

• Testing on "junk sequences": 20 % claimed as coding.
• How to get a good model for chicken?
  • No good training data
  • Try prepared models for other species
  • Human bad, Arabidopsis better
• Hints off the web: Choose species with similar hexamer distribution
• Combine with ORF length analysis
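
The hexamer hint can be tried with a simple comparison of 6-mer frequency profiles. The slide does not say how similarity should be measured; the sketch below uses total variation distance as one possible choice, and both function names are hypothetical.

```python
from collections import Counter

def hexamer_freqs(seqs, k=6):
    """Relative frequency of every overlapping k-mer (default: hexamer) in the sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def distribution_distance(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(p.get(kmer, 0.0) - q.get(kmer, 0.0)) for kmer in set(p) | set(q))

target = hexamer_freqs(["ATGGCGTACC"])       # stand-in for ESTs from the new species
candidate = hexamer_freqs(["ATGGCGTACCTT"])  # stand-in for a species with a prepared model
print(distribution_distance(target, candidate))
```

With real data, one would compute the distance from the target species to each species that has a prepared model and try the closest one first.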