Gene discovery using combined signals from genome sequence and natural selection
Michael Brent, Washington University
The mouse genome analysis group
GENSIPS 10/7/2002 2
Genes are read out via mRNA
& processing
RNA Processing
A typical human gene structure
In a mammalian genome
Finding all the genes is hard
• Mammalian genomes are large
  – 5,051 miles of 10pt type
  – Raleigh to Tripoli, Libya
• Only about 1.5% protein coding
  – Raleigh to Winston-Salem
Genes are fairly unconstrained
Intron length is highly variable
• ~5% are 40-100 nt long
• ~3% are longer than 30,000 nt
Distance between genes is highly variable
• From 10^3 to 10^6 nt or more (probably)
Exons per gene (RefSeq)
[Figure: histogram of percent of genes (y-axis, 0% to 12%) by number of exons (x-axis, 1 to 30+); series: RefSeq]
Background is not random
Segmental duplications
• Entire regions duplicate, then diverge slowly
Processed pseudogenes
• Spliced transcripts integrate back into the genome
  – Sequence is similar to source genes
  – Generally not functional
Gene prediction: two approaches
1. Transcript-based (e.g., GeneWise)
   A. Map experimentally determined sequences of spliced transcripts to their genomic source
   B. Map transcript sequences to genomic regions that could produce similar transcripts
2. De novo (genome only)
• Model DNA patterns characteristic of gene components
  – Splice donor and acceptor
  – Protein coding sequence
  – Translation start and stop
Advantages and disadvantages
Transcript-based
• Advantage: conservative
  – Evidence of transcription for every exon
• Disadvantage: conservative
  – Can’t find “truly novel” genes
• Still subject to error
Advantages and disadvantages
De novo
• Advantage 1: Less biased toward
  – Known transcripts
  – Transcripts that can be sequenced easily
• Advantage 2: Genome sequencing is easy
• Disadvantages
  – No direct evidence of transcription
  – Presumably, more false positives
Single-genome de novo: Genscan
Strengths
• For mammalian sequence, one of the best single-genome, de novo gene predictors
• Widely used to great practical advantage
• De facto standard for mammalian sequence
Limitations
• Predicts >45K genes (best est.: 25-30K)
• Predicts >315K exons (best est.: 200K-250K)
• Gets only 9% of known genes exactly right*
Dual genome de novo
We developed algorithms that use two genomes to
• Reduce the number of false positives
• Refine the details of the structures
Single-genome de novo method
Probability model
• Assigns probability to annotated DNA sequences:
5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
[Figure: annotation labels Exon, 5’ UTR, Intron]
Optimization algorithm
• Given a DNA sequence, find the most probable annotation, according to the model
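The optimization step (find the most probable annotation under the model) can be sketched with the Viterbi algorithm. This is a deliberately simplified illustration on an ordinary HMM with three states. The state names, transition table and emission probabilities are invented toy values, not Genscan parameters, and Genscan itself uses a generalized HMM with many more states.

```python
import math

# Toy model: every number below is an illustrative assumption, not a Genscan value.
STATES = ["utr5", "exon", "intron"]
START = {"utr5": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "utr5":   {"utr5": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":   {"utr5": 0.00, "exon": 0.90, "intron": 0.10},
    "intron": {"utr5": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "utr5":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":   {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def joint_log_prob(seq, path):
    """log P(seq, path): probability the model assigns to an annotated sequence."""
    total = log(START[path[0]]) + log(EMIT[path[0]][seq[0]])
    for i in range(1, len(seq)):
        total += log(TRANS[path[i - 1]][path[i]]) + log(EMIT[path[i]][seq[i]])
    return total


def viterbi(seq):
    """Return (best_log_prob, best_path): the most probable annotation of seq."""
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r][s]))
            score[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


best, path = viterbi("TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT")
print(best, path)
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input.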
Genscan’s generative model
CCATGGCGTCTTCAGGCAGTGACTC
[Figure: annotation labels Intron, Exon, Intron]
Genscan’s generative model
Generalized HMM
• States correspond to gene features
• Model generates DNA sequence by passing through states
• The probability of annotated DNA sequence is the probability of
  – generating the DNA sequence
  – by passing through states corresponding to the annotation.
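As a rough sketch of how a generalized HMM scores an annotated sequence: each state emits a whole segment, so the probability of a parse is the product of state transitions, segment-length probabilities and segment-content probabilities. Everything below is a toy assumption made for illustration (two states, geometric lengths, independent bases, made-up numbers); it is not Genscan's actual state set or parameterization, and the segment boundaries in the example are chosen arbitrarily.

```python
import math

# Hypothetical toy parameters; a real gene finder estimates these from data.
START = {"intron": 0.5, "exon": 0.5}
TRANS = {"intron": {"exon": 1.0}, "exon": {"intron": 1.0}}
MEAN_LEN = {"intron": 20.0, "exon": 8.0}
BASE_PROB = {
    "intron": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":   {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}


def length_log_prob(state, n):
    """Geometric length distribution with the state's mean length (n >= 1)."""
    p = 1.0 / MEAN_LEN[state]
    return (n - 1) * math.log(1.0 - p) + math.log(p)


def segment_log_prob(state, segment):
    """Content model: bases drawn independently from the state's base frequencies."""
    return sum(math.log(BASE_PROB[state][b]) for b in segment)


def parse_log_prob(parse):
    """log probability of generating the segments by visiting the states in order.

    parse is a list of (state, segment) pairs, i.e. an annotated DNA sequence.
    """
    total = math.log(START[parse[0][0]])
    prev = None
    for state, segment in parse:
        if prev is not None:
            total += math.log(TRANS[prev][state])
        total += length_log_prob(state, len(segment))
        total += segment_log_prob(state, segment)
        prev = state
    return total


# The slide's example sequence, split at boundaries chosen only for illustration.
annotated = [("intron", "CCATGGCGTCTTCAG"), ("exon", "GCA"), ("intron", "GTGACTC")]
print(parse_log_prob(annotated))
```

Comparing `parse_log_prob` across alternative parses of the same sequence is what a decoder does implicitly when it searches for the most probable annotation.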
Dual genome prediction
Input
• Target and informant genomes
Idea
• Patterns of evolution since the last common ancestor may reveal gene structure
Two conservation signals
1. Local alignment signal
• Selective pressures differ by feature
• This leaves a characteristic signature
2. Structural signal
• Locations of introns tend to be conserved
Characteristic local alignments
Coding exon
human TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
      |||||||||||||||||||| || ||||| || || |||
mouse TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
Intron (non-coding)
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||     ||  | |||||||||  || ||   ||
mouse CTAGAGC----AAGAAGACAGGTACCATAGGGCTCTCCT
Conservation of intron location
Align→predict→filter→test
• Align: WU-BLAST
• Predict: TWINSCAN
• Filter: Aligned Intron Filter
• Test: Validation (RT-PCR)
TTATCCACCAGACCAGATAGATACTTGTCTGCCACCCTC
TTATCCACCAGACCAGATAGGTATTTGTCAGCTACTCTC
TCTGCCACC
|| || ||
TCAGCTACT
TWINSCAN
Representation change
TCTGCCACC
|| || ||
TCAGCTACT
Conservation sequence
TCTGCCACC
||:||:||
gHMM decoding
BLAST Alignments
[Figure labels: Target, Informant]
Projecting BLAST Alignments
[Figure in four steps; labels: Target, Informant]
Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         | |||||||||  || ||   ||
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
• “|” if it is aligned and identical
Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||         |:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
Conservation sequence
Synthetic (projected) local alignment
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
mouse CTAGAG         AGACAGGTACCATAGGGCTCTCCT
Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned
Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGC---CCC
      ||||||.........|:|||||||||::||:||   ||:
Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned
Conservation sequence
human CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC
      ||||||.........|:|||||||||::||:||||:
Pair each nucleotide of the target with
• “|” if it is aligned and identical
• “:” if it is aligned to mismatch or gap
• “.” if it is unaligned
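The three pairing rules can be written down directly. The sketch below is an illustration, not Twinscan's code: the function name and the `(start, target_row, informant_row)` input format are assumptions made for this example, and choosing the best alignment where several overlap is left out. The input is the human/mouse example from the slides.

```python
def conservation_sequence(target, alignments):
    """Build the conservation sequence for an ungapped target sequence.

    alignments: list of (start, target_row, informant_row). start is the
    0-based target position of the first aligned target base; the two rows
    are equal-length gapped strings in which "-" marks a gap.
    """
    cons = ["."] * len(target)  # default: unaligned
    for start, target_row, informant_row in alignments:
        pos = start
        for t, i in zip(target_row, informant_row):
            if t == "-":
                continue  # gap in the target: there is no target base to label
            # identical base -> "|"; mismatch or gap in the informant -> ":"
            cons[pos] = "|" if t == i else ":"
            pos += 1
    return "".join(cons)


human = "CTAGAGATGCAAAAGAAACAGGTACCGCAGTGCCCC"
projected = [
    (0, "CTAGAG", "CTAGAG"),
    (15, "AAACAGGTACCGCAGTGC---CCC", "AGACAGGTACCATAGGGCTCTCCT"),
]
print(human)
print(conservation_sequence(human, projected))
# prints ||||||.........|:|||||||||::||:||||: on the second line
```

The result has exactly one symbol per target nucleotide, so it can be read alongside the DNA sequence position by position.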
Twinscan: Extending the model
Probability model
• Assigns probability to annotated DNA:
5’TAGCCTACTGAAATGGACCGCTTCAGCGTGGTAT3’
  |||........|:||||:|||||||||:||::||
[Figure: annotation labels Exon, 5’ UTR, Intron]
Optimization
• Given DNA and conservation sequence, find the most probable annotation, according to the model
Twinscan
• Each state “generates” DNA and conservation sequence independently
• Probability of annotated DNA and conservation sequence is the probability of generating the DNA and conservation sequence by passing through the corresponding states
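Because each state generates the two sequences independently, the emission probability factors into a DNA term and a conservation term, and the log probabilities simply add. A minimal sketch under stated assumptions: the two states, the per-position independence and all table values are hypothetical toy choices for illustration, and a real model would likely use richer, context-dependent emissions.

```python
import math

# Hypothetical per-state emission tables (toy values for illustration only).
DNA_EMIT = {
    "exon":   {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}
CONS_EMIT = {
    "exon":   {"|": 0.80, ":": 0.15, ".": 0.05},
    "intron": {"|": 0.35, ":": 0.25, ".": 0.40},
}


def joint_emission_log_prob(state, dna, cons):
    """log P(dna, cons | state), assuming the two are generated independently."""
    if len(dna) != len(cons):
        raise ValueError("DNA and conservation sequence must have equal length")
    dna_lp = sum(math.log(DNA_EMIT[state][b]) for b in dna)
    cons_lp = sum(math.log(CONS_EMIT[state][c]) for c in cons)
    return dna_lp + cons_lp  # independence: the probabilities multiply


segment, conservation = "GACCGCTT", "||||:|||"
for state in ("exon", "intron"):
    print(state, joint_emission_log_prob(state, segment, conservation))
```

With these toy tables a well-conserved segment scores higher under the exon state than under the intron state, which is the kind of signal the conservation sequence is meant to contribute.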
Performance Evaluation
RefSeq
• A set of ~13,000 “known” mRNAs
• Represents ~40-50% of human genes
  – Usually, only one of several splices
• Mapping to genome is imperfect
• Best available gold standard
[Figure: histogram of percent of genes (y-axis, 0% to 16%) by number of exons (x-axis, 1 to 30+); series: RefSeq, Twinscan]
Short term goal
All multi-exon human genes
• Predict accurately
  – Integrate information from more genomes
• Verify at least one intron experimentally
• Follow up with full-length verification
Acknowledgments
Funding agencies
• National Institutes of Health (NHGRI)
• National Science Foundation (DBI)
Sequencing centers
• Sanger, Whitehead, Wash. U.
My group
• Ian Korf, Paul Flicek, Evan Keibler, Ping Hu
Collaborators
• Roderic Guigo, Josep Abril, Genis Parra
  – Pankaj Agarwal
• Stylianos Antonarakis, Alexandre Reymond, Manolis Dermitzakis
Other clades
Plants
• Arabidopsis thaliana, cabbage, rice
Nematodes
• C. elegans, C. briggsae
Fungi
• Cryptococcus neoformans (JEC21, H99)
Pair HMM algorithms (SLAM,…)
• Input is orthologous sequences.
• Aligns and predicts simultaneously, using a joint probability model
• Predicts orthologous genes in 2 sequences
• All predicted CDS is aligned
• Some aligned regions are not predicted CDS
  – Labeled conserved non-coding sequence
The algorithms (SLAM,…)
sgp2
• Alignment before prediction (tblastx)
• Predicts genes in target sequence only
• Don’t need orthologous input sequences
  – Paralogs & low-coverage shotgun can help
• Modifies scores of all potential exons by
  – At each base, adding the tblastx score of the best overlapping local alignment (roughly)
  – To the geneid score of that potential exon
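The rescoring idea can be sketched as follows. This is only a rough reading of the bullets above, not sgp2's implementation: the function names, the `(start, end, score)` tuples, the treatment of each HSP score as a per-base value and the `weight` parameter are all assumptions introduced for illustration.

```python
def best_hsp_score_per_base(length, hsps):
    """For each target position, the score of the best overlapping HSP (0 if none).

    hsps: list of (start, end, score) in 0-based, end-exclusive target
    coordinates; score is treated here as a per-base value (an assumption).
    """
    best = [0.0] * length
    for start, end, score in hsps:
        for pos in range(max(start, 0), min(end, length)):
            best[pos] = max(best[pos], score)
    return best


def rescored_exons(exons, hsps, length, weight=1.0):
    """Add per-base alignment support to each candidate exon's base score.

    exons: list of (start, end, base_score), where base_score stands in for
    the single-genome predictor's score. weight is a hypothetical scaling knob.
    """
    support = best_hsp_score_per_base(length, hsps)
    return [
        (start, end, base_score + weight * sum(support[start:end]))
        for start, end, base_score in exons
    ]


hsps = [(2, 8, 1.5), (5, 12, 2.0)]          # two overlapping local alignments
exons = [(4, 10, 3.0), (14, 18, 3.0)]       # one supported exon, one unsupported
print(rescored_exons(exons, hsps, length=20))
```

Candidate exons that overlap strong alignments gain score, while exons with no alignment support keep their original score.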
The algorithms
TWINSCAN
• Alignment before prediction (blastn)
• Predicts in target sequence only
• Modifies scores of all potential exons, UTRs, splice sites, start and stop models by
  – At each base, applying a feature-specific scoring model (estimated for this purpose)
  – To the best overlapping local alignment, and adding the result
  – To the Genscan score for that feature
% Aligned, CDS vs. other
[Figure: bar chart of fraction aligned (y-axis, 0 to 1) for Coding and Non-coding sequence; series: Genscan, sgp2, Twinscan, Sanger]
Syntenic Gene Prediction (sgp2)
[Figure: flow diagram with the labels Query Sequence, tblastx HSPs, geneid Exons, HSPs Projections, SGP Exons]
Why work on gene finding?
Genes are
• Components responsible for biological function
• Variations cause human disease / susceptibility
• Controls for modifying biological function
  – Human gene therapy
  – Agriculture
  – Nanotechnology, etc.