sequence analysis methods for comparative genomics
DESCRIPTION
Sequence Analysis Methods for comparative Genomics. Philipp Bucher. Comparative Genomics SIB PhD program Lausanne Feb 12-16 2007. Topics covered. Alignment methods for genomic DNA sequences Comparative gene prediction Comparative motif finding. Genome sequence comparison - Challenges. - PowerPoint PPT PresentationTRANSCRIPT
Sequence Analysis Methods for comparative Genomics
Comparative Genomics
SIB PhD program
Lausanne Feb 12-16 2007
Philipp Bucher
Topics covered
Alignment methods for genomic DNA sequences
Comparative gene prediction
Comparative motif finding
Genome sequence comparison - Challenges
Very large sequences – up to several 100 million bp per chromosome
Only a small percentage of the genomes may be alignable.
Chromosomal rearrangements:– Insertions
– Translocations
– Duplications
Repetitive elements:– simple repeats
– lineage specific transposons
– ancient transposons
– Processed pseudogenes
Nothing in biology makes sense except in the light of evolutionTh. Dobzhansky
In fact, many contemporary biologists don’t care about evolution but are more interested in molecular function or drug discovery.
To those, comparative genomics is an extremely powerful genome annotation technique.
Genome sequence comparison may helpful in:
the assembly of a new genomes
predicting gene structures
the identification of non-coding regulatory elements
In the recognition of orthologs
to discarding sequence regions not of interest in a particular context
for structural and functional inferences from substitution patterns
Sequence comparison tools overview
Algorithm, program Type Speed Sensitivity
dotter Dot matrix − ++
Lbdot Dot matrix + +
Smith-Waterman Alignment − − +++
blastp Alignment ++ +++
Blastn Alignment ++ −
Megablast, SASHA alignment +++ − −
Avid, Mummer, Lagan alignment + +
Blastz alignment ++ +
BLAST heuristics for protein sequence similarity searches
1. Query compilation: For each position of the query sequence, compile a list of 3-letter words which match the query with score threshold T (score depends on substitution matrix).
2. Word search: Search for two non-overlapping word matches on the same diagonal of the path matrix within a distance A.
3. Un-gapped match extension: Extend the un-gapped alignment along the diagonal in both directions unless the score drops more than X below the maximal score yet attained.
4. Gapped alignment extension: For un-gapped alignments (segment pairs) exceeding a threshold score Sg , trigger a regional gapped alignment until the score drops more than Xg below the maximal score yet attained.
5. Statistical evaluation of gapped alignments: Blast uses hard-wired extreme value distribution parameters λ and K for specific scoring systems (substitution matrix + gap penalties)
Examples of word match scores with Blosum62 matrix
WWW -> 33WWW
YYY -> 21 YYY
GGG -> 18GGG
QQQ -> 15QQQ
SSS -> 13SSS
HSW -> 20HAW
LFY -> 12 IYY
EWD -> 15DWE
KFI -> 8RYV
EWC -> -11FDE
# Matrix made by matblas from blosum62.iij# * column uses minimum score# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units# Blocks Database = /data/blocks_5.0/blocks.dat# Cluster Percentage: >= 62# Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X *A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Blast: Principle of two-hit heuristic
Blast: Gap extension.
a: Area considered by gap extension algorithm. Computation starts at the center of the highest scoring segment pair obtained after word-hit extension. b: Path of optimal local alignment found. c: Optimal local alingment.
Why is blastn less sensitive than blastp ?
blastp blastn
Word hits length 3, inexact, based on substitution matrix, two hits
length 11 (default), exact, one hit
Scoring system: Blosum or Pam substitution matrix, affine gap penalties
Identity-difference (default 1, −3), affine gap penalties
Statistics: Karlin-Altschul for gapped alignments
Karlin-Altschul for ungapped alignments (→ non-conservative estimates for gapped alignments )
Blastz overview:Searches for multiple gapped local alignments
– ordering constraints may apply Same 3-step procedure as blast:
– word hit, ungapped extension, gapped extension Word hits:
– exact (default 8) or patterned, e.g. 1110100110010101111 (1 indicates relevant positions).
– optionally transition-tolerant (A:G, C:T count as matches)– two hits strategy like blastp
Species-pair specific base substitution matrixAlignment scores dynamically adjusted for local complexityBetween adjacent alignments: search repeated with more sensitive parametersOutput:
– Coordinates of local alignments and scores– For each alignment, coordinates and scores of HSPs (high-scoring
segment pairs)
Comparative gene prediction:
Categories of gene prediction programs:– Single-genome ab initio (classical gene prediction problem)– Dual-genome ab initio (comparative gene prediction)– EST-, mRNA, and protein-based methods (same or related species)– Anything goes (e.g. ENSEMBL gene annotation pipeline)
Why should comparative gene prediction be more effective:– Consensus of two predictions more reliable than one – Coding regions more conserved than non-coding regions (??)– Coding regions have different substitution and indel patterns (frame
conservation)– Splice junctions need to be conserved as well
Some limitations:– Genomes need to be reasonably diverse– Gene structures may not be conserved
Gene prediction: statement of the problem
Given a genome sequence:
a) predict the structure of all transcripts
b) Predict the structure of the coding part of all transcripts.
Performance criteria:
a) # of correctly predicted/missed genes
b) # of correctly predicted/missed exons
c) # of correctly predicted/missed coding nucleotides
Further complications:
• It is not known in advance how many genes a sequence contains
• The sequence may start or end in the middle of a gene
• Alternative splicing
GENSCAN, and example of an ab initio gene finding algorithm
Principle: GENSCAN finds the optimal “parse” of a sequence
A “parse” is a succession of:• intergenic regions• 5’UTR (untranslated regions)• Exons• Introns• 3’UTRs
Evaluation of alternative parses with the aid of:• Weight matrices or similar models for sites: promoters, translation starts,
splice donors and acceptors, translation stops, polyadenylation sites.• Interpolated Markov chains (3-periodic HMMs), and length distributions
for exons, introns, 5’ and 3’ UTRs, and intergenic regions.
GENESCAN model (variant HMM)
E0+ = exon + plus starting in
phase 0I0
+ = exon on + starting in phase 1
All named elements except promoters are scored by Markov chains or interpolated Markov chains.
Transitions and promoters are scored by weight matrices or similar descriptors
Gene Prediction: The Intron Phase Problem
ATGGGTCCAgttggtgcccttagGTGTTCGAgtgagccacagCACCTGGAAGMetGlyPro..............ValTyrAs...........pThrTrpLys
EInit I0+ E0
+ I2+ E2
+
Introns can be inserted at three different codon positions:
Introns inserted between codons have phase 0Introns inserted after the first codon base have phase 1Introns inserted after the second codon position have phase 2Exons starting between codons have phase 0Exons starting after the first codon base have phase 1Exons starting after the second codon base have phase 2
Important rule: An intron of a given phase must be followed by an exon of the same phase. Otherwise, the protein will not be translated in the correct frame.
Scoring target Scoring method:
F+ 5’UTR simple Markov chain F+ → Einit
+ Kozak consensus weight matrixEinit First exon interpolated Markov chainEinit → I1
+ Splice donor weigh matrixI1
+ Intron phase 1 interpolated Markov chainI1
+ → E1+ Splice acceptor weigh matrix - like
Example of a dual gene prediction program: SLAM
Input:– two sequences
– Approximate alignments generated with Avid or Mummer
Coupled gene prediction and alignment optimization
Based on a generalized pair HMM (GPHMM)– gene model analogous to GENSCAN
– arbitrary length distributions (two sets of durations instead of one)
– exon model based on codon pairs in different alignment configurations (new!)
– Splice junctions modeled by VLMMs (non-stationary variable-length Markov models)
Models individually trained for different species
Other examples of dual gene prediction programs:– SGP2
– TWINSCAN
Evaluation of gene predictions(from Guigo et al. 2006, Genome Biol. 7:S2)
Evaluation at gene level
Gene transcript evaluation. Computing sensitivity and specificity at transcript level: (a) complete transcript annotation; (b) incomplete transcript annotation. Transcripts marked with an asterisk are considered 'consistent with the annotation' and will be scored as correct.
Performance Evaluation of gene prediction programs based on experimentally characterized EGASP regions
Computational Approaches to Gene Regulatory elements
• Finding a common sequence motif in a set of sequences known to contain a binding site to the same transcription factor or to confer the same regulatory property to an adjacent gene. Bottleneck: appropriate data sets, computer algorithms. New perspective: ChIP-chip data.
• Identification of sequence motifs that are over-represented at a particular distance from transcription initiation sites. Bottleneck: large sets of experimentally mapped promoters. Hope comes from mass genome annotation (MGA) data (CAGE, etc.).
• Identification of transcription factor binding sites and other sequence elements in DNA regulatory sequences. Bottleneck: accurate and reliable motif descriptions.
• Development of promoter prediction algorithms. Current bottleneck: large sets of experimentally mapped promoters. Perspective: MGA data.
• Identification and interpretation of conserved non-coding sequence regions between orthologous genes of related organisms. Current bottleneck: concepts and models of gene regulatory regions.
Commonly addressed problems and corresponding bottlenecks:
Comparative genomics may help solving these problems
Transcription Factor Binding Sites: Features and Facts
Degenerate sequence motifs
Typical length: 6-20 bp
Low information content: 8-12 bits (1 site per 250-4000 bp)
Quantitative recognition mechanism: measurable affinity of different sites may vary over three orders of magnitude
Regulatory function often depends on cooperative interactions with neighboring sites
Limitations of Motif Discovery Algorithms
A recent paper show that computational motif discovery is disappointingly ineffective:
Bad Performance of Motif Discovery algorithms on Eukaryotic Benchmark Data Sets (Results from Tompa et. al. 2005)
Similar results are obtained with prokaryotic benchmark datasets
Comparative Motif Discovery: Problem statement and strategies
Problem 1: Motif discoveryInput: several sets of aligned or unaligned sequences.Output: a set of over-, under-, or class-associated motifsMotifs may be represented as consensus sequences or weight matricesCase study: Xie et al. (2005) mammalian promoters and 3’UTRsProgram example: PhyloGibbs (Siddhartan et al. 2005)
http://www.phylogibbs.unibas.ch/cgi-bin/phylogibbs.pl
Problem 2: Motif searchInput: one set of aligned or unaligned sequences, a library of motif definitions (e.g. TRANSAC matrices)Output: conserved binding site locationsProgram example: ConSite (Sandelin et al. 2004), CisOrtho (Bigelow et al. 2004)
New human promoter motifs discovered by comparative approach
Data from Xie et al. 2005, Nature 434, 338-345