Sequence Analysis Methods for Comparative Genomics

Comparative Genomics

SIB PhD program

Lausanne Feb 12-16 2007

Philipp Bucher

Topics covered

Alignment methods for genomic DNA sequences

Comparative gene prediction

Comparative motif finding

Genome sequence comparison - Challenges

Very large sequences – up to several hundred million bp per chromosome

Only a small percentage of the genomes may be alignable.

Chromosomal rearrangements:

– Insertions

– Translocations

– Duplications

Repetitive elements:

– simple repeats

– lineage-specific transposons

– ancient transposons

– processed pseudogenes

"Nothing in biology makes sense except in the light of evolution" (Th. Dobzhansky)

In fact, many contemporary biologists don’t care about evolution but are more interested in molecular function or drug discovery.

To those, comparative genomics is an extremely powerful genome annotation technique.

Genome sequence comparison may be helpful in:

the assembly of new genomes

predicting gene structures

the identification of non-coding regulatory elements

the recognition of orthologs

discarding sequence regions not of interest in a particular context

structural and functional inferences from substitution patterns

Sequence comparison tools overview

Algorithm, program       Type         Speed   Sensitivity
dotter                   Dot matrix   −       ++
Lbdot                    Dot matrix   +       +
Smith-Waterman           Alignment    − −     +++
blastp                   Alignment    ++      +++
blastn                   Alignment    ++      −
Megablast, SASHA         Alignment    +++     − −
Avid, Mummer, Lagan      Alignment    +       +
Blastz                   Alignment    ++      +

BLAST heuristics for protein sequence similarity searches

1. Query compilation: For each position of the query sequence, compile a list of 3-letter words that score at least T against the query word at that position (the score depends on the substitution matrix).

2. Word search: Search for two non-overlapping word matches on the same diagonal of the path matrix within a distance A.

3. Un-gapped match extension: Extend the un-gapped alignment along the diagonal in both directions until the score drops more than X below the maximal score attained so far.

4. Gapped alignment extension: For un-gapped alignments (segment pairs) exceeding a threshold score Sg, trigger a regional gapped alignment, which is extended until the score drops more than Xg below the maximal score attained so far.

5. Statistical evaluation of gapped alignments: BLAST uses hard-wired extreme value distribution parameters λ and K for specific scoring systems (substitution matrix + gap penalties).
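Steps 2 and 3 can be illustrated with a minimal sketch. It is not the BLAST implementation: it assumes exact word matches instead of a neighborhood list with threshold T, and a toy +1/−1 score instead of a substitution matrix. The function names (`two_hit_seeds`, `xdrop`, `ungapped_hsp`) and the default values of w, A and X are hypothetical.

```python
from collections import defaultdict

def score(a, b, match=1, mismatch=-1):
    """Toy scoring scheme (assumption); BLAST would use a substitution matrix."""
    return match if a == b else mismatch

def two_hit_seeds(query, subject, w=3, A=10):
    """Word hits preceded by a non-overlapping hit on the same diagonal within distance A."""
    index = defaultdict(list)              # word -> start positions in the query
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    last_hit = {}                          # diagonal -> subject position of last kept hit
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            d = j - i
            prev = last_hit.get(d)
            if prev is not None and j - prev < w:
                continue                   # overlaps the previous hit: ignore it
            if prev is not None and j - prev <= A:
                seeds.append((i, j))       # second hit close enough: trigger extension
            last_hit[d] = j
    return seeds

def xdrop(query, subject, i, j, step, X):
    """Extend from (i, j) in direction step (+1 or -1) until the score drops more than X."""
    best = cur = best_len = k = 0
    while 0 <= i + step * k < len(query) and 0 <= j + step * k < len(subject):
        cur += score(query[i + step * k], subject[j + step * k])
        k += 1
        if cur > best:
            best, best_len = cur, k
        elif best - cur > X:
            break
    return best, best_len

def ungapped_hsp(query, subject, i, j, X=3):
    """Return (score, query start, subject start, length) of the ungapped segment pair."""
    r_score, r_len = xdrop(query, subject, i, j, +1, X)
    l_score, l_len = xdrop(query, subject, i - 1, j - 1, -1, X)
    return l_score + r_score, i - l_len, j - l_len, l_len + r_len

query = "MKTAYIAKQRQISFVK"
subject = "GG" + query + "GG"
seeds = two_hit_seeds(query, subject)
print(seeds[0], ungapped_hsp(query, subject, *seeds[0]))
```

In this toy case the first seed lies on the diagonal where the subject contains the whole query, and the extension recovers the full 16-residue match.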

Examples of word match scores with Blosum62 matrix

Word 1   Word 2   Score
WWW      WWW       33
YYY      YYY       21
GGG      GGG       18
QQQ      QQQ       15
SSS      SSS       12
HSW      HAW       20
LFY      IYY       12
EWD      DWE       15
KFI      RYV        8
EWC      FDE      -11
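A word match score is the sum of the substitution scores of the three aligned residue pairs. The sketch below shows the arithmetic; it assumes a hand-copied subset of BLOSUM62 entries (only the pairs needed here), and the names `BLOSUM62_SUBSET`, `pair_score` and `word_score` are hypothetical.

```python
# Subset of BLOSUM62 (half-bit units); only the residue pairs used below.
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("Y", "Y"): 7, ("G", "G"): 6, ("Q", "Q"): 5,
    ("H", "H"): 8, ("A", "S"): 1, ("I", "L"): 2, ("F", "Y"): 3,
    ("D", "E"): 2, ("K", "R"): 2, ("I", "V"): 3,
    ("E", "F"): -3, ("D", "W"): -4, ("C", "E"): -4,
}

def pair_score(a, b):
    """Look up a symmetric substitution score."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]

def word_score(word1, word2):
    """Sum of pairwise scores over the aligned positions of two words."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))

for w1, w2 in [("WWW", "WWW"), ("HSW", "HAW"), ("KFI", "RYV"), ("EWC", "FDE")]:
    print(w1, w2, word_score(w1, w2))
```

For HSW against HAW, for instance, the score is 8 (H:H) + 1 (S:A) + 11 (W:W) = 20.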

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Blast: Principle of two-hit heuristic

Blast: Gap extension.

a: Area considered by gap extension algorithm. Computation starts at the center of the highest scoring segment pair obtained after word-hit extension. b: Path of optimal local alignment found. c: Optimal local alignment.

Why is blastn less sensitive than blastp?

Word hits:

– blastp: length 3, inexact, based on substitution matrix, two hits

– blastn: length 11 (default), exact, one hit

Scoring system:

– blastp: Blosum or Pam substitution matrix, affine gap penalties

– blastn: identity-difference (default 1, −3), affine gap penalties

Statistics:

– blastp: Karlin-Altschul for gapped alignments

– blastn: Karlin-Altschul for ungapped alignments (→ non-conservative estimates for gapped alignments)

Blastz overview:

Searches for multiple gapped local alignments

– ordering constraints may apply

Same 3-step procedure as blast:

– word hit, ungapped extension, gapped extension

Word hits:

– exact (default 8) or patterned, e.g. 1110100110010101111 (1 indicates relevant positions)

– optionally transition-tolerant (A:G, C:T count as matches)

– two hits strategy like blastp

Species-pair specific base substitution matrix

Alignment scores dynamically adjusted for local complexity

Between adjacent alignments: search repeated with more sensitive parameters

Output:

– Coordinates of local alignments and scores

– For each alignment, coordinates and scores of HSPs (high-scoring segment pairs)
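The patterned word hit can be illustrated with a small sketch. It only checks whether two 19-base windows agree at the positions marked 1 in the pattern and is not Blastz code. The function name `seed_hit` and the `max_transitions` parameter (how many transitions are tolerated at relevant positions) are assumptions made for illustration.

```python
SEED = "1110100110010101111"   # pattern from the text: 12 relevant positions out of 19
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def seed_hit(a, b, seed=SEED, max_transitions=0):
    """True if windows a and b match at every '1' position of the seed.

    Up to max_transitions transitions (A:G, C:T) at relevant positions
    are counted as matches (hypothetical parameter).
    """
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("windows must have the same length as the seed")
    transitions = 0
    for x, y, s in zip(a, b, seed):
        if s != "1" or x == y:
            continue                      # ignored position, or identical bases
        if (x, y) in TRANSITIONS:
            transitions += 1
            if transitions > max_transitions:
                return False
        else:
            return False                  # transversion at a relevant position
    return True

a = "ACGTACGTACGTACGTACG"
b_ignored = a[:3] + "A" + a[4:]           # mismatch at a '0' position
b_transition = "G" + a[1:]                # A:G transition at a '1' position
print(seed_hit(a, b_ignored), seed_hit(a, b_transition),
      seed_hit(a, b_transition, max_transitions=1))
```

Mismatches at the 0 positions never prevent a hit, which is why a patterned word can be more sensitive than a contiguous word with the same number of relevant positions.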

Comparative gene prediction:

Categories of gene prediction programs:

– Single-genome ab initio (classical gene prediction problem)

– Dual-genome ab initio (comparative gene prediction)

– EST-, mRNA-, and protein-based methods (same or related species)

– Anything goes (e.g. ENSEMBL gene annotation pipeline)

Why should comparative gene prediction be more effective:

– Consensus of two predictions more reliable than one

– Coding regions more conserved than non-coding regions (??)

– Coding regions have different substitution and indel patterns (frame conservation)

– Splice junctions need to be conserved as well

Some limitations:

– Genomes need to be reasonably diverse

– Gene structures may not be conserved

Gene prediction: statement of the problem

Given a genome sequence:

a) predict the structure of all transcripts

b) predict the structure of the coding part of all transcripts

Performance criteria:

a) # of correctly predicted/missed genes

b) # of correctly predicted/missed exons

c) # of correctly predicted/missed coding nucleotides
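Criteria b) and c) can be computed by simple set operations once annotation and prediction are given as exon coordinates. The sketch below assumes exons are (start, end) pairs with inclusive coordinates on one strand, and that an exon counts as correct only if both boundaries match exactly; the function names are hypothetical.

```python
def coding_positions(exons):
    """Set of all nucleotide positions covered by a list of (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}

def evaluate(annotated, predicted):
    """Count correct and missed exons and coding nucleotides."""
    ann_exons, pred_exons = set(annotated), set(predicted)
    ann_nt, pred_nt = coding_positions(annotated), coding_positions(predicted)
    return {
        "exons_correct": len(ann_exons & pred_exons),   # both boundaries identical
        "exons_missed": len(ann_exons - pred_exons),
        "nt_correct": len(ann_nt & pred_nt),
        "nt_missed": len(ann_nt - pred_nt),
        "nt_wrong": len(pred_nt - ann_nt),              # predicted but not annotated
    }

annotated = [(100, 200), (300, 400)]
predicted = [(100, 200), (310, 400), (500, 520)]
print(evaluate(annotated, predicted))
```

In this toy example the second exon is missed at the exon level because its start is off by ten bases, although most of its nucleotides are predicted correctly. This shows why the three criteria can give quite different pictures of the same prediction.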

Further complications:

• It is not known in advance how many genes a sequence contains

• The sequence may start or end in the middle of a gene

• Alternative splicing

GENSCAN, an example of an ab initio gene finding algorithm

Principle: GENSCAN finds the optimal “parse” of a sequence

A “parse” is a succession of:

• intergenic regions

• 5’UTRs (untranslated regions)

• Exons

• Introns

• 3’UTRs

Evaluation of alternative parses with the aid of:

• Weight matrices or similar models for sites: promoters, translation starts, splice donors and acceptors, translation stops, polyadenylation sites.

• Interpolated Markov chains (3-periodic HMMs), and length distributions for exons, introns, 5’ and 3’ UTRs, and intergenic regions.

GENSCAN model (variant HMM)

E0+ = exon on the + strand starting in phase 0

I0+ = intron on the + strand, phase 0

All named elements except promoters are scored by Markov chains or interpolated Markov chains.

Transitions and promoters are scored by weight matrices or similar descriptors

Gene Prediction: The Intron Phase Problem

ATGGGTCCAgttggtgcccttagGTGTTCGAgtgagccacagCACCTGGAAG
MetGlyPro..............ValPheAs...........pThrTrpLys
EInit    I0+           E0+     I2+        E2+

Introns can be inserted at three different codon positions:

Introns inserted between codons have phase 0

Introns inserted after the first codon base have phase 1

Introns inserted after the second codon position have phase 2

Exons starting between codons have phase 0

Exons starting after the first codon base have phase 1

Exons starting after the second codon base have phase 2

Important rule: An intron of a given phase must be followed by an exon of the same phase. Otherwise, the protein will not be translated in the correct frame.
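The phase of each intron follows directly from the cumulative length of the coding exons before it, taken modulo 3. The sketch below applies this to the example sequence, assuming the convention that exons are written in upper case and introns in lower case. The function names and the reduced codon table (only the codons that occur here) are hypothetical helpers.

```python
import re

# Reduced codon table: only the codons occurring in the example.
CODONS = {"ATG": "M", "GGT": "G", "CCA": "P", "GTG": "V", "TTC": "F",
          "GAC": "D", "ACC": "T", "TGG": "W", "AAG": "K"}

def exons_of(seq):
    """Exons are the upper-case runs of the sequence."""
    return re.findall(r"[A-Z]+", seq)

def intron_phases(exon_lengths):
    """Phase of each intron = number of coding bases before it, modulo 3."""
    phases, total = [], 0
    for length in exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases

def translate(cds):
    """Translate a spliced coding sequence with the reduced table above."""
    return "".join(CODONS[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

seq = "ATGGGTCCAgttggtgcccttagGTGTTCGAgtgagccacagCACCTGGAAG"
exons = exons_of(seq)
print(intron_phases([len(e) for e in exons]))   # phases of the two introns
print(translate("".join(exons)))                # protein after splicing
```

The first intron lies after 9 coding bases (phase 0) and the second after 17 (phase 2), so the codon GAC is split as GA|C across the second intron.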

Scoring target    Element             Scoring method
F+                5’UTR               simple Markov chain
F+ → Einit+       Kozak consensus     weight matrix
Einit             First exon          interpolated Markov chain
Einit → I1+       Splice donor        weight matrix
I1+               Intron phase 1      interpolated Markov chain
I1+ → E1+         Splice acceptor     weight matrix-like

Example of a dual gene prediction program: SLAM

Input:

– two sequences

– approximate alignments generated with Avid or Mummer

Coupled gene prediction and alignment optimization

Based on a generalized pair HMM (GPHMM)

– gene model analogous to GENSCAN

– arbitrary length distributions (two sets of durations instead of one)

– exon model based on codon pairs in different alignment configurations (new!)

– splice junctions modeled by VLMMs (non-stationary variable-length Markov models)

Models individually trained for different species

Other examples of dual gene prediction programs:

– SGP2

– TWINSCAN

Evaluation of gene predictions (from Guigo et al. 2006, Genome Biol. 7:S2)

http://genomebiology.com/2006/7/Suppl+1/S2/figure/F3?highres=y

Evaluation at gene level

Gene transcript evaluation. Computing sensitivity and specificity at transcript level: (a) complete transcript annotation; (b) incomplete transcript annotation. Transcripts marked with an asterisk are considered 'consistent with the annotation' and will be scored as correct.


Performance Evaluation of gene prediction programs based on experimentally characterized EGASP regions


Computational Approaches to Gene Regulatory Elements

Commonly addressed problems and corresponding bottlenecks:

• Finding a common sequence motif in a set of sequences known to contain a binding site to the same transcription factor or to confer the same regulatory property to an adjacent gene. Bottleneck: appropriate data sets, computer algorithms. New perspective: ChIP-chip data.

• Identification of sequence motifs that are over-represented at a particular distance from transcription initiation sites. Bottleneck: large sets of experimentally mapped promoters. Hope comes from mass genome annotation (MGA) data (CAGE, etc.).

• Identification of transcription factor binding sites and other sequence elements in DNA regulatory sequences. Bottleneck: accurate and reliable motif descriptions.

• Development of promoter prediction algorithms. Current bottleneck: large sets of experimentally mapped promoters. Perspective: MGA data.

• Identification and interpretation of conserved non-coding sequence regions between orthologous genes of related organisms. Current bottleneck: concepts and models of gene regulatory regions.

Comparative genomics may help solve these problems.

Transcription Factor Binding Sites: Features and Facts

Degenerate sequence motifs

Typical length: 6-20 bp

Low information content: 8-12 bits (1 site per 250-4000 bp)

Quantitative recognition mechanism: measurable affinity of different sites may vary over three orders of magnitude

Regulatory function often depends on cooperative interactions with neighboring sites
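The link between information content and site frequency can be made explicit with a short calculation. The sketch assumes a uniform background (each base with probability 1/4), so that a column contributes 2 + Σ p·log2(p) bits, and a random sequence in which a motif of IC bits matches about once per 2^IC positions on one strand. The function names are hypothetical.

```python
import math

def information_content(columns):
    """Total information content (bits) of a motif given per-position base frequencies.

    Assumes a uniform background: each column contributes 2 + sum(p * log2(p)).
    """
    total = 0.0
    for column in columns:
        total += 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
    return total

def expected_spacing(bits):
    """Approximate number of random positions per chance match (one strand)."""
    return 2 ** bits

conserved = {"A": 1.0}                      # fully conserved position: 2 bits
two_fold = {"A": 0.5, "G": 0.5}             # two equally likely bases: 1 bit

motif = [conserved] * 3 + [two_fold] * 2    # 3*2 + 2*1 = 8 bits
bits = information_content(motif)
print(bits, expected_spacing(bits))
```

Under these assumptions 8 bits correspond to roughly one chance match per 256 bp and 12 bits to one per 4096 bp, consistent with the range of 250-4000 bp quoted above.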

Limitations of Motif Discovery Algorithms

A recent paper shows that computational motif discovery is disappointingly ineffective:

Bad Performance of Motif Discovery Algorithms on Eukaryotic Benchmark Data Sets (Results from Tompa et al. 2005)

Similar results are obtained with prokaryotic benchmark datasets

Comparative Motif Discovery: Problem statement and strategies

Problem 1: Motif discovery

Input: several sets of aligned or unaligned sequences.

Output: a set of over-represented, under-represented, or class-associated motifs.

Motifs may be represented as consensus sequences or weight matrices.

Case study: Xie et al. (2005) mammalian promoters and 3’UTRs

Program example: PhyloGibbs (Siddharthan et al. 2005)

http://www.phylogibbs.unibas.ch/cgi-bin/phylogibbs.pl

Problem 2: Motif search

Input: one set of aligned or unaligned sequences, a library of motif definitions (e.g. TRANSFAC matrices)

Output: conserved binding site locations

Program examples: ConSite (Sandelin et al. 2004), CisOrtho (Bigelow et al. 2004)

New human promoter motifs discovered by comparative approach

Data from Xie et al. 2005, Nature 434, 338-345
