
DNA Properties and Genetic Sequence Alignment

CSE, Marmara University

mimoza.marmara.edu.tr/~m.sakalli/cse546

Oct/19/09

[Figure: related fields: Computational Molecular Biology, Bioinformatics, Genomics, Proteomics, Functional genomics, Structural bioinformatics]



Sequence Alignment and Why

Global Alignment
Local Alignment

Suppose a cloned gene:
- Is it already in databases? Database accession, annotation (summary of structure), expression profile? Mutants?
- Its protein characteristics? Sub-localization, soluble?, 3D fold
- Are there conserved regions? Alignments? Domains?
- Are there similar sequences? % identity? Family member?
- Evolutionary relationship? Phylogenetic tree

Scoring Matrices
Alignment with Affine Gap Penalties
Applying algorithms to analyze genomics data
Applying Manhattan Tourist Problem to sequence comparison

Local vs. Global Alignment

Global alignment

--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

Local alignment: a better alignment for finding a conserved segment

                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

A local alignment is a "mini" global alignment, restricted to one segment of each sequence.

Some genes have only small conserved regions between species.

Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.
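The idea of scoring only the best-matching segment can be sketched with the Smith-Waterman recurrence. This is a minimal sketch, not the lecture's own code: the scoring values (+1 match, -1 mismatch, -1 per gap symbol), the length limit MAXLEN and the function name local_alignment_score are assumptions chosen for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed toy scoring (not from the slides): +1 match, -1 mismatch, -1 per gap symbol. */
enum { MATCH = 1, MISMATCH = -1, GAP = -1, MAXLEN = 256 };

static int max2(int a, int b) { return a > b ? a : b; }

/* Returns the best local alignment score of v and w (Smith-Waterman recurrence).
   Both sequences must be at most MAXLEN symbols long. */
int local_alignment_score(const char *v, const char *w)
{
    static int s[MAXLEN + 1][MAXLEN + 1];   /* s[i][j]: best score of an alignment ending at (i, j) */
    size_t n = strlen(v), m = strlen(w);
    int best = 0;

    assert(n <= MAXLEN && m <= MAXLEN);
    for (size_t i = 0; i <= n; i++) s[i][0] = 0;
    for (size_t j = 0; j <= m; j++) s[0][j] = 0;

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int diag = s[i - 1][j - 1] + (v[i - 1] == w[j - 1] ? MATCH : MISMATCH);
            int up   = s[i - 1][j] + GAP;
            int left = s[i][j - 1] + GAP;
            /* The 0 lets an alignment start anywhere: this is what makes it local. */
            s[i][j] = max2(0, max2(diag, max2(up, left)));
            best = max2(best, s[i][j]);
        }
    }
    return best;
}
```

Because a cell never drops below 0, poorly matching flanks do not drag the score down, so a short conserved segment such as a homeodomain can still stand out.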

Find long ORFs: take the longest ORFs first, put together a set with minimal overlaps, and identify the furthest upstream start codon.

A newly found ORF might not contain a real gene. Compare it with a gene known from other species.

Different DNA sequences can give identical proteins (remember the codon table), so there is a large amount of redundancy in DNA sequence.
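The redundancy can be checked with a small translation routine. This is a sketch under stated assumptions: it uses the standard genetic code written as a 64-character table in T, C, A, G base order, accepts only uppercase DNA whose length is a multiple of 3, and the names CODON_TABLE, base_index and translate are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ...
   (base order T, C, A, G at each position). '*' marks a stop codon. */
static const char CODON_TABLE[] =
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG";

static int base_index(char b)
{
    switch (b) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translates dna into out (which needs room for strlen(dna)/3 + 1 chars).
   Returns the number of residues written, or -1 on invalid input. */
int translate(const char *dna, char *out)
{
    size_t n;

    assert(dna != NULL && out != NULL);
    n = strlen(dna);
    if (n % 3 != 0) return -1;
    for (size_t i = 0; i < n; i += 3) {
        int a = base_index(dna[i]);
        int b = base_index(dna[i + 1]);
        int c = base_index(dna[i + 2]);
        if (a < 0 || b < 0 || c < 0) return -1;
        out[i / 3] = CODON_TABLE[16 * a + 4 * b + c];
    }
    out[n / 3] = '\0';
    return (int)(n / 3);
}
```

For example, CTTAGC and TTGTCA differ at five of six positions, yet both translate to the dipeptide LS.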

Conservation of a sequence between species strongly suggests that the sequence has a function that is being conserved by natural selection.

Protein sequence is more likely to be conserved during evolution than DNA sequence, because the organism's survival depends on the protein being functional.

The protein's 3-dimensional structure is even more conserved, because it is more closely related to enzyme activity than the amino acid sequence is.

Research subject: building the 3-D structure from a DNA sequence.

Sequence Comparison

Compare the ORF sequence to a database of known protein sequences from many species.

BLAST is the standard tool (BLAST = Basic Local Alignment Search Tool).

BLAST is based on the observation that, when you compare the same (that is, homologous) protein from many different species:

- Some amino acids readily substitute for each other, while others are almost unique and hardly ever substitute.

This is captured in a substitution matrix, which gives a score for each amino acid position in the proteins being compared.

These weights are based on biological evidence. Alignments can be thought of as mutations in the sequence. Some of these mutations have little effect on the organism's function, therefore some penalties, δ(v_i, w_j), will be less harsh than others.

Terminology: the query sequence is the sequence entered. Sequences matching it are subject sequences.

Example: a gene from B. meg. It is 174 amino acids long, written in "fasta" format: the entry starts with > immediately followed by an identifier (ORF00135) and then some comments. The sequence follows.

Scoring matrices:
Scoring matrices are created based on biological evidence.
Alignments can be thought of as two sequences that differ due to mutations.
Some of these mutations have little effect on the organism's function, therefore some penalties, δ(v_i, w_j), will be less harsh than others.
Two different amino acids can even have a positive score, e.g. when both are positively charged.

Score S is calculated by summing the scores assigned for matches, mismatches and gaps (creation/extension scores).

The match and mismatch scores are given by the specified substitution matrix.
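The summation can be sketched for an alignment that has already been written out as two gapped rows. The parameter values and the name alignment_score are assumptions for illustration; a plain match/mismatch score stands in for a real substitution matrix such as PAM or BLOSUM.

```c
#include <assert.h>
#include <string.h>

/* Assumed toy parameters (not from the slides). */
enum { MATCH = 2, MISMATCH = -1, GAP_OPEN = -5, GAP_EXTEND = -1 };

/* Scores two equal-length gapped rows ('-' marks a gap symbol).
   The first symbol of a gap costs GAP_OPEN (creation), each further
   symbol of the same gap costs GAP_EXTEND (extension).
   Columns with '-' in both rows are assumed not to occur. */
int alignment_score(const char *row1, const char *row2)
{
    size_t n = strlen(row1);
    int score = 0;
    int in_gap1 = 0, in_gap2 = 0;   /* currently inside a gap in row1 / row2? */

    assert(strlen(row2) == n);
    for (size_t i = 0; i < n; i++) {
        if (row1[i] == '-') {
            score += in_gap1 ? GAP_EXTEND : GAP_OPEN;
            in_gap1 = 1;
            in_gap2 = 0;
        } else if (row2[i] == '-') {
            score += in_gap2 ? GAP_EXTEND : GAP_OPEN;
            in_gap2 = 1;
            in_gap1 = 0;
        } else {
            score += (row1[i] == row2[i]) ? MATCH : MISMATCH;
            in_gap1 = in_gap2 = 0;
        }
    }
    return score;
}
```

With these values, one gap of length two (AC--GT against ACTTGT) scores 2+2-5-1+2+2 = 2, whereas two separate gaps of length one would each pay the creation penalty.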

PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.

BLOSUM (BLOcks SUbstitution Matrix): for finding common motifs. For example, BLOSUM62 is built from alignments of sequences sharing no more than 62% identity.

How the matrices were created:
- Very similar sequences were aligned.
- From these alignments, the frequency of substitution between each pair of amino acids was calculated, and PAM1 was built from these frequencies.
- The full series of PAM matrices is calculated by multiplying the PAM1 mutation probability matrix by itself; each result is then normalized to log-odds format.

>ORF00135 |chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHF
DRKHLNRFLRLLTHAGGATFTIVI
ACLLLFLYPSSVAYACAFSLAVSH
IPVAIAKKLYPRKRPYIQLKHTKV
LENPLKDHSFPSGHTTAIFSLVTP
LMIVYPAFAAVLLPLAVMVGISRI
YLGLHYPTDVMVGLILGIFSGAVA
LNIFLT

- Some matrices reflect similarity: good for database searching
- Some reflect distance: good for phylogenies
- Log-odds matrices, a normalization method for matrix values:

$$S_{ij} = \log\frac{q_{ij}}{p_i\,p_j} = \log q_{ij} - \log p_i - \log p_j$$

$S_{ij}$ is the logarithm of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.

$q_{ij}$ is the frequency with which i and j are observed to align in sequences known to be related.

$p_i$ and $p_j$ are their frequencies of occurrence in the set of sequences.
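As a worked example with hypothetical frequencies (the numbers below are invented for illustration, and base-2 logarithms are an assumption, since the slide leaves the base open): suppose $p_i = p_j = 0.1$, so that chance alone would align the pair with frequency $p_i p_j = 0.01$.

```latex
% Pair observed twice as often as expected by chance: positive score
q_{ij} = 0.02:\qquad S_{ij} = \log_2\frac{0.02}{0.1 \times 0.1} = \log_2 2 = 1

% Pair observed half as often as expected by chance: negative score
q_{ij} = 0.005:\qquad S_{ij} = \log_2\frac{0.005}{0.1 \times 0.1} = \log_2 \tfrac{1}{2} = -1
```

A positive entry therefore marks a substitution seen more often in related sequences than chance predicts, and a negative entry one seen less often.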

The most widely used local similarity algorithms are:
- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- BLAST (Basic Local Alignment Search Tool) and FASTA (Fast Alignment), which are based on k-tuple algorithms…

Speed: BLAST > FASTA > Smith-Waterman (which is VERY SLOW and uses a lot of computational power).

Sensitivity/statistics:
- FASTA is more sensitive to variations and misses fewer homologues.
- Smith-Waterman is even more sensitive.
- BLAST calculates probabilities.
- FASTA is more accurate than BLAST for DNA-DNA searches.

BLAST and FASTA variants: a comparison

FASTA: a DNA query to a DNA database, or a protein query to a protein database.

FASTX: a translated DNA query to a protein database.

TFASTA: a protein query to a translated DNA database.

BLASTN: a DNA query to a DNA database.

BLASTP: a protein query to a protein database.

BLASTX: 6-frame translations of a DNA query to a protein database.

TBLASTN: a protein query to the 6-frame translations of a DNA database.

TBLASTX: 6-frame translations of a DNA query to the 6-frame translations of a DNA database.

PSI-BLAST: performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.

BLAST results

Detailed BLAST results

E value: the expectation value, that is, the number of hits with a score at least as good as this one that you would expect to find by chance in a database of this size. The lower the E value, the more significant the score.

Genes are mostly named after the function of their protein. At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc.

Typical functions: enzymes (names end in -ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more!

Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: conserved hypothetical genes.

Every new genome has some genes that are unique: no matching BLAST hits in the database.

Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don't know; we call them hypothetical genes.

"Putative" means that we think we know the gene's function but we aren't sure. "Putative" should be followed by the function name.

(Basic Similarity) Homology Search, The Knuth-Morris-Pratt Algorithm (exact match)

All similarity searching methods rely on the concepts of alignment and of a distance measurement between sequences.

Each alignment is associated with a similarity score calculated from a distance matrix, for example the number of DNA bases that differ between the sequences.

Exact String Matching
The naive brute-force algorithm searches for a pattern p in a text t, with lengths m=|p| and n=|t|; its worst-case complexity is Θ(nm). It slides the pattern across the text from left to right, one step at a time. It is possible, however, to shift by more than one position once a certain length of the pattern has already been matched in the text.
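A minimal sketch of the brute-force search, for comparison with the algorithms that follow. The function name naive_search is hypothetical; the two nested loops are where the Θ(nm) worst case comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first occurrence of p in t, or -1 if there is none. */
long naive_search(const char *t, const char *p)
{
    size_t n, m;

    assert(t != NULL && p != NULL);
    n = strlen(t);
    m = strlen(p);
    if (m > n) return -1;

    for (size_t i = 0; i + m <= n; i++) {              /* try every alignment of p against t */
        size_t j = 0;
        while (j < m && t[i + j] == p[j]) j++;         /* compare left to right */
        if (j == m) return (long)i;                    /* all m symbols matched */
        /* On a mismatch the pattern moves by exactly one position,
           discarding everything learned from the j matched symbols. */
    }
    return -1;
}
```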

The Knuth-Morris-Pratt Algorithm - Complexity
The KMP algorithm takes into account the information gained during previous symbol comparisons, using the prefix and suffix definitions. It never re-compares a text symbol that has matched a pattern symbol.

The complexity of the searching phase is O(n).

Its preprocessing stage (on the pattern) has a complexity of O(m). Since m<=n, the overall complexity is O(n).

The Knuth-Morris-Pratt Algorithm - Definitions

Definition: Let A be an alphabet, and x be a string of length k over A, x = x_0 x_1 … x_{k-1}.
A prefix of x is a substring u = x_0 x_1 … x_{b-1}, where 0<=b<=k.
A suffix of x is a substring u = x_{k-b} x_{k-b+1} … x_{k-1}, where 0<=b<=k.
A prefix or suffix u of x is called proper if b<k.
A border of x is a substring that is both a proper prefix and a proper suffix of x; its length b is the width of the border.

Example: x=abacab.
Proper prefixes: є, a, ab, aba, abac, abaca.
Proper suffixes: є, b, ab, cab, acab, bacab.
Borders: є, ab, with widths 0 and 2.
The empty string є is always a border of x (for non-empty x), but є itself has no border.
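The border definition can be checked directly by brute force. This is only an illustration of the definition (the name widest_border is hypothetical); it is quadratic in the worst case, unlike the linear-time KMP preprocessing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Width of the widest border of x: the longest proper prefix of x
   that is also a proper suffix of x. */
size_t widest_border(const char *x)
{
    size_t k;

    assert(x != NULL);
    k = strlen(x);
    if (k == 0) return 0;
    /* Try candidate widths from the widest proper one (k-1) downwards. */
    for (size_t b = k - 1; b > 0; b--) {
        if (memcmp(x, x + k - b, b) == 0) return b;   /* prefix of width b equals suffix of width b */
    }
    return 0;   /* only the empty border remains */
}
```

For x=abacab this returns 2, the width of the border ab.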

0 1 2 3 4 5 6 7 8 9 ...
a b c a b c a b d
a b c a b d
      a b c a b d

The matching prefix is abcab, of length 5; the mismatch occurs at position 5, between c in the text and d in the pattern. The widest border of the matched prefix (the widest prefix of abcab that is also a suffix of it) is ab, with width b=2, so the shift distance is 5-2=3.

Theorem: Let r, s be borders of a string x, where |r| < |s|. Then r is a border of s.
Proof: r and s are both prefixes of x and |r| < |s|, so r is a proper prefix of s. In the same way, r and s are both suffixes of x, so r is a proper suffix of s. Hence r is a border of s.

The Knuth-Morris-Pratt Algorithm - Preprocessing: generating a look-up table

Definition: Let x be a string and a ∈ A a symbol. A border r of x can be extended by a if ra (r with a appended) is a border of xa.

j:     0  1  2  3  4  5  6
p[j]:  a  b  a  b  a  a
b[j]: -1  0  0  1  2  3  1

    // p: pattern, m: its length, b: border table with m+1 entries (all global)
    void kmpPreprocess()
    {
        int i=0, j=-1;
        b[i] = j;
        while (i<m)        // O(m), m = length of the pattern
        {
            // initially j=-1, i=0; always j<i
            while (j>=0 && p[i]!=p[j]) j=b[j];   // on a mismatch, fall back to the next-widest border, at most until j<0
            i++; j++;      // the border has been extended by one symbol (or is empty)
            b[i] = j;
        }
    }

The Knuth-Morris-Pratt Algorithm - Searching

    // t: text, n: its length; p, m, b as in kmpPreprocess()
    void kmpSearch()
    {
        int i=0, j=0;
        while (i<n)        // n = text length
        {
            while (j>=0 && t[i]!=p[j]) j=b[j];   // on a mismatch, fall back via the border table, at most until j=-1
            i++; j++;      // j can reach at most m, which means a match
            if (j==m)      // m = pattern length
            {
                report(i-j);
                j=b[j];
            }
        }
    }

Example:

0 1 2 3 4 5 6 7 8 9 ...
a b a b b a b a a
a b a b a c
    a b a b a c
        a b a b a c
          a b a b a c
              a b a b a c

Max number of comparisons: 2n.

Border width b[j] look-up table (pattern ababaa):

j:     0  1  2  3  4  5  6
p[j]:  a  b  a  b  a  a
b[j]: -1  0  0  1  2  3  1
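The two routines above rely on global variables and an unspecified report() function. A self-contained sketch that can be compiled as it stands might look as follows; the function names, the hits array used in place of report(), and the use of malloc for the border table are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Fills b[0..m]: b[i] is the width of the widest border of p[0..i-1], and b[0] = -1. */
static void kmp_preprocess(const char *p, int m, int *b)
{
    int i = 0, j = -1;
    b[0] = -1;
    while (i < m) {
        while (j >= 0 && p[i] != p[j]) j = b[j];   /* fall back to the next-widest border */
        i++; j++;
        b[i] = j;
    }
}

/* Stores up to max_hits start positions of p in t into hits.
   Returns the total number of occurrences, or -1 if allocation fails. */
int kmp_search(const char *t, const char *p, int *hits, int max_hits)
{
    int n = (int)strlen(t), m = (int)strlen(p);
    int i = 0, j = 0, count = 0;
    int *b;

    assert(m > 0);
    b = malloc((size_t)(m + 1) * sizeof *b);
    if (b == NULL) return -1;
    kmp_preprocess(p, m, b);

    while (i < n) {
        while (j >= 0 && t[i] != p[j]) j = b[j];   /* shift the pattern; i never moves back */
        i++; j++;
        if (j == m) {                              /* full match ending at text position i-1 */
            if (count < max_hits) hits[count] = i - j;
            count++;
            j = b[j];                              /* continue, allowing overlapping matches */
        }
    }
    free(b);
    return count;
}
```

Since i only ever increases, each text symbol is read once on the way forward, and the number of fall-back steps is bounded by the number of advances, which gives the 2n bound on comparisons.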

Boyer-Moore algorithm
The comparison starts from the rightmost symbol of the pattern. If that symbol does not match and the text symbol does not occur in the pattern at all, the pattern can be shifted by m.

The Boyer-Moore algorithm uses two different heuristics for determining the maximum possible shift distance in case of a mismatch: the "bad character" and the "good suffix" heuristics. Both heuristics can lead to a shift distance of m.

Left: bad character heuristics.

0 1 2 3 4 5 6 7 8 9 ...
a b b a d a b a c b a
b a b a c
          b a b a c

Right: a special case of the good suffix heuristics (borders matching). It needs preprocessing based on borders.

0 1 2 3 4 5 6 7 8 9 ...
a a b a b a b a c b a
a b b a b
      a b b a b

Occurrence function: occ(p, a) = max{ j | p_j = a }. For example, occ(text, x) = 2 and occ(text, t) = max{0, 3} = 3.

The pattern is shifted by the longer of the two distances given by the bad character and the good suffix heuristics.

In the best case, when the mismatch always occurs at the first symbol compared, only O(n/m) comparisons are required.

The preprocessing for the good suffix heuristics is rather difficult to understand and to implement.

Therefore, there are modified versions of the BM algorithm in which the good suffix heuristics are left out. The argument is that the bad character heuristics save most of the comparisons, while the good suffix heuristics save few additional ones.

For a search based on the bad character heuristics only, the Horspool algorithm or the Sunday algorithm suits better.

Bad-character mismatch of BM versus the Horspool algorithm and Sunday's algorithm

(a) Boyer-Moore
0 1 2 3 4 5 6 7 8 9 ...
a b c a b d b a c b a
b c b a b
  b c b a b

(b) Horspool
0 1 2 3 4 5 6 7 8 9 ...
a b c a b d b a c b a
b c b a b
    b c b a b

(c) Sunday
0 1 2 3 4 5 6 7 8 9 ...
a b c a b d b a c b a
b c b a b
            b c b a b

For Horspool, occ(p, a) is the rightmost position of a in p_0 ... p_{m-2}, or -1 if a does not occur there: occ(text, x) = 2, occ(text, t) = 0, occ(next, t) = -1.

An occurrence at the rightmost position (the last symbol of the pattern) is not taken into account. Best-case performance is O(n/m).
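A minimal sketch of the Horspool variant, assuming 8-bit characters. The function name horspool_search is hypothetical; the occ table follows the definition above, ignoring the last pattern symbol.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first occurrence of p in t, or -1 if there is none. */
long horspool_search(const char *t, const char *p)
{
    size_t n = strlen(t), m = strlen(p);
    long occ[UCHAR_MAX + 1];
    size_t i = 0;

    assert(m > 0);
    /* occ[a] = rightmost position of a in p[0..m-2], or -1 if a does not occur there. */
    for (size_t a = 0; a <= UCHAR_MAX; a++) occ[a] = -1;
    for (size_t j = 0; j + 1 < m; j++) occ[(unsigned char)p[j]] = (long)j;

    while (i + m <= n) {
        size_t j = m;
        while (j > 0 && p[j - 1] == t[i + j - 1]) j--;   /* compare right to left */
        if (j == 0) return (long)i;
        /* Shift so that the text symbol under the last pattern position lines up
           with its rightmost occurrence in p[0..m-2]; the shift is m if there is none. */
        i += (size_t)((long)m - 1 - occ[(unsigned char)t[i + m - 1]]);
    }
    return -1;
}
```

Because occ never exceeds m-2, the shift is always at least 1, so the loop terminates; when the text symbol does not occur in the pattern the shift is the full m, which is where the O(n/m) best case comes from.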

Sunday's algorithm uses the text symbol immediately to the right of the current window (here d). Since d does not occur in the pattern, the pattern is shifted past the d. The order of the comparisons inside the window is then free and can depend on the symbol probabilities, if they are known: the least probable symbol in the pattern is compared first, in the hope that it does not match, so that the pattern can be shifted.

Skip Search Algorithm: searches for the least likely pattern!
Minimize the false positive matches.
Information theoretic approach: repetitive subsequences (e.g. poly AT) have low information content, random..
Minimum Message Length (MML).. uses HMM (or PFSM)..

Next week: Aligning Two Strings

Each row of the alignment is numbered with the count of symbols of its sequence present up to a given position. For example, the sequences are represented as:

www.bioalgorithms.info\Winfried Just

Alignment as a Path in the Edit Graph

0 1 2 2 3 4 5 6 7 7
  A T _ G T T A T _
  A T C G T _ A _ C
0 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
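The correspondence between a path and an alignment can be sketched in code. This is a minimal sketch: the function name path_to_alignment is hypothetical, the two ungapped sequences in the usage note are read off the figure above, and the path is assumed to stay within the lengths of both sequences.

```c
#include <assert.h>
#include <string.h>

/* Converts a path in the edit graph into the two rows of an alignment.
   path holds len vertices (i, j); each step must be (+1,+1), (+1,0) or (0,+1).
   Gaps are written as '_'. row_v and row_w each need room for len chars.
   Returns 0 on success, -1 if the path contains an invalid step. */
int path_to_alignment(const char *v, const char *w, const int path[][2],
                      int len, char *row_v, char *row_w)
{
    assert(len >= 1);
    for (int k = 1; k < len; k++) {
        int i = path[k - 1][0], j = path[k - 1][1];
        int di = path[k][0] - i, dj = path[k][1] - j;

        if (di == 1 && dj == 1) {          /* diagonal edge: align v[i] with w[j] */
            row_v[k - 1] = v[i];
            row_w[k - 1] = w[j];
        } else if (di == 1 && dj == 0) {   /* only the first coordinate advances: gap in w */
            row_v[k - 1] = v[i];
            row_w[k - 1] = '_';
        } else if (di == 0 && dj == 1) {   /* only the second coordinate advances: gap in v */
            row_v[k - 1] = '_';
            row_w[k - 1] = w[j];
        } else {
            return -1;
        }
    }
    row_v[len - 1] = '\0';
    row_w[len - 1] = '\0';
    return 0;
}
```

Called with v = ATGTTAT, w = ATCGTAC and the ten vertices listed above, it yields the rows AT_GTTAT_ and ATCGT_A_C shown in the figure.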
