
Tutorial 2: Some problems in bioinformatics

1. Alignment
   Pairs of sequences
   Database searching for sequences
   Multiple sequence alignment
   Protein classification

2. Phylogeny prediction (tree construction)

Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/ and http://www.ncbi.nih.gov/BLAST/tutorial/Altschul-1.html
3) Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics

Alignment: pairs of sequences

DNA: A, G, C, T

protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

KQTGKG
|  |||
KSAGKG

TCGCA
|| ||
TC-CA

DNA to RNA to protein to phenotype


Concepts:

Similarity
Identity
Homology
Orthology
Paralogy

KQTGKGV
|  |||:
KSAGKGL

4/7 identical
5/7 similar
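The identity and similarity counts for the KQTGKGV / KSAGKGL alignment can be reproduced with a short sketch. The residue groups below are hypothetical, chosen only for illustration; a real analysis would usually take "similar" to mean a positive score in a substitution matrix.

```python
# Hypothetical groups of "similar" residues, for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"),
                  set("DE"), set("NQ"), set("ST")]


def is_similar(a, b):
    """True if two residues are identical or share an illustrative group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(seq1, seq2):
    """Count identical and similar columns of two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(a == b for a, b in zip(seq1, seq2))
    similar = sum(is_similar(a, b) for a, b in zip(seq1, seq2))
    return identical, similar


identical, similar = identity_and_similarity("KQTGKGV", "KSAGKGL")
print(f"{identical}/7 identical, {similar}/7 similar")
```

With these groups only the V:L column counts as similar without being identical, which matches the ":" in the alignment.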

Homology is based on evolutionary history

Figure 45 Lineage-specific expansions of domains and architectures of transcription factors. Top, specific families of transcription factors that have been expanded in each of the proteomes. Approximate numbers of domains identified in each of the (nearly) complete proteomes representing the lineages are shown next to the domains, and some of the most common architectures are shown. Some are shared by different animal lineages; others are lineage-specific.

A partial alignment of globin sequences.

Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.

- Fitch, W.M. 2000. Homology: a personal view on some of the problems. Trends Genet. 16: 227-231.

Homology, orthology and paralogy

orthologs diverged at a speciation event

paralogs diverged at a gene duplication event


Scoring schemes

Score = matches - mismatches - gaps

GKG-RRWDAKR
|||  ||
GKGAKRWESAP
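As a minimal sketch of the simple scheme Score = matches - mismatches - gaps, the function below counts each column type in an existing alignment. The unit weights are an assumption; choosing better weights is exactly the question this slide raises.

```python
def score_alignment(top, bottom, match=1, mismatch=1, gap=1):
    """Score an alignment as matches - mismatches - gaps (weights assumed)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap character
        elif a == b:
            matches += 1       # identical residues
        else:
            mismatches += 1    # two different residues
    score = match * matches - mismatch * mismatches - gap * gaps
    return score, matches, mismatches, gaps


print(score_alignment("GKG-RRWDAKR", "GKGAKRWESAP"))
```

For the alignment above this counts 5 matches, 5 mismatches and 1 gap.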

What is the best way to evaluate the contribution of each?

A partial alignment of globin sequences from Pfam.



Global vs. local alignment (end gaps are ignored in local alignment)

Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html

Dynamic programming

TCGCA
|| ||
TC-CA
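A minimal sketch of global alignment by dynamic programming (in the style of Needleman and Wunsch) is given below. It assumes unit scores (+1 match, -1 mismatch, -1 per gap position) and returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of s and t under assumed unit scores."""
    rows, cols = len(s) + 1, len(t) + 1
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap              # s[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap              # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = F[i - 1][j] + gap     # gap in t
            left = F[i][j - 1] + gap   # gap in s
            F[i][j] = max(diagonal, up, left)
    return F[-1][-1]


print(global_alignment_score("TCGCA", "TCCA"))
```

For TCGCA and TCCA the best score corresponds to the alignment shown above: four matches and one gap.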



Scoring schemes

Score = matches - mismatches - gaps

GKG-RRWDAKR
|||  ||
GKGAKRWESAP

"The dynamic programming algorithm was improved in performance by Gotoh (1982) by using the linear relationship for a gap weight w_x = g + rx, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm."

D. W. Mount
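The gap weight w_x = g + rx can be written as a one-line function. The penalty values used below (g = 11, r = 1) are assumptions for illustration only; they are not taken from the slides.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Gap weight w_x = g + r*x for a gap of length x (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# Assumed illustrative penalties: g = 11, r = 1.
one_long_gap = affine_gap_penalty(5, gap_open=11, gap_extend=1)
five_short_gaps = 5 * affine_gap_penalty(1, gap_open=11, gap_extend=1)
print(one_long_gap, five_short_gaps)
```

Under this scheme one gap of length 5 costs far less than five separate gaps of length 1, so alignments with a few long gaps are favored over alignments with many scattered ones.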

KQTGKG-RRWDAKR
|  |||     |||
KSAGKG-----AKR

VS.

Alignment: amino acid substitution matrices

Scoring schemes

"Any [scoring] matrix has an implicit amino acid pair frequency distribution that characterizes the alignments it is optimized for finding. More precisely, let p_i be the frequency with which amino acid i occurs in protein sequences and let q_ij be the frequency with which amino acids i and j are aligned within the class of alignments sought. Then, the scores that best distinguish these alignments from chance are given by the formula:

S_ij = log( q_ij / (p_i p_j) )

The base of the logarithm is arbitrary, affecting only the scale of the scores. Any set of scores useful for local alignment can be written in this form, so a choice of substitution matrices can be viewed as an implicit choice of 'target frequencies'"

- Altschul et al. 1994 (Nature Genetics 6:119)

Those frequencies are characteristic of the sequences being aligned, and are primarily a function of their degree of divergence.
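The log-odds formula can be sketched directly. The frequencies below are made-up numbers chosen to give round results, not values from any published matrix, and base 2 is an arbitrary choice, as the quotation notes.

```python
import math


def log_odds_score(q_ij, p_i, p_j, base=2.0):
    """S_ij = log(q_ij / (p_i * p_j)); the base only rescales the scores."""
    return math.log(q_ij / (p_i * p_j), base)


# Made-up frequencies for illustration only.
p_i = p_j = 0.05          # background frequencies of residues i and j
chance = p_i * p_j        # pair frequency expected if i and j align at random

print(log_odds_score(4 * chance, p_i, p_j))    # pair aligned 4x more often than chance
print(log_odds_score(0.5 * chance, p_i, p_j))  # pair aligned half as often as chance
```

A pair seen more often than chance gets a positive score and a pair seen less often gets a negative one, which is the sign pattern visible in matrices such as BLOSUM 62.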


Substitution matrices -- BLOSUM 62

Henikoff and Henikoff. 1992. Amino acid substitution matrices from protein blocks. PNAS 89: 10915-10919.



Alignment: implementations

FASTA
Introduces the concept of perfect k-tuple alignments to seed longer global alignments.

BLAST -- Basic Local Alignment Search Tool
Initiates an alignment locally and then extends that alignment.

GKG
|||
GKG

GKG-RRW
|||  ||
GKGAKRW
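The seeding step shared by both programs can be sketched as a lookup of exact k-tuple matches. This is a simplified illustration, not the actual FASTA or BLAST implementation: it assumes k = 3, exact matches only, and leaves out the extension step.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the two share an exact k-tuple."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Scan the subject and look each of its k-tuples up in the index.
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("GKGRRW", "GKGAKRW"))
```

For the two sequences above the only shared 3-tuple is GKG at the start of each, which is the seed from which the longer alignment is grown.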

Alignment: Searching databases for sequences

There are many modifications of BLAST for specific purposes.

The NCBI BLAST interface

Extreme value distribution: the expected distribution of the maximum of many independent random variables, generally Y = exp[-x - e^(-x)]

K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequencies of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition [12].
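To show how K and lambda are used, the sketch below assumes the Karlin-Altschul relation E = K m n e^(-lambda S), where m and n are the query and database lengths and S is the raw alignment score. The formula does not appear on the slide, and the parameter values are hypothetical placeholders, not values estimated by BLAST or FASTA.

```python
import math


def expected_hits(score, m, n, K, lam):
    """E-value under the assumed relation E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score S' = (lam * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(K)) / math.log(2)


# Hypothetical placeholder parameters, for illustration only.
K, lam = 0.1, 0.3
m, n = 250, 1_000_000     # assumed query length and database length

for raw in (40, 60, 80):
    print(raw, expected_hits(raw, m, n, K, lam), bit_score(raw, K, lam))
```

The number of hits expected by chance falls exponentially as the raw score rises, which is why modest score differences translate into large differences in significance.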

FASTA can be run at EMBL. The software is also available for download.

Alignment: Multiple sequence alignment

Alignment: Protein classification

Phylogeny prediction (tree construction)

Character-based Methods

Parsimony

Maximum Likelihood: the tree that maximizes the likelihood of seeing the data

Bayesian Analysis: the trees with the greatest posterior probability given the data

Distance Methods

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

Neighbor joining
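Of the two distance methods, UPGMA is the simpler to sketch: repeatedly merge the two closest clusters and average their distances to every other cluster, weighted by cluster size. The version below returns only the tree topology as nested tuples (no branch lengths), and the taxa and distances in the example are made up.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster taxa by UPGMA; distances maps (a, b) label pairs to distances."""
    size = {label: 1 for label in labels}      # cluster -> number of taxa in it
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Distance from the merged cluster to each other cluster is the
        # size-weighted average of the distances from a and from b.
        for c in list(size):
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Made-up distances between four hypothetical taxa.
example = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
           ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(["A", "B", "C", "D"], example))
```

UPGMA assumes a constant rate of change along all lineages; neighbor joining drops that assumption but needs a more involved pair-selection step that is not shown here.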

a, The interspecies relationships of five chromosome regions to corresponding DNA sequences in a chimpanzee and a gorilla. Most regions show humans to be most closely related to chimpanzees (red) whereas a few regions show other relationships (green and blue). b, The among-human relationships of the same regions are illustrated schematically for five individual chromosomes.

Within- and between-species variation along a single chromosome.

Tutorial III: Open problems in bioinformatics
Tentatively:

Detection of subtle signals
   promoter elements
   exon splicing enhancers
   noncoding RNAs
   weak protein similarities

Microarrays
Protein folding and homology modeling

Thursday, June 10, 2:00 - 3:45

Microarray expression data

Statistical analysis -- what has changed

Clustering -- which genes change together

Clustering -- promoter recognition

Clustering -- database integration

Phenotype determination (e.g. cancer prognosis)

Tutorial 2: Some problems in bioinformatics

1. Alignment
   Pairs of sequences
   Multiple sequence alignment
   Database searching for sequences
   Protein classification

2. Phylogeny prediction (tree construction)

3. Microarray expression data

4. Protein structure
   Protein folding
   Structure prediction
   Homology modeling

Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/
3) Cold Spring Harbor course in Computational Genomics (1999) Pearson
