It & Health 2009 Summary Thomas Nordahl Petersen


Post on 30-Dec-2015


Page 1: It & Health 2009 Summary

It & Health 2009 Summary

Thomas Nordahl Petersen

Page 2: It & Health 2009 Summary

Teachers

Thomas Nordahl Petersen

Rasmus Wernersson

Lisbeth Nielsen Fink

Anders Gorm Pedersen

Bent Petersen

Ramneek Gupta

Thomas Blicher

Page 3: It & Health 2009 Summary

Outline of the course

• Topics will cover a general introduction to bioinformatics
  – Evolution
  – DNA / Protein
  – Alignment and scoring matrices
    • How does it work & what are the numbers
  – Visualization of multiple alignments
    • Phylogenetic trees and logo plots
  – Commonly used databases
    • UniProt/GenBank & Genome browsers
  – Protein 3D-structure
  – Artificial neural networks & case stories
  – Practical use of bioinformatics tools
• Preparation for exam

Page 4: It & Health 2009 Summary

Topics covered - (some of them)

Page 5: It & Health 2009 Summary

Information flow in biological systems

Page 6: It & Health 2009 Summary

Amino Acids

Amine and carboxyl groups. The side chain ‘R’ is attached to the C-alpha carbon.

The amino acids found in living organisms are L-amino acids.

Page 7: It & Health 2009 Summary

Amino Acids - peptide bond

N-terminal C-terminal

Page 8: It & Health 2009 Summary

1 and 3-letter codes

1. There are 20 naturally occurring amino acids
2. Normally the one- or three-letter codes are used

Ala - A, Cys - C, Asp - D, Glu - E, Phe - F, Gly - G, His - H, Ile - I, Lys - K, Leu - L

Met - M, Asn - N, Pro - P, Gln - Q, Arg - R, Ser - S, Thr - T, Val - V, Trp - W, Tyr - Y
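The code table above maps directly onto a lookup dictionary. The following is a minimal sketch of converting between the two notations; the function names are illustrative and not part of the course material.

```python
# Three-letter to one-letter amino acid codes, as listed on the slide.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}
# Reverse lookup: one-letter to three-letter.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_codes):
    """Convert a list of three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter_codes)


def to_three_letter(sequence):
    """Convert a one-letter sequence into hyphen-separated three-letter codes."""
    return "-".join(ONE_TO_THREE[aa] for aa in sequence.upper())


print(to_one_letter(["Met", "Lys", "Trp"]))  # MKW
print(to_three_letter("NVIF"))               # Asn-Val-Ile-Phe
```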

Page 9: It & Health 2009 Summary

Center for Biological Sequence Analysis

Theory of evolution

Charles Darwin, 1809-1882

Page 10: It & Health 2009 Summary

Phylogenetic tree

Page 11: It & Health 2009 Summary

Global versus local alignments

Global alignment: align the full length of both sequences (the “Needleman-Wunsch” algorithm).

Local alignment: find the best partial alignment of two sequences (the “Smith-Waterman” algorithm).

Global alignment

Seq 1

Seq 2

Local alignment
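As an illustration of local alignment, here is a minimal sketch of the Smith-Waterman score computation. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example; real protein alignments would normally use a substitution matrix such as BLOSUM62 instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    The scoring scheme is a simple assumption for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# The shared core GATTACA (7 matches) is found despite unrelated flanks.
print(smith_waterman_score("XXGATTACAYY", "ZZGATTACAWW"))  # 14
```

Because scores are never allowed to drop below zero, the unrelated flanking residues do not penalize the conserved core.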

Page 12: It & Health 2009 Summary

Pairwise alignment: the solution

“Dynamic programming” (the Needleman-Wunsch algorithm)
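The dynamic programming idea can be sketched as follows: fill a table of best scores for all prefix pairs, then trace back from the bottom-right corner. This is a minimal sketch, assuming a simple scoring scheme (match +2, mismatch -1, linear gap -2) rather than a substitution matrix and affine gaps.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return F[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_a, aligned_b = needleman_wunsch("GATTACA", "GATACA")
print(score)      # 10: six matches (+12) and one gap (-2)
print(aligned_a)
print(aligned_b)
```

When several alignments share the optimal score, the traceback returns just one of them; which one depends on the order in which the three moves are checked.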

Page 13: It & Health 2009 Summary

Sequence alignment - BLAST

Page 14: It & Health 2009 Summary

Sequence alignment - BLAST

Page 15: It & Health 2009 Summary

BLOSUM & PAM matrices

• BLOSUM matrices are the most commonly used substitution matrices.
• BLOSUM50, BLOSUM62, BLOSUM80
• PAM - Point Accepted Mutation
• PAM-0 is the identity matrix.
• PAM-1: diagonal entries deviate slightly from 1, off-diagonal entries deviate slightly from 0.
• PAM-250 is PAM-1 raised to the 250th power (PAM-1 applied 250 times in succession).
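The relation between PAM-0, PAM-1 and PAM-250 can be illustrated with matrix powers. The sketch below uses a made-up 3-letter alphabet and a hypothetical mutation matrix `TOY_PAM1`; the real PAM-1 matrix is 20 x 20 and was estimated from observed mutations in closely related proteins.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """Return M raised to the n-th power; n = 0 gives the identity matrix."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical "PAM-1" for a 3-letter alphabet: each row sums to 1,
# diagonal close to 1, off-diagonal close to 0.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

pam0 = mat_power(TOY_PAM1, 0)      # identity: no mutations accepted yet
pam250 = mat_power(TOY_PAM1, 250)  # 250 successive applications of PAM-1

print(pam0[0])
print([round(x, 3) for x in pam250[0]])
```

After 250 steps the rows are still probability distributions, but the diagonal has dropped far below 1, reflecting the larger evolutionary distance.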

Page 16: It & Health 2009 Summary

Sequence profiles (1J2J.B)

>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK

Page 17: It & Health 2009 Summary

Log-odds scores

• BLOSUM is a log-likelihood (log-odds) matrix:
• Likelihood of observing j given you have i is
  – $P(j|i) = P_{ij}/Q_i$, where $Q_i$ is the frequency of i
• The prior likelihood of observing j is
  – $Q_j$, which is simply the frequency of j
• The log-likelihood score is
  – $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$
  – where $\log_2(x) = \ln(x)/\ln(2)$
  – S has been normalized to half bits, hence the factor 2
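A small worked example of the log-odds score in half bits is shown below. The pair and background frequencies are made-up numbers for illustration, not values from the BLOSUM data; published BLOSUM matrices additionally round each score to the nearest integer.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """S_ij = 2 * log2(P_ij / (Q_i * Q_j)), i.e. the log-odds score in half bits."""
    return 2 * math.log2(p_ij / (q_i * q_j))


# Hypothetical frequencies: the pair (i, j) is observed twice as often as
# expected by chance (0.02 versus 0.1 * 0.1 = 0.01), giving a positive score.
print(log_odds_half_bits(0.02, 0.1, 0.1))

# A pair observed half as often as expected by chance gets a negative score.
print(log_odds_half_bits(0.005, 0.1, 0.1))
```

Positive scores mark substitutions seen more often than chance predicts, negative scores mark substitutions seen less often.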

Page 18: It & Health 2009 Summary

BLAST Exercise

Page 19: It & Health 2009 Summary

Genome browsers - UCSC

Intron - Exon structure

Single nucleotide polymorphism - SNP

Page 20: It & Health 2009 Summary

SNPs

Page 21: It & Health 2009 Summary

Protein 3D-structure

Page 22: It & Health 2009 Summary

Protein structure

Primary structure: Amino acid sequence

Secondary structure: Helix / beta sheet

Tertiary structure: Fold, 3D coordinates

Page 23: It & Health 2009 Summary

Protein structure - α-helix

3-10 helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
Pi-helix: 4.1 residues/turn - very rare
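The residues-per-turn figures translate directly into a rotation per residue, since one full turn is 360 degrees. A minimal sketch of that arithmetic, using the three values quoted above (the dictionary keys are just labels chosen for this example):

```python
# Residues per turn for the three helix types quoted on the slide.
HELIX_RESIDUES_PER_TURN = {
    "3-10 helix": 3.0,
    "alpha-helix": 3.6,
    "pi-helix": 4.1,
}


def degrees_per_residue(residues_per_turn):
    """Rotation around the helix axis contributed by each residue."""
    return 360.0 / residues_per_turn


for name, per_turn in HELIX_RESIDUES_PER_TURN.items():
    print(f"{name}: {degrees_per_residue(per_turn):.1f} degrees per residue")
```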

Page 24: It & Health 2009 Summary

Protein structure - β-strand/sheet

Page 25: It & Health 2009 Summary

Protein folds

Class: the 4th class is ‘few secondary structures’

Architecture: overall shape of a domain

Topology: shared secondary structure connectivity

Page 26: It & Health 2009 Summary

Protein 3D-structure

Page 27: It & Health 2009 Summary

Neural Networks - From knowledge to information

Protein sequence Biological feature

Page 28: It & Health 2009 Summary

Use of artificial neural networks

• A data-driven method to predict a feature, given a set of training data
• In biology, input features could be amino acid sequences or nucleotides
• Secondary structure prediction
• Signal peptide prediction
• Surface accessibility
• Propeptide prediction

[Figure: protein from N- to C-terminus: signal peptide, propeptide, mature/active protein]
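Before an amino acid sequence can be fed to a neural network it has to be turned into numbers. One common choice, assumed here for illustration and not necessarily the encoding used by the tools in the course, is a sparse (one-hot) encoding of a sliding window around the residue to be predicted. The window size and function names below are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes


def one_hot(residue):
    """Encode one residue as 20 numbers: a single 1, the rest 0."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def encode_window(sequence, center, half_width=1):
    """Encode the residues around position `center` as one flat input vector.

    Window positions that fall outside the sequence are encoded as all zeros.
    """
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            vector.extend(one_hot(sequence[pos]))
        else:
            vector.extend([0] * len(AMINO_ACIDS))
    return vector


# Window of 3 residues (V, I, F) around position 2 of a short peptide.
inputs = encode_window("NVIFEDEEK", center=2)
print(len(inputs))  # 60 input values: 3 positions x 20 amino acids
print(sum(inputs))  # 3: exactly one "1" per window position
```

Each training example would pair such an input vector with the known feature (for example helix or not helix) of the central residue.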

Page 29: It & Health 2009 Summary

Prediction of biological features - Surface accessibility

Predict surface accessibility from amino acid sequence only.

Page 30: It & Health 2009 Summary

Logo plots

Information content: how is it calculated, and what does it mean?

Page 31: It & Health 2009 Summary

Logo plots - Information Content

Sequence-logo

Calculate Information Content

$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits

• Total height at a position is the ‘Information Content’ measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.

~0.5 each

Completely conserved
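The information content formula can be checked with a few lines of code. This is a minimal sketch for DNA (alphabet size 4); the example columns are made up. Reading the "~0.5 each" annotation as a column with two nucleotides at frequency 0.5 each is an assumption.

```python
import math


def information_content(freqs, alphabet_size=4):
    """I = sum_a p_a * log2(p_a) + log2(alphabet_size), in bits.

    `freqs` maps each letter to its frequency in one alignment column;
    letters with frequency 0 contribute nothing to the sum.
    """
    return sum(p * math.log2(p) for p in freqs.values() if p > 0) + math.log2(alphabet_size)


def letter_heights(freqs, alphabet_size=4):
    """Height of each letter in the logo: its frequency times the column's I."""
    total = information_content(freqs, alphabet_size)
    return {letter: p * total for letter, p in freqs.items()}


print(information_content({"A": 1.0}))            # completely conserved: 2.0 bits
print(information_content({"A": 0.5, "G": 0.5}))  # two letters at 0.5 each: 1.0 bit
print(information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))  # no information: 0.0 bits
print(letter_heights({"A": 0.5, "G": 0.5}))       # each letter is 0.5 bits tall
```

For protein logos the same function applies with `alphabet_size=20`, which raises the maximum to log2(20), about 4.32 bits.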