bioinformatics and gene discovery



Page 1: BIOINFORMATICS AND GENE DISCOVERY

BIOINFORMATICS AND GENE DISCOVERY

Iosif Vaisman

1998

UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL

Bioinformatics Tutorials

Page 2: BIOINFORMATICS AND GENE DISCOVERY
Page 3: BIOINFORMATICS AND GENE DISCOVERY

From genes to proteins

Page 4: BIOINFORMATICS AND GENE DISCOVERY

From genes to proteins

DNA [PROMOTER ELEMENTS] -> TRANSCRIPTION -> RNA [SPLICE SITES] -> SPLICING -> mRNA [START CODON, STOP CODON] -> TRANSLATION -> PROTEIN

Page 5: BIOINFORMATICS AND GENE DISCOVERY

From genes to proteins

Page 6: BIOINFORMATICS AND GENE DISCOVERY

Comparative Sequence Sizes (base pairs)

• Yeast chromosome 3 350,000

• Escherichia coli (bacterium) genome 4,600,000

• Largest yeast chromosome now mapped 5,800,000

• Entire yeast genome 15,000,000

• Smallest human chromosome (Y) 50,000,000

• Largest human chromosome (1) 250,000,000

• Entire human genome 3,000,000,000

Page 7: BIOINFORMATICS AND GENE DISCOVERY

Low-resolution physical map of chromosome 19

Page 8: BIOINFORMATICS AND GENE DISCOVERY

Chromosome 19 gene map

Page 9: BIOINFORMATICS AND GENE DISCOVERY

Computational Gene Prediction

•Where are genes unlikely to be located?

•How do transcription factors know where to bind a region of DNA?

•Where are the transcription, splicing, and translation start and stop signals?

•What does a coding region do that non-coding regions do not?

•Can we learn from examples?

•Does this sequence look familiar?
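The question about translation start and stop signals can be illustrated with a very small open reading frame (ORF) scan. This is only a sketch under simplifying assumptions that are not from the slides: it looks at the forward strand only, treats ATG as the sole start codon, uses the three standard stop codons, and ignores splicing entirely. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for forward-strand ORFs.

    start is the index of the A in the ATG start codon; end is one past
    the last base of the stop codon. min_codons counts the codons before
    the stop codon (the start codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


orfs = find_orfs("CCATGAAATGACC")
```

Real gene finders combine such signal scans with statistical evidence about coding regions, because start and stop codons alone occur far too often by chance.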

Page 10: BIOINFORMATICS AND GENE DISCOVERY

Artificial Intelligence in Biosciences

Neural Networks (NN)

Genetic Algorithms (GA)

Hidden Markov Models (HMM)

Stochastic context-free grammars (SCFG)

Page 11: BIOINFORMATICS AND GENE DISCOVERY

Information Theory

0 1

1 bit

Page 12: BIOINFORMATICS AND GENE DISCOVERY

Information Theory

00 01 10 11

1 bit

1 bit

Page 13: BIOINFORMATICS AND GENE DISCOVERY

Information Theory

1 bit

1 bit
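The bit-counting idea on these slides can be checked numerically: one bit distinguishes two equally likely symbols, and two bits distinguish four, which is the case for the four nucleotides. The sketch below computes Shannon entropy in bits per symbol from observed symbol frequencies; the function name `shannon_entropy` is an illustrative choice, not something from the slides.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the symbols in seq, in bits per symbol."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


two_symbols = shannon_entropy("01")      # two equally likely symbols
four_symbols = shannon_entropy("ACGT")   # four equally likely nucleotides
```

A sequence with a skewed base composition has an entropy below two bits per base, which is one way to quantify how far it departs from random.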

Page 14: BIOINFORMATICS AND GENE DISCOVERY

Scientific Models

Mechanistic models

Predictive power

Elegance

Consistency

Stochastic models

Predictive power

Hidden Markov models

Mechanism Black box

Stochastic mechanism

Physical models -- Mathematical models

Page 15: BIOINFORMATICS AND GENE DISCOVERY

Neural Networks

•interconnected assembly of simple processing elements (units or nodes)

•node functionality is similar to that of an animal neuron

•processing ability is stored in the inter-unit connection strengths (weights)

•weights are obtained by a process of adaptation to, or learning from, a set of training patterns
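To make "weights are learned from training patterns" concrete, here is a minimal sketch of a single threshold unit trained with the classic perceptron rule. The perceptron is chosen here only as the simplest illustration; the slides do not say which network or learning rule is meant. The logical AND training set, the learning rate, and the function names are all assumptions.

```python
def train_perceptron(patterns, epochs=20, rate=0.5):
    """Learn weights and a bias for one threshold unit.

    patterns is a list of (inputs, target) pairs with targets 0 or 1.
    """
    weights = [0.0] * len(patterns[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output
            # Adapt each connection strength in proportion to its input.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Output of the trained unit for one input pattern."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


# Hypothetical training patterns: the logical AND of two inputs.
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_PATTERNS)
```

After training, everything the unit "knows" lives in `weights` and `bias`, which is the sense in which processing ability is stored in the connection strengths.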

Page 16: BIOINFORMATICS AND GENE DISCOVERY

Genetic Algorithms

Search or optimization methods using simulated evolution.

A population of potential solutions is subjected to natural selection, crossover, and mutation.

choose initial population
evaluate each individual's fitness
repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
until terminating condition
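The loop above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the toy fitness function (count of 1-bits, often called OneMax), tournament selection of size two, keeping the best individual unchanged in each generation (elitism), and all parameter values and function names.

```python
import random


def fitness(individual):
    """Toy fitness: the number of 1-bits in the individual."""
    return sum(individual)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover returning two children."""
    point = rng.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(individual, rate, rng):
    """Flip each bit independently with probability rate."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]


def select(population, rng):
    """Tournament of size two: the fitter of two random individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def run_ga(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        if fitness(best) == length:  # terminating condition
            break
        next_gen = [best]  # elitism (an assumption, not in the slides)
        while len(next_gen) < pop_size:
            child_ab, child_ba = crossover(select(population, rng),
                                           select(population, rng), rng)
            next_gen.append(mutate(child_ab, mutation_rate, rng))
            if len(next_gen) < pop_size:
                next_gen.append(mutate(child_ba, mutation_rate, rng))
        population = next_gen
    return max(population, key=fitness), history


best, history = run_ga()
```

Because the best individual is always carried over, the values in `history` never decrease; without elitism, a good solution can be lost to crossover or mutation.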

Page 17: BIOINFORMATICS AND GENE DISCOVERY

Crossover

Parent A

Parent B

crossover point

Child AB

Child BA

Mutation
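The two operators named on this slide can be shown on short strings. In this sketch the parents are cut at a fixed crossover point and their tails are swapped, giving Child AB and Child BA; mutation then changes a single position. The string representation and the function names are assumptions chosen to mirror the slide labels.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length parents at the crossover point."""
    child_ab = parent_a[:point] + parent_b[point:]
    child_ba = parent_b[:point] + parent_a[point:]
    return child_ab, child_ba


def point_mutation(individual, position, new_value):
    """Replace the symbol at one position."""
    return individual[:position] + new_value + individual[position + 1:]


child_ab, child_ba = single_point_crossover("AAAAAAAA", "BBBBBBBB", 3)
mutant = point_mutation(child_ab, 0, "B")
```

Crossover recombines material that already exists in the population, while mutation is the only operator that can introduce a symbol not present at a given position in either parent.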

Page 18: BIOINFORMATICS AND GENE DISCOVERY

Markov Model (or Markov Chain)

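Example sequence: ATCTAG

The probability of each character depends only on a fixed number of preceding characters in the sequence.

Number of preceding characters = order of the Markov Model

Probability of a sequence (second-order model, where P[x,y,z] is the probability of z given the preceding characters x, y):

P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]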
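The product above can be evaluated directly once the conditional probabilities are known. The sketch below assumes a second-order model and uses made-up probability tables: uniform (0.25) everywhere except one context, which is skewed purely for illustration. In practice the tables would be estimated from training sequences. The function and variable names are hypothetical.

```python
BASES = "ACGT"

# Hypothetical tables: uniform probabilities as a starting point.
initial = {b: 0.25 for b in BASES}                                   # P[x]
first_order = {a: {b: 0.25 for b in BASES} for a in BASES}          # P[x,y]
second_order = {a + b: {c: 0.25 for c in BASES}
                for a in BASES for b in BASES}                       # P[x,y,z]

# Illustrative skew: after "AT", a C is much more likely than other bases.
second_order["AT"] = {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}


def sequence_probability(seq):
    """P(s) under the second-order Markov model defined by the tables."""
    p = initial[seq[0]]
    if len(seq) > 1:
        p *= first_order[seq[0]][seq[1]]
    for i in range(2, len(seq)):
        # Condition each character on the two characters before it.
        p *= second_order[seq[i - 2:i]][seq[i]]
    return p


p_s = sequence_probability("ATCTAG")
```

For long sequences the product underflows quickly, so implementations usually sum log-probabilities instead of multiplying raw probabilities.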

Page 19: BIOINFORMATICS AND GENE DISCOVERY

Hidden Markov Models

States -- well-defined conditions

Edges -- transitions between the states

[Figure: state diagram whose states are labeled with nucleotides (A, T, C, G), shown with the example sequence ATGACATTACACGACACTAC]

Each transition is assigned a probability.

Probability of the sequence:

single path with the highest probability -- Viterbi path

sum of the probabilities over all paths -- forward algorithm (used in Baum-Welch training)

Page 20: BIOINFORMATICS AND GENE DISCOVERY

Hidden Markov Model of Biased Coin Tosses

• States (Si): Two Biased Coins {C1, C2}

• Outputs (Oj): Two Possible Outputs {H, T}

• p(Outputs Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)

• Transitions: From State X to Y {A11, A22, A12, A21}

• p(Initial Si): p(I, C1), p(I, C2)

• p(End Si): p(C1, E), p(C2, E)
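The two-coin model listed above is small enough to work through in code. The sketch below computes both quantities used for HMMs: the single most probable state path (Viterbi) and the total probability of the observations summed over all paths (forward algorithm). All probability values are invented for illustration, and the end-state probabilities p(C1, E) and p(C2, E) are left out to keep the example short.

```python
STATES = ("C1", "C2")

# Hypothetical parameters; none of these values come from the slides.
INITIAL = {"C1": 0.5, "C2": 0.5}
TRANSITION = {"C1": {"C1": 0.9, "C2": 0.1},
              "C2": {"C1": 0.1, "C2": 0.9}}
EMISSION = {"C1": {"H": 0.9, "T": 0.1},
            "C2": {"H": 0.1, "T": 0.9}}


def forward(observations):
    """Probability of the observations, summed over all state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {s: EMISSION[s][obs]
                    * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


def viterbi(observations):
    """Return (probability, path) of the single most probable state path."""
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], [s])
            for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(((best[r][0] * TRANSITION[r][s], best[r][1])
                              for r in STATES), key=lambda t: t[0])
            new_best[s] = (prob * EMISSION[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])


viterbi_prob, viterbi_path = viterbi("HHTT")
total_prob = forward("HHTT")
```

The Viterbi probability can never exceed the forward probability, since the best path is one of the paths being summed. The coin in use is the hidden part: only the H/T outputs are observed.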

Page 21: BIOINFORMATICS AND GENE DISCOVERY

Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)

Page 22: BIOINFORMATICS AND GENE DISCOVERY

GRAIL gene identification program

POSSIBLE EXONS -> REFINED EXON POSITIONS -> FINAL EXON CANDIDATES

Page 23: BIOINFORMATICS AND GENE DISCOVERY

Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)

Page 24: BIOINFORMATICS AND GENE DISCOVERY

Measures of Prediction Accuracy

Nucleotide Level

[Figure: REALITY and PREDICTION tracks of coding (c) and non-coding (nc) regions, with aligned segments labeled TP, FP, TN, and FN]

                   REALITY
                   c     nc
PREDICTION   c     TP    FP
             nc    FN    TN

Sensitivity: Sn = TP / (TP + FN)

Specificity: Sp = TP / (TP + FP)
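The nucleotide-level measures can be computed by comparing two per-base annotations of the same sequence. In this sketch each annotation is a string with one character per nucleotide, "c" for coding and "n" for non-coding; that encoding, the example annotations, and the function name are assumptions made for illustration.

```python
def nucleotide_accuracy(reality, prediction):
    """Count TP, FP, FN, TN per base and derive Sn and Sp.

    Both arguments are equal-length strings of 'c' (coding) and
    'n' (non-coding), one character per nucleotide.
    """
    pairs = list(zip(reality, prediction))
    tp = sum(r == "c" and p == "c" for r, p in pairs)
    fp = sum(r == "n" and p == "c" for r, p in pairs)
    fn = sum(r == "c" and p == "n" for r, p in pairs)
    tn = sum(r == "n" and p == "n" for r, p in pairs)
    # Guard against empty denominators (no coding bases, or none predicted).
    sn = tp / (tp + fn) if tp + fn else None
    sp = tp / (tp + fp) if tp + fp else None
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "Sn": sn, "Sp": sp}


# Hypothetical annotations: the prediction is shifted by one base.
result = nucleotide_accuracy("nncccccnnn", "nnncccccnn")
```

Note that Sp as defined on the slide, TP / (TP + FP), does not use TN at all, so a predictor is not rewarded for the large non-coding majority of a genome.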

Page 25: BIOINFORMATICS AND GENE DISCOVERY

Measures of Prediction Accuracy

Exon Level

[Figure: REALITY and PREDICTION exon tracks showing a wrong exon, a correct exon, and a missing exon]

Sensitivity: Sn = number of correct exons / number of actual exons

Specificity: Sp = number of correct exons / number of predicted exons
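The exon-level measures can be sketched the same way, with exons given as (start, end) coordinate pairs. Three assumptions are made here that the slide does not spell out: a predicted exon counts as correct only if both boundaries match exactly, a missing exon is an actual exon that no prediction overlaps, and a wrong exon is a predicted exon that overlaps no actual exon. The coordinates and names are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share any position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual_exons, predicted_exons):
    """Exon-level Sn and Sp plus counts of missing and wrong exons."""
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    correct = actual & predicted  # exact match on both boundaries
    missing = [a for a in actual
               if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted
             if not any(overlaps(p, a) for a in actual)]
    return {
        "Sn": len(correct) / len(actual) if actual else None,
        "Sp": len(correct) / len(predicted) if predicted else None,
        "missing": len(missing),
        "wrong": len(wrong),
    }


# Hypothetical coordinates: one exact hit, one boundary error,
# one missed exon, and one spurious prediction.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 800)]
exon_result = exon_accuracy(actual, predicted)
```

The exact-match rule is strict: the prediction (310, 400) overlaps a real exon almost entirely but still does not count as correct, which is why exon-level scores are usually lower than nucleotide-level scores for the same predictor.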

Page 26: BIOINFORMATICS AND GENE DISCOVERY

GeneMark Accuracy Evaluation

Page 27: BIOINFORMATICS AND GENE DISCOVERY

Gene Discovery Exercise

http://metalab.unc.edu/pharmacy/Bioinfo/Gene

Bibliography

http://linkage.rockefeller.edu/wli/gene/list.html

and

http://www-hto.usc.edu/software/procrustes/fans_ref/