BIOINFORMATICS AND GENE DISCOVERY

Iosif Vaisman

1998

UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL

Bioinformatics Tutorials

From genes to proteins

[Diagram: DNA -- TRANSCRIPTION -- RNA -- SPLICING -- mRNA -- TRANSLATION -- PROTEIN. Labeled signals: promoter elements, splice sites, start codon, stop codon]


Comparative Sequence Sizes (base pairs)

• Yeast chromosome 3 350,000

• Escherichia coli (bacterium) genome 4,600,000

• Largest yeast chromosome now mapped 5,800,000

• Entire yeast genome 15,000,000

• Smallest human chromosome (Y) 50,000,000

• Largest human chromosome (1) 250,000,000

• Entire human genome 3,000,000,000

[Figure: Low-resolution physical map of chromosome 19]

[Figure: Chromosome 19 gene map]

Computational Gene Prediction

• Where are genes unlikely to be located?

• How do transcription factors know where to bind a region of DNA?

• Where are the transcription, splicing, and translation start and stop signals?

• What do coding regions do that non-coding regions do not?

• Can we learn from examples?

• Does this sequence look familiar?

Artificial Intelligence in Biosciences

Neural Networks (NN)

Genetic Algorithms (GA)

Hidden Markov Models (HMM)

Stochastic context-free grammars (SCFG)

Information Theory

[Figure: a choice between two outcomes (0 or 1) takes 1 bit; a choice among four outcomes (00, 01, 10, 11) takes 1 bit + 1 bit = 2 bits]
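The bit counts on the information theory slides follow from the rule that picking one of N equally likely outcomes takes log2(N) bits. A minimal Python sketch of that rule, plus the Shannon entropy of a symbol string; the function names are illustrative and not taken from the slides:

```python
import math
from collections import Counter

def bits_for_choices(n_outcomes: int) -> float:
    """Bits needed to pick one of n equally likely outcomes."""
    return math.log2(n_outcomes)

def shannon_entropy(sequence: str) -> float:
    """Average information per symbol (in bits), from observed symbol frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    # p * log2(1/p), summed over the distinct symbols
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(bits_for_choices(2))      # two outcomes: 0, 1
print(bits_for_choices(4))      # four outcomes: 00, 01, 10, 11
print(shannon_entropy("ACGT"))  # four equally frequent bases
print(shannon_entropy("AAAA"))  # a single repeated base carries no information
```

By the same rule, a DNA position with four equally likely bases carries at most 2 bits.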

Scientific Models

Physical models -- Mathematical models

Mechanistic models: predictive power, elegance, consistency

Stochastic models: predictive power

Hidden Markov models: a stochastic mechanism, between a mechanism and a black box

Neural Networks

• interconnected assembly of simple processing elements (units or nodes)

• node functionality is similar to that of the animal neuron

• processing ability is stored in the inter-unit connection strengths (weights)

• weights are obtained by a process of adaptation to, or learning from, a set of training patterns
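To make "weights learned from training patterns" concrete, here is a minimal sketch of a single unit trained with the classic perceptron rule on the logical AND function. The learning rule, learning rate, epoch count, and training set are assumptions chosen for illustration; the slides do not specify a particular network or training method.

```python
def step(x: float) -> int:
    """Threshold activation: the unit fires (1) only for positive input."""
    return 1 if x > 0 else 0

def predict(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the threshold."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(patterns, epochs=20, rate=1.0):
    """Adapt the weights to the training patterns with the perceptron rule."""
    weights = [0.0] * len(patterns[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            error = target - predict(weights, bias, inputs)
            # Strengthen or weaken each connection in proportion to its input
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Training patterns: (inputs, desired output) for logical AND
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_PATTERNS)
for inputs, target in AND_PATTERNS:
    print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

After training, everything the unit "knows" about AND is held in `weights` and `bias`, which is the sense in which processing ability is stored in the connection strengths.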

Genetic Algorithms

Search or optimization methods using simulated evolution.

Population of potential solutions is subjected to natural selection, crossover, and mutation.

choose initial population
evaluate each individual's fitness
repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
until terminating condition
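The loop above can be sketched in Python as follows. The objective (count the 1 bits in a bit string), tournament selection, one-point crossover, and all parameter values are assumptions made for the sake of a small runnable example; the slides do not prescribe any of them.

```python
import random

def fitness(individual):
    """Toy objective (an assumption): the number of 1 bits."""
    return sum(individual)

def select(population, rng):
    """Tournament selection: the fitter of two random individuals reproduces."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent_a, parent_b, point):
    """One-point crossover: the parents swap tails after the crossover point."""
    child_ab = parent_a[:point] + parent_b[point:]
    child_ba = parent_b[:point] + parent_a[point:]
    return child_ab, child_ba

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def run_ga(n_bits=20, pop_size=30, generations=50, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    # choose initial population and evaluate each individual's fitness
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # terminating condition: perfect score (or the generation limit)
        if fitness(best) == n_bits:
            break
        next_population = []
        while len(next_population) < pop_size:
            # select individuals to reproduce and mate pairs at random
            parent_a, parent_b = select(population, rng), select(population, rng)
            point = rng.randint(1, n_bits - 1)
            for child in crossover(parent_a, parent_b, point):
                next_population.append(mutate(child, mutation_rate, rng))
        population = next_population[:pop_size]
        # evaluate fitness and remember the best individual seen so far
        best = max(population + [best], key=fitness)
    return best

best = run_ga()
print(fitness(best), best)
```

Keeping the best individual seen so far is a convenience of this sketch, not part of the loop on the slide.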

Crossover

[Figure: Parent A and Parent B are cut at the crossover point and recombined into Child AB and Child BA]

Mutation

Markov Model (or Markov Chain)

A GATCT

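The probability of each character depends only on a few preceding characters in the sequence.

Number of preceding characters = order of the Markov model

Probability of a sequence under a second-order model (each factor P[x,y,z] is read here as the probability of z following x,y; the first two factors use the shorter context available at the start of the sequence):

P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]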
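A minimal sketch of how such a product could be computed, assuming each factor is a conditional probability estimated from counts in training sequences. The training sequences below are made up for illustration, and the function names are not from the slides. The characters in the product above, read in order, spell ATCTAG, so that string is used as the query.

```python
from collections import defaultdict

def train_markov(sequences, order):
    """Count how often each character follows each context of 0..order characters."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, char in enumerate(seq):
            for k in range(order + 1):
                if i - k >= 0:
                    counts[seq[i - k:i]][char] += 1
    return counts

def conditional(counts, context, char):
    """Estimate P(char | context) as a relative frequency; 0.0 for unseen contexts."""
    following = counts.get(context, {})
    total = sum(following.values())
    return following.get(char, 0) / total if total else 0.0

def sequence_probability(counts, seq, order):
    """Multiply the conditional probabilities of each character given its context."""
    prob = 1.0
    for i, char in enumerate(seq):
        # Near the start of the sequence fewer than `order` characters are available
        context = seq[max(0, i - order):i]
        prob *= conditional(counts, context, char)
    return prob

# Hypothetical training data, for illustration only
training = ["ATCTAGATCTAG", "GATCTAGATC", "ATCTAGTTAG"]
model = train_markov(training, order=2)
print(sequence_probability(model, "ATCTAG", order=2))
```

A practical model would add pseudocounts so that a context missing from the training data does not force the whole product to zero.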

Hidden Markov Models

States -- well-defined conditions

Edges -- transitions between the states

[Figure: state diagram with nucleotide-labeled states (A, T, C, G) connected by transition edges]

ATGACATTACACGACACTAC

Each transition is assigned a probability.

Probability of the sequence:

• single path with the highest probability -- Viterbi path

• sum of the probabilities over all paths -- forward algorithm (the quantity used in Baum-Welch training)

Hidden Markov Model of Biased Coin Tosses

• States (Si): Two Biased Coins {C1, C2}

• Outputs (Oj): Two Possible Outputs {H, T}

• p(Outputs Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)

• Transitions: From State X to Y {A11, A22, A12, A21}

• p(Initial Si): p(I, C1), p(I, C2)

• p(End Si): p(C1, E), p(C2, E)
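The parameter list above is enough to score a sequence of tosses. The sketch below fills in made-up numbers for every probability (none of the values come from the slides) and computes both quantities mentioned for hidden Markov models: the single most probable state path (Viterbi) and the sum over all paths (forward algorithm).

```python
STATES = ["C1", "C2"]

# All numbers below are hypothetical, chosen only so that the example runs
INITIAL = {"C1": 0.5, "C2": 0.5}                      # p(I, C1), p(I, C2)
TRANS = {"C1": {"C1": 0.8, "C2": 0.1},                # A11, A12
         "C2": {"C1": 0.1, "C2": 0.8}}                # A21, A22
END = {"C1": 0.1, "C2": 0.1}                          # p(C1, E), p(C2, E)
EMIT = {"C1": {"H": 0.9, "T": 0.1},                   # p(C1, H), p(C1, T)
        "C2": {"H": 0.2, "T": 0.8}}                   # p(C2, H), p(C2, T)

def forward(observations):
    """Probability of the observations, summed over all state paths."""
    alpha = {s: INITIAL[s] * EMIT[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        alpha = {s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha[s] * END[s] for s in STATES)

def viterbi(observations):
    """Most probable state path and its probability."""
    best = {s: INITIAL[s] * EMIT[s][observations[0]] for s in STATES}
    back = []
    for symbol in observations[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this step
            prev = max(STATES, key=lambda r: best[r] * TRANS[r][s])
            pointers[s] = prev
            new_best[s] = best[prev] * TRANS[prev][s] * EMIT[s][symbol]
        back.append(pointers)
        best = new_best
    last = max(STATES, key=lambda s: best[s] * END[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last] * END[last]

tosses = "HHTT"
path, path_prob = viterbi(tosses)
print(path, path_prob)
print(forward(tosses))
```

The Viterbi probability can never exceed the forward probability, because the best path is only one of the terms in the sum over all paths.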

Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)

GRAIL gene identification program

[Figure stages: possible exons; refined exon positions; final exon candidates]

Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)

Measures of Prediction Accuracy

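Nucleotide Level

[Figure: REALITY and PREDICTION tracks compared position by position, with each nucleotide marked TP, FP, TN, or FN]

                     REALITY
                     c        nc
PREDICTION   c       TP       FP
             nc      FN       TN

(c = coding, nc = non-coding)

Sensitivity: Sn = TP / (TP + FN)

Specificity: Sp = TP / (TP + FP)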
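A minimal sketch of the nucleotide-level measures, assuming reality and prediction are given as equal-length strings with one label per nucleotide: "c" for coding and "n" for non-coding. This encoding, the function name, and the example tracks are assumptions made for illustration.

```python
def nucleotide_accuracy(reality, prediction):
    """Count TP, FP, FN, TN per nucleotide and derive Sn and Sp as defined above."""
    if len(reality) != len(prediction):
        raise ValueError("tracks must have the same length")
    tp = fp = fn = tn = 0
    for real, pred in zip(reality, prediction):
        if real == "c" and pred == "c":
            tp += 1          # coding, predicted coding
        elif real == "n" and pred == "c":
            fp += 1          # non-coding, predicted coding
        elif real == "c" and pred == "n":
            fn += 1          # coding, predicted non-coding
        else:
            tn += 1          # non-coding, predicted non-coding
    # None signals an undefined ratio (no coding positions or no predictions)
    sn = tp / (tp + fn) if tp + fn else None
    sp = tp / (tp + fp) if tp + fp else None
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "Sn": sn, "Sp": sp}

reality    = "nnccccnnnn"
prediction = "nncccnnncn"
print(nucleotide_accuracy(reality, prediction))
```

Note that Sp as defined on the slide, TP / (TP + FP), is the quantity many other fields call precision.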

Measures of Prediction Accuracy

Exon Level

[Figure: REALITY and PREDICTION tracks showing a wrong exon, a correct exon, and a missing exon]

Sensitivity: Sn = (number of correct exons) / (number of actual exons)

Specificity: Sp = (number of correct exons) / (number of predicted exons)
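The exon-level measures can be sketched in the same way. The sketch assumes exons are (start, end) coordinate pairs with inclusive ends, that a predicted exon is "correct" only when both boundaries match an actual exon exactly, and that "missing" and "wrong" exons are those with no overlap at all on the other track. These definitions and the coordinates are assumptions for illustration; the slide does not spell them out.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_accuracy(actual, predicted):
    """Exon-level sensitivity and specificity, plus missing and wrong exons."""
    # Correct exon: predicted with exactly the same start and end (assumption)
    correct = set(actual) & set(predicted)
    # Missing exon: an actual exon that no predicted exon overlaps
    missing = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    # Wrong exon: a predicted exon that overlaps no actual exon
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    sn = len(correct) / len(actual) if actual else None
    sp = len(correct) / len(predicted) if predicted else None
    return {"Sn": sn, "Sp": sp, "correct": sorted(correct),
            "missing": missing, "wrong": wrong}

# Hypothetical coordinates
actual    = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 750)]
print(exon_accuracy(actual, predicted))
```

In this example the second predicted exon overlaps a real exon but has the wrong start, so it counts as neither correct, missing, nor wrong.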

GeneMark Accuracy Evaluation

Gene Discovery Exercise

http://metalab.unc.edu/pharmacy/Bioinfo/Gene

Bibliography

http://linkage.rockefeller.edu/wli/gene/list.html

and

http://www-hto.usc.edu/software/procrustes/fans_ref/
