What are Math and Computer Science doing in Biology?

Dan Gusfield

UC Davis

March 29, 2012

Denison University

One limited perspective

Short Answer:

• Bioinformatics

• Computational Biology

• Statistical Biology

• Mathematical Biology

• …..


My focus


Computational biology: “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)

Biology, Computer Science, Math & Statistics

Computational biology, Bioinformatics

How can non-biologists, non-chemists understand or contribute to biology?

Where does our license come from?

My fear 30 years ago was that I would first need to master material like:

Citric Acid Cycle

Amylase + starch substrate

Bond representation of triplex DNA. This view is down the long axis. The “third” strand is colored.

MYOGLOBIN - An oxygen carrier in muscle

Here is another way of visualising the tertiary structure

Tertiary Structure

Spot the Tertiary folding.

Quaternary Structure

Spot the Haem group

LYSOZYME

Including the Side chains.

Can you see any active site now?

It looked very daunting!

But,

By some wonderful fact or fluke of nature, a huge simplification is possible and very productive.

Molecular information is (partially) Digital.

And, nature takes notes (leaves historical footnotes).

PRIMARY STRUCTURE

This diagram shows the primary structure of PIG INSULIN, a protein hormone as discovered by Frederick Sanger.

He was given a Nobel prize in 1958.

Primary structure is described by the sequence of Amino Acids in the chain

Hemoglobin – Primary Structure

NH2-Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-…..

beta subunit amino acid sequence

It has been amazingly productive to treat protein and DNA molecules just as text: collecting, comparing, creating molecular sequences.

No hard-core chemistry or biology - just text comparison and analysis.

Fluke of nature? An imposition of the human mind? Lucky break for us?

The first major success story:

Simian Sarcoma Virus onc Gene, v-sis, is derived from the Gene (or Genes) of a Platelet-Derived Growth Factor. R.F. Doolittle et al., Science 1983

“The transforming protein of a primate sarcoma virus and a platelet-derived growth factor are derived from the same or closely related cellular genes. This conclusion is based on the demonstration of extensive sequence similarity.”

From the abstract

Sequence similarity suggested that genes involved in cancer were functionally related to genes involved in blood platelet growth, two biological phenomena that had previously seemed unrelated.

This was a very surprising result, and a novel kind of reasoning. But,

Biology via Sequence Analysis is now completely accepted, mainstream.

Some biologists have even replaced their wet-labs with computer labs, doing biology only by sequence analysis.

“The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains …” J. Monod

“The rosetta stone of modern biology appears to be sequence comparative analysis.” T. Smith

Success stories from sequence analysis are now routine. Why?

Mostly shared history and duplication with modification, but also shared physical, chemical constraints.

“We didn't know it at the time, but we found out everythingin life is so similar, that the same genes that work in flies are the ones that work in humans.”

Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine

Take-home message


High sequence similarity implies significant functional and/or structural similarity

Ancestor

Species A

Species B

paralogs

orthologs

Can we reverse the statement?


Two sequences with high functional similarity should have similar sequences.

The success of sequence comparison and analysis, and the development of efficient DNA sequencing, has led to huge projects to capture, accumulate, store, curate, and annotate bio-molecular sequences.

GenBank, BLAST, Human Genome Project, specialized databases.

Today it has around 300 trillion bases!

Examples of large-scale sequencing projects

1,000 Genomes Project. http://www.1000genomes.org/.

BGI, 10,000 whole human genomes.

BGI, 1,000 individuals with IQ>145 versus 1,000 random individuals.

BGI, Autism Genetic Resource Exchange, 10,000 individuals.

BGI, CHOP, many childhood diseases.

Genome Institute, Washington U. St. Louis, 600 childhood cancer patients;

$65 million over three years. 150 tumor & normal cancer genome pairs.

Epitwin: TwinsUK & BGI $30 million for epigenetic differences in 5,000 twins.

Netherlands Genome Project: BGI 750 genomes (250 trios) in Dutch biobanks.

Epi4K: Duke et al. $25M to sequence 4,000 genomes for epilepsy research.

U. Michigan Cancer Center: Clinical next-gen sequencing of cancer patients.

R. Michelmore

$1,000 ($100?) human genome coming =>
• $1,000 genome for many animals and plants
• $100 genome for fungi
• $10 genome for bacteria en masse

Metagenomics: sequencing of communities
• biomes (humans = 100x more bacteria)
• novel & unculturable organisms
• characterization of diversity & unique genes

Not just genomic DNA sequence:
• DNA modifications
• epigenomics & copy number variation (CNV)
• expression analysis (RNAseq not arrays)

Enormous amounts of sequence data
• Need for major data handling capabilities
• Vital role for bioinformatics just to manage the data

In near future: DNA sequence = an inexpensive commodity generated on a variety of platforms

R. Michelmore

More recently: Metagenomics, metabolomics, proteomics, microbiomics, epigenomics, transcriptomics, methylomics….

High-throughput biology generating massive amounts of data; sometimes too large even to store.

NYT November 30, 2011:

“The Beijing Genome Center has enough sequencing capacity to sequence 2,000 human genomes per day.”

“World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high.”

OK, so sequences and sequence analysis are important, but where’s the promised computer science and math?

Simple sequence comparison, comparing new sequences against sequences in databases, has been extremely productive.

But how do we extract the most biological value from sequences?

The Larger Challenge and Opportunity: How to utilize the deluge of sequence data?

What significant patterns do you see in:

Making sense of the code


Damien Peltier

How do we know that patterns we see are meaningful? How do we know that similarities we see are based in biology and not just random happenstance?

Humans are good at seeing patterns, even in random events and data.

How do we analyze so much data?

From Mars

From the bible code

What we need:

• Clear, biologically meaningful definitions of similarity, patterns. Biological models of mutation and evolution - how sequences evolve.

• Metrics - how similar, how good the fit.

• Efficient methods to compute similarities, and find patterns, and compute the metrics.

• Efficient methods to assess the “significance” of the finds.

For those tasks, we need

• Biology - to define and model meaningful types of similarities and patterns to look for.

• Mathematics - to propose and understand the models and metrics.

• Computer Science - for efficient sequence analysis and search algorithms.

• Statistics - to measure the “significance” (deviation from random happenstance) of the finds.


“It costs more to analyze a genome than to sequence a genome.” D. Haussler

A small part of the story in greater detail

Basic problem: define and compute the similarity of two sequences


• Biological-Mathematical model: Two sequences are similar when…

• Algorithmic problem: How do you compute the sequence similarity of two sequences S1 and S2?

“All models are wrong, but some are useful.”

George Box

S1: AATCCAGTTTTACAGATCCTC length m=21

S2: AATAGTTTTACAGACTCAT length n=19

S1: -AATCCAGTTTTACAGA-TCCTC length m=23

S2: AATA--GTTTTACAGACTCAT-- length n=23

Match, Mismatch, Space, Gap

One measure of the goodness of the alignment is (# of matches) - (# of mismatches) - (# of spaces)

Alignment: Insert spaces into, or before or after, the two sequences to make them the same length.
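The goodness measure above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the use of `-` as the space character are assumptions, not something taken from the slides.

```python
def alignment_score(row1: str, row2: str, space: str = "-") -> int:
    """Return (# matches) - (# mismatches) - (# spaces) for two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column may not hold two spaces")
        if a == space or b == space:
            score -= 1  # a space opposite a character
        elif a == b:
            score += 1  # match
        else:
            score -= 1  # mismatch
    return score


# Three matches, one space, one mismatch: 3 - 1 - 1 = 1
print(alignment_score("AC-GT", "ACTGA"))
```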

Modeling sequence evolution

Given a metric to measure the goodness of any specific alignment, we define the Similarity of two sequences S1 and S2 as:

The maximum of (# matches) - (# mismatches) - (# spaces) over all possible alignments of S1 and S2.

But how do we compute similarity?

Mathematics finds a formula:

So there are a huge number of alignments.

Mathematics counts the number of alignments


Length of the sequences    Number of alignments
10                         184,756
20                         ~1.4e11
100                        ~9.0e58
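The formula itself did not survive in this transcript. The three tabulated values do coincide with the central binomial coefficient C(2n, n), so the sketch below assumes that this is the quantity being counted; the function name `count_alignments` is made up for the illustration.

```python
from math import comb


def count_alignments(n: int) -> int:
    """Assumed count for two length-n sequences: the binomial coefficient C(2n, n)."""
    return comb(2 * n, n)


for n in (10, 20, 100):
    print(n, f"{count_alignments(n):.1e}")
```

Under that assumption the count roughly quadruples each time n grows by one, which is why explicit enumeration is hopeless even for short sequences.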

There are too many alignments to try each one out, but clever, efficient algorithms, using the technique of Dynamic Programming, allow the efficient computation of similarity. (Computer Science contribution.)

For any length n, the number of operations needed to compute the similarity of two n-length sequences, via Dynamic Programming, is proportional to n squared (i.e., n^2).
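The talk does not show the recurrence, so what follows is a minimal sketch of the standard global-alignment dynamic program, scored with +1 per match and -1 per mismatch or space to agree with the metric defined earlier. The function name `similarity` is an assumption for the illustration.

```python
def similarity(s1: str, s2: str) -> int:
    """Maximum of (# matches) - (# mismatches) - (# spaces) over all alignments."""
    m, n = len(s1), len(s2)
    # prev[j]: best score for aligning the first i-1 characters of s1
    # with the first j characters of s2 (row 0: j spaces, score -j).
    prev = [-j for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [-i] + [0] * n  # first i characters of s1 against nothing
        for j in range(1, n + 1):
            diagonal = prev[j - 1] + (1 if s1[i - 1] == s2[j - 1] else -1)
            curr[j] = max(diagonal,         # align s1[i-1] with s2[j-1]
                          prev[j] - 1,      # s1[i-1] opposite a space
                          curr[j - 1] - 1)  # s2[j-1] opposite a space
        prev = curr
    return prev[n]


# The two sequences from the slides
print(similarity("AATCCAGTTTTACAGATCCTC", "AATAGTTTTACAGACTCAT"))
```

The two nested loops fill (m+1) x (n+1) table cells with constant work each, which is where the n^2 operation count comes from. Only two rows are kept at a time, because the score alone does not require the full table.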

Number of operations needed to compute Similarity

Length of the sequences    Operations using explicit enumeration    Operations using Dynamic Programming
10                         184,756                                  100
20                         ~1.4e11                                  400
100                        ~9.0e58                                  10,000

So similarity can be found quickly, but is the similarity significant?

Elegant statistical methods can be used to determine the probability that two random sequences would have that level of similarity or more.

We don’t reject the possibility that two sequences are similar due only to chance, unless the computed probability is very low.
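The slides do not say which statistical methods are meant. One simple stand-in is a shuffle test, sketched below under that assumption: keep S1 fixed, shuffle the letters of S2 many times, and report how often the shuffled pair scores at least as high as the real pair. The names `similarity` and `empirical_p_value`, the trial count, and the seed are all choices made for this illustration.

```python
import random


def similarity(s1: str, s2: str) -> int:
    """Best (# matches) - (# mismatches) - (# spaces), by dynamic programming."""
    prev = [-j for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [-i] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            diagonal = prev[j - 1] + (1 if s1[i - 1] == s2[j - 1] else -1)
            curr[j] = max(diagonal, prev[j] - 1, curr[j - 1] - 1)
        prev = curr
    return prev[-1]


def empirical_p_value(s1: str, s2: str, trials: int = 1000, seed: int = 0) -> float:
    """Estimate P(random similarity >= observed) by shuffling s2 (an assumed null model)."""
    rng = random.Random(seed)
    observed = similarity(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if similarity(s1, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


print(empirical_p_value("AATCCAGTTTTACAGATCCTC", "AATAGTTTTACAGACTCAT", trials=200))
```

Shuffling preserves the letter composition of S2, which is a crude model of "random"; a biologically motivated null model would be needed for real conclusions.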

Extensions: Finding patterns in multiple sequences

ACTAACCGGGAGATTTCAGA human

AAGTTCCGGGAGATTTCCA chimp

TAGTTATCCGGGAGATTAGA mouse

AAAACCGGTAGATTTCAGG rat

Multiple Sequence Alignment

AC--TAACCGGGAGATTTCAGA human

AAGTT--CCGGGAGATTTCC-A chimp

TAGTTATCCGGGAGATT--AGA mouse

AA---AACCGGTAGATTTCAGG rat

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Again we need a model of what multiple sequence alignments are biologically meaningful; a metric to score the goodness of a multiple alignment; an algorithm to compute multiple alignments, based on the metric; and statistical methods to evaluate the significance of an alignment.
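The talk does not name a metric for multiple alignments. One widely used choice is the sum-of-pairs score, which applies the pairwise measure to every pair of rows and adds the results. The sketch below assumes that metric, with +1 for a match, -1 for a mismatch or a character opposite a space, and 0 for two spaces in the same column; the names `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations


def pair_score(a: str, b: str, space: str = "-") -> int:
    """Score one pair of characters taken from the same alignment column."""
    if a == space and b == space:
        return 0   # two spaces: ignored (an assumed convention)
    if a == space or b == space:
        return -1  # character opposite a space
    return 1 if a == b else -1


def sum_of_pairs(rows: list[str]) -> int:
    """Sum the pairwise scores over every column and every pair of rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(a, b)
               for column in zip(*rows)
               for a, b in combinations(column, 2))


# The four-species alignment shown earlier
alignment = [
    "AC--TAACCGGGAGATTTCAGA",  # human
    "AAGTT--CCGGGAGATTTCC-A",  # chimp
    "TAGTTATCCGGGAGATT--AGA",  # mouse
    "AA---AACCGGTAGATTTCAGG",  # rat
]
print(sum_of_pairs(alignment))
```

Scoring a given multiple alignment this way is cheap; finding the alignment that maximizes the score is the hard algorithmic part.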

Summarizing

• Biology by sequence analysis opens the door widely to non-biologists.

• Models of sequence evolution and metrics used in sequence analysis are articulated by biology and Mathematics.

• Computer Science contributes efficient algorithms to do the analysis and compute the metrics.

• Statistics is needed to evaluate the significance of the computed results.

• Sequence analysis is just one of many ways that computer science and mathematics have entered biology.

In general: The computational-biology workflow


Biological Knowledge
E.g. DNA mutates

Biological model
E.g. DNA replication infidelity model, mutagens, radiation models, etc.

Mathematical model and assumptions
E.g. assumption about mutation distribution or preferential attachment

Mathematical problem
E.g. Given the mathematical model, find spots where mutation rates are high or low in a statistically significant way

Algorithmic problem
E.g. What algorithm should I develop to efficiently find hotspots?

Programming problem
E.g. Data storage, Memory, OOP and languages, optimizations, GUI

Another illustration, involving phylogenetic trees rather than sequences.

Comparing Trees: Tanglegrams

• A Tanglegram is a pair of phylogenetic trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.

• One tree represents the evolution of a set of species, and the other tree represents the evolution of a set of parasites that inhabit the species.

• A straight line connects each leaf in one tree to the leaf with the same label in the other tree.

• The number of crossing lines is a measure of the similarity of the trees.

• A small measure suggests that the species and parasites co-evolved.


Images courtesy of NTBG


So we have the algorithmic problem of finding planar layouts of the two trees, to minimize the number of crossings of the lines between the leaves. That minimum number is the metric of similarity. How do we compute it, and how can we evaluate significance?

But the trees can be redrawn to reduce the number of crossings.
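To make the crossing-minimization problem concrete, here is a brute-force sketch for very small trees. It assumes binary trees written as nested 2-tuples of leaf labels and that redrawing a tree means swapping the two children at any internal node; these representations and the function names are choices made for the illustration, not the method used in the talk. The search is exponential in the number of internal nodes, so it only demonstrates the definition.

```python
from itertools import product


def leaf_orders(tree):
    """Yield every leaf order reachable by swapping children at internal nodes."""
    if isinstance(tree, str):
        yield (tree,)
        return
    left, right = tree
    for lo, ro in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield lo + ro
        yield ro + lo


def crossings(order1, order2) -> int:
    """Count pairs of leaf-to-leaf lines that cross for two fixed layouts."""
    position = {leaf: i for i, leaf in enumerate(order2)}
    ranks = [position[leaf] for leaf in order1]
    # Two lines cross exactly when their leaves appear in opposite relative order.
    return sum(1
               for i in range(len(ranks))
               for j in range(i + 1, len(ranks))
               if ranks[i] > ranks[j])


def min_crossings(tree1, tree2) -> int:
    """Minimum crossings over all layouts of both trees (brute force)."""
    orders2 = list(leaf_orders(tree2))
    return min(crossings(o1, o2) for o1 in leaf_orders(tree1) for o2 in orders2)


hosts = (("a", "b"), ("c", "d"))
parasites = (("a", "c"), ("b", "d"))
print(crossings(("a", "b", "c", "d"), ("b", "a", "d", "c")))  # 2 as drawn
print(min_crossings(hosts, (("b", "a"), ("d", "c"))))         # 0 after redrawing
print(min_crossings(hosts, parasites))                        # 1: the trees disagree
```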

Thank you
