Opportunities of Systems Engineering/Operations Research in Bioinformatics
Hyoungtae Kim
(Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)
2
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research for Bioinformatics
• About Funding Opportunities
• Conclusions
3
What is Bioinformatics/Computational Molecular Biology?
• The application of mathematical, statistical, and computational tools to the analysis of very large biological data sets
• In most cases, it involves analyzing information stored in large databases
• Multi-disciplinary: Biology, Mathematics, Statistics, Physics, Chemistry, Computer Science, Engineering
It has not yet found its own natural home department
4
Why Bioinformatics?
• Current data analysis tools are far from efficient for analyzing vast amounts of biological data
• Biological understanding advances much more slowly than the technologies that have powered experimental discovery and data collection
• Benefits: advances in the detection and treatment of disease and in the production of genetically engineered foods
Profound impact on health and medicine
5
Three Elements of Bioinformatics Research
Significant Biological Problems
– Gene, motif, signal recognition
– Protein structure prediction
– Metabolic pathway deduction
– Etc.
Data
– Microarrays
– Mass Spectroscopy
– Etc.
Theory & Methods
– Algorithms
– Statistical Methods
– Ontologies
– Etc.
6
Prerequisites of Bioinformatics
• Basic knowledge in Molecular Biology
– Prokaryotic and Eukaryotic cells
– Genes, Codons, DNA, RNA, Central dogma of biology
– Etc.
• Computing Skills
– Programming Languages: Python, Perl, Java, etc.
– Knowledge of Relational Databases, etc.
• Other Skills
– General Statistical Knowledge
– Optimization Tools: Math Programming, Network Optimization, etc.
+ Scientific Mind
7
Various Problems in Bioinformatics
Standard Problems
• DNA and Protein Sequence Analysis
• Gene Finding and Prediction
• Etc.
Emerging Problems
• Microarray Experiment and Data Analysis
• Protein Structure Prediction
• Deduction of Metabolic Pathways
• And more…
8
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
9
Paradigm Shift in Biology
• The Human Genome Project (HGP)
– Working Draft of the human genome (2001)
– Goal of the HGP = sequencing of the human genome
– Hypothesis-driven reductionism → discovery science approach
– Drove the development of high-throughput technologies and computer applications to transmit, analyze, and model very large data sets
10
Paradigm Shift in Biology
• High-throughput Technologies
– Microarrays – allow the expression of thousands of genes to be surveyed at one time
– Protein Arrays – can examine all proteins in a cell and check whether they interact under designed conditions
– Mass Spectrometry – the basic modality is protein mass fingerprinting
11
Paradigm Shift in Biology
• Microarray Chip Technology
– Allows data collection in a high-throughput manner
– Can put all genes in a microbe on a chip
[Figure: “Raw Data” matrix of expression levels (genes × samples), converted into a genes × genes similarity matrix or distance matrix]
• Interpretation of the data is very challenging
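The step from a raw expression matrix to a genes × genes distance matrix can be sketched as follows. This is a minimal illustration, not the method used in the talk: the choice of 1 − Pearson correlation as the distance, the function names, and the toy data are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile would give
    a zero denominator).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def distance_matrix(expression):
    """Genes x genes matrix of (1 - Pearson correlation)."""
    n = len(expression)
    return [[1.0 - pearson(expression[i], expression[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical toy data: 3 genes (rows) measured in 4 samples (columns).
toy_expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with gene 0
]
toy_distances = distance_matrix(toy_expression)
```

With this distance, co-expressed genes sit close to 0 and oppositely regulated genes sit close to 2.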
12
[Figure: expression data matrix; axes: genes, patients with or without cancer]
253 × 15,154 microarray gene expression data: 162 cancer vs. 91 normal patients
13
Paradigm Shift in Biology
[Figure: a “blackbox” surrounded by its data sources: genes and proteins, gene activity data, protein-protein interaction data, protein structure data, proteomic data, metabolite data, regulatory elements]
A gigantic amount of biological information is hidden in these data and in the relationships among them!
14
Paradigm Shift in Biology
• Concept of Systems Biology
– The reductionist paradigm has been phenomenally successful in biology since the 1950s
– The genomics era has produced exhaustive lists of biological parts (i.e., genes and proteins) together with their functional characteristics
– A system-level perspective is required to make sense of how all of these individual parts come together and act collectively to perform a biological function
15
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research tools for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
16
Systems Engineering/Operations Research tools
• Three Categories
Stochastics
–Hidden Markov Models
–MCMC
–Simulation Models
–Etc.
Network & Optimization
–Combinatorial
–Integer Programming
–Dynamic Programming
–Network Optimization
–Minimum Spanning Tree
–Etc.
Statistics
–MLE
–Regression
–Sampling
–Linear Model
–Cross Validation
–Statistical Estimation and Test
–Multivariate Analysis (or ANOVA)
–Wavelet Transformation
–Bayesian Networks, Etc.
17
Systems Engineering tools for Bioinformatics
• Some Examples
– Hidden Markov Model for Gene Finding
– Dynamic Programming for Sequence Alignment
– Integer Programming for Protein Folding
– Minimum Spanning Tree approach to Clustering for Motif Identification (Xu et al., 2001)
– And many more …
18
A Significant Biological Problem
• Identification of Transcription Factor Binding Sites (Motifs)
• A gene’s transcriptional level is regulated by proteins (transcription factors) that bind to specific sites, called binding sites, in the gene’s promoter region
• The binding-site identification problem is to find short “conserved” fragments in a set of genomic sequences
Features of transcription factor binding sites
1. These short DNA fragments in the upstream regions of genes are generally very similar to each other
2. They occur at relatively high frequencies compared to other sequence fragments
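The second feature (high frequency) can be illustrated by counting how many upstream sequences contain each fixed-length fragment. This is only a sketch under simplifying assumptions: it uses exact matching, whereas real binding sites are similar rather than identical, and the function name and toy sequences are hypothetical.

```python
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        # A set, so that a k-mer repeated within one sequence counts once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Hypothetical toy upstream regions sharing the fragment TTGACA.
toy_upstream = ["ACGTTGACA", "GGTTGACAT", "TTGACACCC"]
toy_counts = kmer_sequence_counts(toy_upstream, 6)
shared = [kmer for kmer, n in toy_counts.items() if n == len(toy_upstream)]
```

Fragments present in every sequence (here only `TTGACA`) are candidate conserved sites; handling approximate matches is what makes the real problem hard.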
19
Data Collection
• Data Set (D) = set of all short DNA fragments in the upstream regions of genes
• Microarray gene expression technologies allow a simultaneous view of the transcription levels of many thousands of genes under various cellular conditions
Experiment → Find group of genes having correlated expression profiles → Upstream regions of genes:
GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG
CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT
CCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAAT
GAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGC
GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGAT
AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC
ACAGCTACTGCAAACAACCATGATGATGATACATCTCTTG
GGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGG
GGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGG
ACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCC
CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG
CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC
CACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCA
CAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCA
GCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGG
20
• Some test data sets are available on the internet or in the literature
For example:
• CRP binding sites: 18 sequences with 105 BPs
• Yeast binding sites: 8 sequences with 1000 BPs
• Human binding sites: 113 sequences with 30 BPs
21
[Figure: CRP binding sites, 18 sequences with 105 BPs; legend: A, C, T, G]
22
Theory & Methods
• Traditional approaches
– Various sampling techniques, including Gibbs sampling
– EM Algorithm
– Greedy Algorithm
– Multi-Order Markov Chain Algorithm
All of these are heuristic algorithms, so the problem remains challenging and unsolved
23
Brief Review: Minimum Spanning Tree
• Input = a graph, G = (V,E), with weighted edges
• Output = the cheapest subset of edges that keeps the graph in one connected component
• Two Popular Algorithms
– Kruskal’s Algorithm
– Prim’s Algorithm
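As a reminder of how Kruskal's algorithm works, here is a minimal sketch: sort the edges by weight and add each one unless it would close a cycle, tracked with a union-find structure. The edge-list representation and the toy graph are assumptions made for illustration.

```python
def kruskal_mst(num_vertices, edges):
    """Return (total_weight, tree_edges) for a connected weighted graph.

    edges is a list of (weight, u, v) tuples; vertices are 0..num_vertices-1.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Follow parent links to the component root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    total = 0
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two different components
            parent[root_u] = root_v
            tree.append((u, v, weight))
            total += weight
    return total, tree

# Hypothetical toy graph with 4 vertices.
toy_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
toy_total, toy_tree = kruskal_mst(4, toy_edges)
```

Prim's algorithm reaches the same total weight by growing a single tree outward from a start vertex instead of merging components.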
24
Theory & Methods
• Minimum Spanning Tree approach
• Step 1: Define a distance measure on the data set (D), and compute the distance between each pair of data points (i.e., for all A, B in D)
• The higher the sequence similarity between two fragments, the smaller the distance between their mapped positions
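Step 1 can be sketched with a simple stand-in distance. The slides do not specify the measure, so the Hamming distance (number of mismatching positions between equal-length fragments) is used here purely as an assumption; the function names and fragments are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length fragments differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_distances(fragments):
    """Symmetric matrix of distances between every pair of fragments."""
    n = len(fragments)
    return [[hamming(fragments[i], fragments[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical toy data set D of 6-base fragments.
toy_fragments = ["TTGACA", "TTGACT", "TAGACA", "CCCGGG"]
toy_matrix = pairwise_distances(toy_fragments)
```

Similar fragments (the first three) end up at distance 1 or 2 from each other, while the unrelated fragment is at distance 6 from the first.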
25
Theory & Methods
• Minimum Spanning Tree approach
• Step 2: Find the MST, T, representing D, with edge weights defined by the distance measure, and treat the task as a data clustering problem
[Figure: tree T with edges e1, e2, and e3 marked and clusters c1 to c4 outlined]
Remove three edges e1, e2, e3: 4 clusters, c1~c4, are identified
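Step 2 can be sketched as: build the MST of the complete distance graph, delete k − 1 edges, and read off the connected components as clusters. The slide only says that three edges are removed; choosing the heaviest edges is an assumption here (one of several possible clustering objectives), and all names and data are hypothetical.

```python
def prim_mst(dist):
    """MST edges (weight, u, v) of a complete graph given as a distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge from the tree to each vertex
    link = [-1] * n            # tree endpoint of that cheapest edge
    best[0] = 0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((dist[u][link[u]], link[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                link[v] = u
    return edges

def mst_clusters(dist, k):
    """Cut the k-1 heaviest MST edges; return the k components as clusters."""
    n = len(dist)
    kept = sorted(prim_mst(dist))[: n - k]  # n-1 edges minus the k-1 heaviest
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Hypothetical toy data: six points on a line, distance = absolute difference.
toy_points = [0, 1, 2, 10, 11, 20]
toy_dist = [[abs(p - q) for q in toy_points] for p in toy_points]
toy_clusters = mst_clusters(toy_dist, 3)
```

On the toy data the two long edges (weights 8 and 9) are cut, leaving the clusters {0, 1, 2}, {3, 4}, and {5} by index.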
26
Evaluation of the MST Method
• Comparison with Other Methods
• MST is based on a combinatorial approach and can identify all clusters of possible binding sites, while existing heuristic methods are likely to miss some clusters
• The implemented results are at least as good as those of other methods, and the simple structure of a tree facilitates efficient implementations of rigorous algorithms
27
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research tools for Bioinformatics
• About Funding Opportunities
• Concluding Remarks
28
Funding Overviews by Funding Institutions (Top)/Field of Research (Bot)
Total of $54.1 billion in FY2004
[Pie chart, by funding institution: HHS 52%, DoD 11%, DOE 10%, NASA 10%, NSF 7%, USDA 3%, all other agencies 7%]
[Pie chart, by field of research: slices of 54%, 17%, 10%, 7%, 5%, 2%, 2%, 3%; labels: Life science, Engineering, Physical science, Environmental science; callouts: $29.3 billion, $9.1 billion]
Percentage of Total Federal Funding: Preliminary 2004 Statistics
Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research
29
How to Search for Funding Opportunities?
• NIH Computer Retrieval of Information on Scientific Projects (CRISP)
– http://crisp.cit.nih.gov
• NIH Office of Extramural Research (OER)
– http://grants1.nih.gov
• Other Websites
– http://www.grants.gov
– http://fedgrants.gov
– http://www.nsf.gov/pubsys/ods/index.html
30
Growing Opportunities in Bioinformatics
[Bar chart: New Grants for Bioinformatics; x-axis: Fiscal Year; y-axis: # New Grants; values: 2002: 170, 2003: 191, 2004: 300]
From CRISP Search Data
31
NIH Funded Projects in 2004
• Searched all related Institutes, Centers, and States for the 2004 fiscal year
[Chart: # NIH grants in Bioinformatics: 826; Microarray: 214 grants; Systems Biology: 80 grants; Cancer: 63 grants]
From CRISP Search Data
32
NIH Funding Opportunities for 2004 ~
• 2004 Program Announcements (PA)
– Total 171 PAs
– Larger variety of topics
– Cancer most prevalent topic
– Many wish to have “multidisciplinary” outlook on topics
• 2005 Requests For Application (RFA)
– Total 68 RFAs
– Although listed for 2005, some application deadlines have passed
– 2 directly related to bioinformatics
– Cancer still most prevalent topic
From http://grants1.nih.gov
33
Outline
• Introduction to Bioinformatics
• Paradigm Shift in Biology
• Systems Engineering/Operations Research for Bioinformatics
• About Funding Opportunities
• Conclusions
34
Developing Potential Research Plans
• Two Takeaways
1. Systems engineers and operations researchers already have tools to solve various bioinformatics problems
2. Money is available to support your research
Then, what do we need to start?
Biological problems to solve
35
Concluding Remarks!!
• The main driving force of bioinformatics/computational biology is high-throughput data production
• I.E. tools, together with computing power, can play an important role in this process
• Funding opportunities in this area are very rich
36
Any Questions?
Thank you!
37
38
Level of Organization and Related Field of Study
39
Central Dogma of Biology
DNA → (Transcription) → RNA → (Translation) → Protein
Example:
DNA: TTG CTG CGG
↓ Transcription
RNA: UUG CUG CGG
↓ Translation
Protein: Leu Leu Arg
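The example above can be reproduced in a few lines. This is a minimal sketch: the codon table holds only the three codons from the slide (taken from the standard genetic code), and, following the slide, transcription is modeled as replacing T with U in the given DNA string. The function names are hypothetical.

```python
# Partial codon table: only the codons used in the slide's example.
CODON_TABLE = {"UUG": "Leu", "CUG": "Leu", "CGG": "Arg"}

def transcribe(dna):
    """DNA string to RNA string by replacing T with U, as on the slide."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Read the RNA three bases at a time and look up each codon.

    Raises KeyError for codons missing from the partial table above.
    """
    usable = len(rna) - len(rna) % 3  # ignore a trailing incomplete codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]

toy_rna = transcribe("TTGCTGCGG")
toy_protein = translate(toy_rna)
```

A full implementation would use the complete 64-codon table and handle start and stop codons.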
40
Transcription and Translation
41
Gene
• A gene is a region of DNA that controls a hereditary characteristic, usually corresponding to a single mRNA carrying the information for constructing a protein.
• The human genome contains about 30,000 genes. (February 2001)
42
Introns and Exons
43
44
Pair-wise Sequence Alignment
VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG
45
Sequence Alignment
Purposes:
Learn about evolutionary relationships
Finding genes, domains, signals …
Classify protein families (function, structure).
Identify common domains (function, structure).
46
Multiple Sequence Alignment
47
Scoring Systems for Alignment
Simple case (DNA)
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Scoring matrix:
  A G C T
A 1 0 0 0
G 0 1 0 0
C 0 0 1 0
T 0 0 0 1
Match: 1, Mismatch: 0
Score = 5
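The simple scoring scheme (match 1, mismatch 0) can be plugged into a dynamic-programming aligner, the technique listed earlier for sequence alignment. The sketch below computes only the best global alignment score in the Needleman-Wunsch style; the gap penalty of −1 is an assumption, since the slide does not give one, and the function name is hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b by dynamic programming.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    The gap penalty is an assumed value, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap   # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap   # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]

toy_same = global_alignment_score("ACGT", "ACGT")
toy_gapped = global_alignment_score("ACGT", "AGT")
```

Recovering the alignment itself, not just its score, requires a traceback through the same table.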
48
Scoring Systems for Alignment
Complex case (Protein)
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix:
    C   S   T   P   A   G   N   D  . .
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6
.
.
T:G = -2, T:T = 5
Score = 48
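Looking up pair scores in a lower-triangular substitution matrix can be sketched as below. The entries are copied as printed on the slide, which shows only eight residues, so the example cannot reproduce the full score of 48; the function names and the toy alignment are hypothetical.

```python
# Residue order and lower-triangular entries exactly as printed on the slide.
RESIDUES = "CSTPAGND"
TRIANGLE = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
    [-3, 1, 0, -2, -2, 0, 5],
    [-3, 0, -1, -1, -2, -1, 1, 6],
]

def pair_score(x, y):
    """Score for aligning residues x and y (the matrix is symmetric)."""
    i, j = RESIDUES.index(x), RESIDUES.index(y)
    if i < j:
        i, j = j, i  # only the lower triangle is stored
    return TRIANGLE[i][j]

def alignment_score(a, b):
    """Sum of pair scores over an ungapped alignment of equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))

# Hypothetical toy alignment restricted to residues present in the table.
toy_score = alignment_score("STAG", "STGG")
```

Unlike the 1/0 scheme, a matrix like this rewards chemically plausible substitutions and penalizes unlikely ones.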
49
Protein Structure
[Figure panels: Alpha-helices, Beta-sheets, Tertiary, Quaternary; dinitrogenase as an example]
50
Public Databases
• Big 3 Centers
National Center for Biotechnology Information
EBI
DNA Data Bank of Japan
51
The Human Genome
• 23 pairs of chromosomes comprise the human genome.
• The human genome contains 3,164.7 million (about 3 billion) nucleotide bases.
• The average gene consists of 3,000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
• The total number of genes is estimated at 30,000 to 40,000.
• The total number of protein variants is estimated at 1 million.
52
Various Fields in Biology
DNA: Genomics
RNA: Transcriptomics
Proteins: Proteomics
Metabolites: Metabolomics
53
Trends in Molecular Biology
[Diagram labels: Reverse Genetics; Genome; Genomics; Structural Genomics; Functional Genomics; Function (Mutation); Gene Function; Gene; Genome Project; High Throughput Tech]
54
DNA & Bases
A (Adenine), G (Guanine), C (Cytosine), T(Thymine)