Opportunities of Systems Engineering/Operations Research in Bioinformatics Hyoungtae Kim (Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)


Post on 09-Jan-2016



TRANSCRIPT

Page 1: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Opportunities of Systems Engineering/Operations Research in Bioinformatics

Hyoungtae Kim

(Joint work with Wiljeana Jackson, S.C. LIN and Dr. JC LU)

Page 2: Opportunities of Systems Engineering/Operations Research in Bioinformatics

2

Outline

• Introduction to Bioinformatics

• Paradigm Shift in Biology

• Systems Engineering/Operations Research for Bioinformatics

• About Funding Opportunities

• Conclusions

Page 3: Opportunities of Systems Engineering/Operations Research in Bioinformatics

What is Bioinformatics/Computational Molecular Biology?

• The application of mathematical, statistical, and computational tools to the analysis of very large biological data sets

• In most cases, it involves analyzing information stored in large databases

• Multi-disciplinary: Biology, Mathematics, Statistics, Physics, Chemistry, Computer Science, Engineering

It has not yet found its own natural home department

Page 4: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Why Bioinformatics?

• Current data analysis tools are far from efficient for analyzing the vast amount of biological data

• The pace of biological understanding is much slower than the pace of the technological advances that have powered experimental discovery and data collection

• Benefits: Advances in detection and treatment of disease and the production of genetically engineered foods

Profound impact on health and medicine

Page 5: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Three Elements of Bioinformatics Research

Significant Biological Problems
– Gene, motif, signal recognition
– Protein structure prediction
– Metabolic pathway deduction
– Etc.

Data
– Microarrays
– Mass Spectrometry
– Etc.

Theory & Methods
– Algorithms
– Statistical Methods
– Ontologies
– Etc.

Page 6: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Prerequisites of Bioinformatics

• Basic Knowledge in Molecular Biology
– Prokaryotic and eukaryotic cells
– Genes, codons, DNA, RNA, central dogma of biology
– Etc.

• Computing Skills
– Programming languages: Python, Perl, Java, etc.
– Knowledge of relational databases, etc.

• Other Skills
– General statistical knowledge
– Optimization tools: math programming, network optimization, etc.

+ Scientific Mind

Page 7: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Various Problems in Bioinformatics

Standard Problems
• DNA and Protein Sequence Analysis
• Gene Finding and Prediction
• Etc.

Emerging Problems
• Microarray Experiment and Data Analysis
• Protein Structure Prediction
• Deduction of Metabolic Pathways
• And more…

Page 8: Opportunities of Systems Engineering/Operations Research in Bioinformatics

8

Outline

• Introduction to Bioinformatics

• Paradigm Shift in Biology

• Systems Engineering/Operations Research for Bioinformatics

• About Funding Opportunities

• Concluding Remarks

Page 9: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Paradigm Shift in Biology

• The Human Genome Project (HGP)
– Working draft of the human genome (2001)
– Goal of the HGP = sequencing of the human genome
– Shift from hypothesis-driven reductionism to a discovery-science approach
– Drove the development of high-throughput technologies and computer applications to transmit, analyze, and model very large data sets

Page 10: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Paradigm Shift in Biology

• High-throughput Technologies
– Microarrays: allow the expression of thousands of genes to be surveyed at one time
– Protein Arrays: can examine all proteins in a cell and check whether they interact under designed conditions
– Mass Spectrometry: the basic modality is protein mass fingerprinting

Page 11: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Paradigm Shift in Biology

• Microarray Chip Technology
– Allows data collection in a high-throughput manner
– Can put all genes in a microbe on a chip

[Figure: “Raw Data” matrix of expression levels (genes × samples), converted to a genes × genes similarity matrix or distance matrix]

• Interpretation of the data is very challenging
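The step from a raw expression matrix to a gene-by-gene distance matrix can be sketched as follows. This is a minimal illustration only: the slide does not say which similarity measure is used, so Pearson correlation (with distance defined as 1 minus the correlation) is an assumption, and the toy data and the names `pearson` and `distance_matrix` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def distance_matrix(expr):
    """Genes x genes matrix, using the assumed rule distance = 1 - correlation."""
    g = len(expr)
    return [[1.0 - pearson(expr[i], expr[j]) for j in range(g)] for i in range(g)]

# Hypothetical toy data: 3 genes (rows) measured in 4 samples (columns).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # rises with gene 0
    [4.0, 3.0, 2.0, 1.0],  # falls as gene 0 rises
]
dist = distance_matrix(expr)
```

Under this choice, genes with similar expression profiles get distances near 0 and genes with opposite profiles get distances near 2.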

Page 12: Opportunities of Systems Engineering/Operations Research in Bioinformatics

[Figure: 253x15154 microarray gene expression data (axes: patients with or without cancer, genes): 162 cancer vs 91 normal patients]

Page 13: Opportunities of Systems Engineering/Operations Research in Bioinformatics

13

Paradigm Shift in Biology

Genes and proteins

• Gene activity data
• Protein-protein interaction data
• Protein structure data
• Proteomic data
• Metabolite data
• Regulatory elements

Black box

A gigantic amount of biological information is hidden in these data and in the relationships among them!

Page 14: Opportunities of Systems Engineering/Operations Research in Bioinformatics

14

Paradigm Shift in Biology

• Concept of Systems Biology

– The reductionist paradigm has been phenomenally successful in biology since the 1950s

– The genomics era has produced exhaustive lists of biological parts (i.e. genes and proteins) together with their functional characteristics

– A system-level perspective is required to make sense of how all of these individual parts emerge and act collectively to perform a biological function

Page 15: Opportunities of Systems Engineering/Operations Research in Bioinformatics

15

Outline

• Introduction to Bioinformatics

• Paradigm Shift in Biology

• Systems Engineering/Operations Research tools for Bioinformatics

• About Funding Opportunities

• Concluding Remarks

Page 16: Opportunities of Systems Engineering/Operations Research in Bioinformatics

16

Systems Engineering/Operations Research tools

• Three Categories

Stochastics

–Hidden Markov Models

–MCMC

–Simulation Models

–Etc.

Network & Optimization

–Combinatorial

–Integer Programming

–Dynamic Programming

–Network Optimization

–Minimum Spanning Tree

–Etc.

Statistics

–MLE

–Regression

–Sampling

–Linear Model

–Cross Validation

–Statistical Estimation and Test

–Multivariate Analysis (or ANOVA)

–Wavelet Transformation

–Bayesian Networks, Etc.

Page 17: Opportunities of Systems Engineering/Operations Research in Bioinformatics

17

Systems Engineering tools for Bioinformatics

• Some Examples
– Hidden Markov Model for Gene Finding
– Dynamic Programming for Sequence Alignment
– Integer Programming for Protein Folding
– Minimum Spanning Tree approach to Clustering for Motif Identification (Xu et al. (2001))
– And many more …

Page 18: Opportunities of Systems Engineering/Operations Research in Bioinformatics

18

A Significant Biological Problem

• Identification of Transcription Factor Binding Sites (Motifs)

• A gene’s transcriptional level is regulated by proteins (transcription factors), which bind to specific sites, called binding sites, in the gene’s promoter region

• The binding-site identification problem is to find short “conserved” fragments in a set of genomic sequences

Features of transcription factor binding sites

1. These short DNA fragments in the upstream regions of genes are generally very similar to each other

2. They occur with relatively high frequency compared to other sequence fragments

Page 19: Opportunities of Systems Engineering/Operations Research in Bioinformatics

19

Data Collection

• Data Set (D) = Set of all short DNA fragments in the upstream regions of genes

• Microarray gene expression technologies allow a simultaneous view of the transcription levels of many thousands of genes under various cellular conditions

Experiment → Find groups of genes having correlated expression profiles → Upstream regions of genes

GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG

CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT

CCAGCTACTCGGGAGGCTGAGGCAGAAGAATCGCTTGAAT

GAGATTGCACTGAGCTGAGATCACGCCACTGCGCTCCAGC

GAGCAAGACTCCATAAAAAAAAAAATTATAACCTAATGAT

AGGGAAGAGCTTACCACAATTGCTGGCCCATGGCCAATGC

ACAGCTACTGCAAACAACCATGATGATGATACATCTCTTG

GGTTGTTTGAGACACATTCTATGCTCCTTGATTTGATTGG

GGTTCCTTGGGGACTTGGAGGTGACGAAAGCCTCCCTGGG

ACCTTCACTTCTCTAATATCAAGCTTCAGCAACCTGCTCC

CAGGGTTGGACAGGCCCAACAACAGAGGAAATCCACAAAG

CACATACATCCACGGGGTCTAACGAGGTGAGGCCAATGAC

CACCCCAGCCAGACTCTGACTTCACTCCCGGCAGGTTTCA

CAGCAGTTGGAGCGAGCTGGCTTCTTGCGGTAGGCAGCCA

GCTCCCAATAGTCCTCGTTTCCTGGTAATCTCATGCTTGG
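As a rough sketch of how a data set of short fragments could be built from upstream sequences like those above, the code below slides a fixed-length window along each sequence and counts every fragment. The fragment length `k=6` and the helper names `fragments` and `build_dataset` are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

def fragments(seq, k):
    """All length-k windows of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dataset(sequences, k):
    """Count every length-k fragment across all upstream sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(fragments(seq, k))
    return counts

# The first two upstream sequences shown on the slide.
upstream = [
    "GATCACCTGACATCAGGAGTTCAAGACCAGCCTGCCAACG",
    "CCATCTCTACTAAAAATAGGAAATTCACCTGGTGGCAGGT",
]
D = build_dataset(upstream, k=6)

# Fragments that recur are candidate conserved sites worth a closer look.
recurring = [frag for frag, n in D.items() if n >= 2]
```

Exact counting only catches identical fragments; binding sites are usually similar rather than identical, which is why a distance measure is needed.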

Page 20: Opportunities of Systems Engineering/Operations Research in Bioinformatics

20

• Some testing data sets are available on the internet or in the literature

For example
• CRP binding sites: 18 sequences with 105 BPs
• Yeast binding sites: 8 sequences with 1000 BPs
• Human binding sites: 113 sequences with 30 BPs

Page 21: Opportunities of Systems Engineering/Operations Research in Bioinformatics

21

[Figure: CRP binding sites, 18 sequences with 105 BPs; legend: A, C, T, G]

Page 22: Opportunities of Systems Engineering/Operations Research in Bioinformatics

22

Theory & Methods

• Traditional approaches
– Various sampling techniques, including Gibbs sampling
– EM Algorithm
– Greedy Algorithm
– Multi-Order Markov Chain Algorithm

All of these are heuristic algorithms, so the problem remains challenging and unsolved

Page 23: Opportunities of Systems Engineering/Operations Research in Bioinformatics

23

Brief Review: Minimum Spanning Tree

• Input = A graph, G = (V,E), with weighted edges

• Output = the cheapest subset of edges that keeps the graph in one connected component

• Two Popular Algorithms
– Kruskal’s Algorithm
– Prim’s Algorithm
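For readers who want a concrete reminder, here is a minimal sketch of Kruskal's algorithm with a simple union-find structure. The edge-list format `(weight, u, v)`, the function name `kruskal`, and the toy graph are illustrative choices, not taken from the slides.

```python
def kruskal(n, edges):
    """Minimum spanning tree of a connected graph on vertices 0..n-1.

    edges is a list of (weight, u, v) tuples; returns the chosen edges.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):      # consider cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                   # skip edges that would close a cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Hypothetical 4-vertex example.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
mst = kruskal(4, edges)
total = sum(w for w, _, _ in mst)
```

Prim's algorithm reaches a tree of the same total weight by growing a single tree outward from a start vertex instead of merging components.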

Page 24: Opportunities of Systems Engineering/Operations Research in Bioinformatics

24

Theory & Methods

• Minimum Spanning Tree approach

• Step 1: Define a distance measure on the data set (D), and compute the distance between each pair of data points (i.e., for all A, B in D)

• The higher the sequence similarity between two fragments, the smaller the distance between their mapped positions
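One simple way to realize such a distance is the Hamming distance between equal-length fragments. This is an assumption for illustration: the slides do not state which distance measure is used, and the example fragments and function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length fragments differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x != y for x, y in zip(a, b))

def pairwise_distances(frags):
    """Symmetric matrix of distances between every pair of fragments."""
    n = len(frags)
    return [[hamming(frags[i], frags[j]) for j in range(n)] for i in range(n)]

# Hypothetical fragments: the first two are near-identical, the third is unrelated.
frags = ["TCACCT", "TCACGT", "GGGAAA"]
dist = pairwise_distances(frags)
```

Similar fragments end up close together (distance 1 here), while unrelated ones are far apart, which matches the property described in the last bullet.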

Page 25: Opportunities of Systems Engineering/Operations Research in Bioinformatics

25

• Minimum Spanning Tree approach

Theory & Methods

• Minimum Spanning Tree approach

• Step 2: Find the MST, T, representing D, with edge weights defined by the distance measure, and treat motif identification as a data clustering problem

[Figure: MST T with edges e1, e2, e3 marked; removing the three edges e1, e2, e3 identifies 4 clusters, c1~c4]
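The edge-removal idea can be sketched as below: build the MST with Prim's algorithm on a distance matrix, drop the k-1 heaviest tree edges, and read off the connected components as clusters. Cutting the heaviest edges is a simplifying assumption made for this sketch; the slides do not say how the edges to remove are chosen, and the function names and toy data are hypothetical.

```python
def prim_mst(dist):
    """MST edges (weight, u, v) of the complete graph given by a distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [(float("inf"), -1)] * n    # (cheapest link to the tree, tree endpoint)
    best[0] = (0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v][0]:
                best[v] = (dist[u][v], u)
    return edges

def mst_clusters(dist, k):
    """Remove the k-1 heaviest MST edges and return the k resulting clusters."""
    n = len(dist)
    kept = sorted(prim_mst(dist))[: n - k]   # the n-k lightest tree edges remain
    label = list(range(n))

    def find(x):
        while label[x] != x:
            x = label[x]
        return x

    for _, u, v in kept:
        label[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 1-D points standing in for mapped fragment positions.
points = [0, 1, 2, 10, 11, 20]
dist = [[abs(p - q) for q in points] for p in points]
clusters = mst_clusters(dist, 3)
```

With k = 4 this removes three edges, as in the figure on the slide.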

Page 26: Opportunities of Systems Engineering/Operations Research in Bioinformatics

26

Evaluation of the MST Method

• Comparison with Other Methods

• MST is based on a combinatorial approach, so it can identify all clusters of possible binding sites

• Existing heuristic methods, by contrast, are likely to miss some clusters

• The implemented results are at least as good as those of other methods

• The simple structure of a tree facilitates efficient implementation of rigorous algorithms

Page 27: Opportunities of Systems Engineering/Operations Research in Bioinformatics

27

Outline

• Introduction to Bioinformatics

• Paradigm Shift in Biology

• Systems Engineering/Operations Research tools for Bioinformatics

• About Funding Opportunities

• Concluding Remarks

Page 28: Opportunities of Systems Engineering/Operations Research in Bioinformatics

28

Funding Overviews by Funding Institutions (Top)/Field of Research (Bottom)

Total of $54.1 billion in FY2004

By funding institution: HHS 52%, DoD 11%, DOE 10%, NASA 10%, NSF 7%, USDA 3%, all other agencies 7%

[Chart by field of research: shares of 54%, 17%, 10%, 7%, 5%, 3%, 2%, and 2%; labels recovered: Life science, Engineering, Physical science, Environmental science; dollar amounts recovered: $29.3 billion and $9.1 billion]

Percentage of Total Federal Funding: Preliminary 2004 Statistics. Source: National Science Foundation/Division of Science Resources Statistics, Survey of Federal Funds for Research

Page 29: Opportunities of Systems Engineering/Operations Research in Bioinformatics

29

How to Search for Funding Opportunities?

• NIH Computer Retrieval of Information on Scientific Projects (CRISP)
– http://crisp.cit.nih.gov

• NIH Office of Extramural Research (OER)
– http://grants1.nih.gov

• Other Websites
– http://www.grants.gov
– http://fedgrants.gov
– http://www.nsf.gov/pubsys/ods/index.html

Page 30: Opportunities of Systems Engineering/Operations Research in Bioinformatics

30

Growing Opportunities in Bioinformatics

New Grants for Bioinformatics

[Bar chart: number of new grants by fiscal year: 2002: 170; 2003: 191; 2004: 300]

From CRISP Search Data

Page 31: Opportunities of Systems Engineering/Operations Research in Bioinformatics

31

NIH Funded Projects in 2004

• Number of NIH grants in Bioinformatics: 826
• Microarray: 214 grants
• Systems Biology: 80 grants
• Cancer: 63 grants

From CRISP Search Data

• Searched all related Institutes, Centers, and States for the 2004 fiscal year

Page 32: Opportunities of Systems Engineering/Operations Research in Bioinformatics

32

NIH Funding Opportunities for 2004 ~

• 2004 Program Announcements (PA)
– Total 171 PAs
– Larger variety of topics
– Cancer is the most prevalent topic
– Many wish to have a “multidisciplinary” outlook on topics

• 2005 Requests for Applications (RFA)
– Total 68 RFAs
– Although listed for 2005, some application deadlines have passed
– 2 directly related to bioinformatics
– Cancer is still the most prevalent topic

From http://grants1.nih.gov

Page 33: Opportunities of Systems Engineering/Operations Research in Bioinformatics

33

Outline

• Introduction to Bioinformatics

• Paradigm Shift in Biology

• Systems Engineering/Operations Research for Bioinformatics

• About Funding Opportunities

• Conclusions

Page 34: Opportunities of Systems Engineering/Operations Research in Bioinformatics

34

Developing Potential Research Plans

• Two Takeaways

1. The Systems Engineering/Operations Research community already has tools to solve various bioinformatics problems

2. Funding is available to support your research

Then, what do we need to start?

Biological problems to solve

Page 35: Opportunities of Systems Engineering/Operations Research in Bioinformatics

35

Concluding Remarks

• The main driving force of bioinformatics/computational biology is high-throughput data production

• IE tools, together with computing power, can play an important role in this process

• Funding opportunities in this area are very rich

Page 36: Opportunities of Systems Engineering/Operations Research in Bioinformatics

36

Any Questions?

Thank you!

Page 37: Opportunities of Systems Engineering/Operations Research in Bioinformatics

37

Page 38: Opportunities of Systems Engineering/Operations Research in Bioinformatics

Level of Organization and Related Field of Study

Page 39: Opportunities of Systems Engineering/Operations Research in Bioinformatics

39

Central Dogma of Biology

DNA → (Transcription) → RNA → (Translation) → Protein

Example

DNA: TTG CTG CGG
(Transcription)
RNA: UUG CUG CGG
(Translation)
Protein: Leu Leu Arg
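The example on this slide can be reproduced with a few lines of code. This is a sketch under two stated assumptions: the DNA string is read as the coding strand (so transcription is just T → U, as on the slide), and the codon table holds only the three codons shown here, not the full genetic code. The names `transcribe`, `translate`, and `CODON_TABLE` are illustrative.

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TABLE = {"UUG": "Leu", "CUG": "Leu", "CGG": "Arg"}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3          # ignore an incomplete trailing codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[c] for c in codons]   # KeyError for codons not in the table

rna = transcribe("TTGCTGCGG")
protein = translate(rna)
```

A complete implementation would use the full 64-codon table and handle start and stop codons.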

Page 40: Opportunities of Systems Engineering/Operations Research in Bioinformatics

40

Transcription and Translation

Page 41: Opportunities of Systems Engineering/Operations Research in Bioinformatics

41

Gene

• A gene is a region of DNA that controls a hereditary characteristic, usually corresponding to a single mRNA carrying the information for constructing a protein.

• The human genome contains about 30,000 genes. (February 2001)

Page 42: Opportunities of Systems Engineering/Operations Research in Bioinformatics

42

Introns and Exons

Page 43: Opportunities of Systems Engineering/Operations Research in Bioinformatics

43

Page 44: Opportunities of Systems Engineering/Operations Research in Bioinformatics

44

Pair-wise Sequence Alignment

VLSPADKTNVKAAWAKVGAHAAGHG

||| |    |   |||| |  ||||

VLSEAEWQLVLHVWAKVEADVAGHG
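Pairwise alignments like this one are typically computed with dynamic programming. The sketch below computes a global alignment score in the style of Needleman-Wunsch; the scoring values (match 1, mismatch 0, gap -1) and the function name are assumptions chosen for illustration, not parameters given in the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b via dynamic programming.

    Keeps only one row of the DP table at a time, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]   # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                    # gap in b
            left = curr[j - 1] + gap              # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

# The two sequences from the slide.
score = global_alignment_score(
    "VLSPADKTNVKAAWAKVGAHAAGHG",
    "VLSEAEWQLVLHVWAKVEADVAGHG",
)
```

Recovering the alignment itself, rather than just its score, would require keeping the full table and tracing back from the final cell.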

Page 45: Opportunities of Systems Engineering/Operations Research in Bioinformatics

45

Sequence Alignment

Purposes:

Learn about evolutionary relationships

Finding genes, domains, signals …

Classify protein families (function, structure).

Identify common domains (function, structure).

Page 46: Opportunities of Systems Engineering/Operations Research in Bioinformatics

46

Multiple Sequence Alignment

Page 47: Opportunities of Systems Engineering/Operations Research in Bioinformatics

47

Scoring Systems for Alignment

Simple case: DNA

Sequence 1: actaccagttcatttgatacttctcaaa

Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

Scoring matrix

    A  G  C  T
A   1  0  0  0
G   0  1  0  0
C   0  0  1  0
T   0  0  0  1

Match: 1, Mismatch: 0, Score = 5

Page 48: Opportunities of Systems Engineering/Operations Research in Bioinformatics

48

Scoring Systems for Alignment

Complex case: Protein

Sequence 1: PTHPLASKTQILPEDLASEDLTI

Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix

    C   S   T   P   A   G   N   D  . .
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6
.
.

T:G = -2, T:T = 5, Score = 48
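To make the scoring concrete, the sketch below stores the lower-triangular values shown on this slide and sums the scores over the aligned positions of two equal-length strings. It covers only the eight residues visible in the table (C, S, T, P, A, G, N, D), so it cannot reproduce the slide's total of 48, which needs the full matrix. The names `LOWER`, `ORDER`, `sub_score`, and `alignment_score` are illustrative.

```python
# Lower triangle of the scoring matrix, transcribed from the slide.
ORDER = "CSTPAGND"
LOWER = {
    "C": [9],
    "S": [-1, 4],
    "T": [-1, 1, 5],
    "P": [-3, -1, -1, 7],
    "A": [0, 1, 0, -1, 4],
    "G": [-3, 0, -2, -2, 0, 6],
    "N": [-3, 1, 0, -2, -2, 0, 5],
    "D": [-3, 0, -1, -1, -2, -1, 1, 6],
}

def sub_score(x, y):
    """Symmetric lookup: the matrix stores only entries with row >= column."""
    i, j = ORDER.index(x), ORDER.index(y)
    if i < j:
        i, j = j, i
    return LOWER[ORDER[i]][j]

def alignment_score(a, b):
    """Sum of substitution scores for an ungapped alignment of equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(sub_score(x, y) for x, y in zip(a, b))

# Hypothetical short example using only residues present in the table.
example = alignment_score("PTAS", "PTGS")   # P:P + T:T + A:G + S:S
```

The lookups agree with the two values quoted on the slide: T:G scores -2 and T:T scores 5.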

Page 49: Opportunities of Systems Engineering/Operations Research in Bioinformatics

49

Protein Structure

Alpha-helices Beta-sheets Tertiary

Quaternary: Dinitrogenase as an example

Page 50: Opportunities of Systems Engineering/Operations Research in Bioinformatics

50

Public Databases

• Big 3 Centers
– National Center for Biotechnology Information (NCBI)
– European Bioinformatics Institute (EBI)
– DNA Data Bank of Japan (DDBJ)

Page 51: Opportunities of Systems Engineering/Operations Research in Bioinformatics

51

The Human Genome

• 23 pairs of chromosomes comprise the human genome.

• The human genome contains 3,164.7 million (about 3 billion) nucleotide bases.

• The average gene consists of 3,000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.

• The total number of genes is estimated at 30,000 to 40,000.

• The total number of protein variants is estimated at 1 million.

Page 52: Opportunities of Systems Engineering/Operations Research in Bioinformatics

52

Various Fields in Biology

• DNA: Genomics
• RNA: Transcriptomics
• Proteins: Proteomics
• Metabolites: Metabolomics

Page 53: Opportunities of Systems Engineering/Operations Research in Bioinformatics

53

Trends in Molecular Biology

Reverse Genetics

Genome

Genomics

Structural Genomics

Functional Genomics

Function (Mutation)

Gene Function

Gene

Functional Genomics

Genome Project

High Throughput Tech

Page 54: Opportunities of Systems Engineering/Operations Research in Bioinformatics

54

DNA & Bases

A (Adenine), G (Guanine), C (Cytosine), T (Thymine)