
Page 1: Machine Learning in Bioinformatics (bigeye.au.tsinghua.edu.cn/MLA11/program_files/zhangqw.pdf, 2019-02-27)

Machine Learning in Bioinformatics

机器学习在生物信息学中的应用

Michael Q. Zhang

张奇伟

Tsinghua University

清华大学

11.5.2011

Page 3

DNA and Gene Expression (Body: factory, Cell: housing, Protein: machine, DNA: information)

Challenges: (1) Genetic code (61-to-20 mapping: 61 sense codons encode 20 amino acids); (2) Protein folding.
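The 61-to-20 mapping named in challenge (1) can be checked directly. The sketch below is not from the slides: it builds the standard genetic code from its conventional TCAG-ordered amino-acid string and counts sense codons and distinct amino acids. The helper `translate` and the example sequence are illustrative only.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code in TCAG codon order;
# '*' marks the three stop codons (TAA, TAG, TGA).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Sense codons are the codons that encode an amino acid.
sense_codons = {c: aa for c, aa in CODON_TABLE.items() if aa != "*"}
n_sense = len(sense_codons)
n_amino_acids = len(set(sense_codons.values()))


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(n_sense, "sense codons encode", n_amino_acids, "amino acids")
print(translate("ATGGCCTAA"))
```

Because several codons share an amino acid, the mapping is many-to-one: a protein sequence does not determine the DNA sequence that encoded it.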

Page 4

The Human Genome Project (1990-2005) (http://www.nhgri.nih.gov/HGP/)

Mapped Human Genes

Species               Genome size (bp)   Genes
Human                 3 x 10^9           80,000
Mouse                 3 x 10^9           80,000
Fish (Fugu)           4 x 10^9           70,000
Fly                   1.7 x 10^8         20,000
Worm                  1 x 10^8           18,000
Yeast                 1.4 x 10^7         6,000
Bacterium (E. coli)   4.7 x 10^6         5,000

[Figure: time axis from 1.5 to 0.0 byr]

“The new paradigm now emerging, is that all the genes will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical” W. Gilbert (1991)

Page 5

CF Gene Discovery (1989)

Positional cloning:

•Linkage analysis

•Physical mapping

•cDNA selection

•Sequencing

•Database search (alignment)

(CF: cystic fibrosis)

Page 6

Classification of the topics where machine learning methods are applied.

Larrañaga P et al. Brief Bioinform 2006;7:86-112. © The Author 2006. Published by Oxford University Press.

Page 7

Gene Organization and Exon Recognition

(Sequence Structure Function)

Human β-Globin

A typical vertebrate gene

[Figure: a typical vertebrate gene with exons E1-E7 and introns I1-I6; labels: DNA, mRNA, Splicing]

Some sizes of human genes

Name               Size (kb)   mRNA (kb)   Introns
β-Globin           1.5         0.6         2
Insulin            1.7         0.4         2
Protein kinase C   11          1.4         7
Albumin            25          2.1         14
Catalase           34          1.6         12
LDL receptor       45          5.5         17
Factor VIII        186         9           25
Thyroglobulin      300         8.7         36
Dystrophin         > 2000      17          > 50

Page 8

Computational gene finding methods

• Database search (similarity comparison)
    •Genomic sequences
    •mRNA sequences
    •Random ESTs (expressed sequence tags)
    •Protein sequences

• De novo prediction (pattern recognition)
    •Statistical rule-based classifications (FGENE, MZEF)
    •Linguistic expert text parsers (GeneID, GenLang)
    •Neural networks (GeneParser, GRAIL)
    •Hidden Markov models (Genie, GENSCAN)

[Diagram labels: Training set; Statistical or machine learning; Test set; Likelihood criteria; Prediction]

Page 9

Basic ideas of statistical discrimination

•Good feature variables (biological insights): Xi ~ Likelihood(the feature Xi is found in exons vs. introns | Data)

[Figure panels, shown 5' to 3': acceptor site matrix (positions -14 to 1, ending in AG) and donor site matrix (positions -3 to 6, starting with GT), each with rows T, G, C, A; exon size (log10); in-frame 6-mers <log(f_ex/f_in)> for frames fr1, fr2, fr3]

•Optimal classification surface (Bayesian inference): minimize False_positives & False_negatives
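One common way to turn a site matrix into a feature variable is a log-odds position weight matrix. The sketch below is an illustration, not the method behind the slide's figure: the five donor-site strings are hypothetical training examples (3 exon bases followed by 6 intron bases), the background is assumed uniform, and `build_pwm` and `score` are names chosen here.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Per-position log2 odds of each base versus a uniform background."""
    pwm = []
    for pos in range(len(sites[0])):
        # Pseudocounts keep unseen bases from producing log(0).
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm


def score(pwm, seq):
    """Sum of log-odds over positions; higher means more site-like."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical donor sites covering positions -3..-1 (exon) and 1..6 (intron).
donor_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "AAGGTAAGT", "GAGGTGAGT"]
pwm = build_pwm(donor_sites)

site_like = score(pwm, "CAGGTAAGT")
not_site_like = score(pwm, "CATCCTTAC")
print(round(site_like, 2), round(not_site_like, 2))
```

A classifier then calls a candidate a site when its score exceeds a threshold. Raising the threshold trades false positives for false negatives, which is the trade-off the slide asks to minimize.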

Page 10

Open problems in gene finding

•Terminal exons

•Very short exons

•Noncoding exons

•Exon assembly

•A+T rich genes

•Multiple genes

•Alternative exons


Example: alternative splicing of the fly sex determination gene

Page 11

HMM models
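The slide's figure is not recoverable, so the following is only a toy illustration of how an HMM labels a sequence. It assumes a two-state model (exon, intron) with made-up start, transition, and emission probabilities, and decodes the most probable state path with the Viterbi algorithm. Real gene finders use far richer state spaces.

```python
import math

STATES = ("exon", "intron")
# All probabilities below are hypothetical and chosen only for illustration.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},       # GC-rich
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][base])
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("GCGCGCGCGCATATATATAT")
print(path)
```

With these parameters the GC-rich half is labeled exon and the AT-rich half intron, because one state switch costs less than emitting ten poorly matching bases.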

Page 12

Gene regulation and functional genomics (Sequence Structure Function)

[Diagram labels: DNA, mRNA, Protein, Function; Genome, mRNAs, Proteins, Function]

The fundamental strategy in a functional genomics approach is to expand the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion.

Molecular reductionistic view: "One gene one function"

Network integrative view: "Gene function is distributed across a parallel processing network"

[Diagram labels: Gene expression; Inheritance; Single cell; Whole body]

Page 13

Yeast cell cycle and promoter architecture

[Figure: yeast cell-cycle regulatory diagram. Phases: Early G1, Late G1, S phase, G2, Mitosis, M/G1. Kinases: Cln3/Cdc28 kinase, Cln/Cdc28, Clb1-4/Cdc28 kinase. Regulators: MBF, SBF, Mcm1/SFF?, ECB?, Swi5, Ace2. Genes: CLN1,2; HO; CLB5,6 & S phase proteins; CLB1,2; SWI5; ACE2; FAR1?; CDC47; CTS1; EGT1,2; SIC1; PCL9; CLN3; SWI4; CDC6; CDC46. Sequence motifs: CACGAAA, ACGCGT. Other labels: Cell size; Budding; Nucleus; Clb proteolysis in M phase]

Page 14

Spellman et al. 1998

Identification of 800 cell cycle genes in yeast

Page 15

Basic idea of finding cis-regulatory elements

1. Array: Clustering co-regulated genes by expression profiles, search for common motif(s)

2. ChIP-chip: Cross-linking TF to chromatin in vivo, shear + IP, intergenic chip hybridization
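The "search for common motif(s)" step in approach 1 can be sketched as k-mer enrichment: count which short words occur in the promoters of a co-regulated cluster more often than in background promoters. This is a minimal illustration, not the procedure used in the talk. All sequences are hypothetical, and the cluster promoters were built to contain the ACGCGT element.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(cluster, background, k=6, top=3):
    """Rank k-mers by (fraction of cluster) minus (fraction of background)."""
    in_cluster = kmer_presence(cluster, k)
    in_background = kmer_presence(background, k)
    scores = {
        kmer: n / len(cluster) - in_background[kmer] / len(background)
        for kmer, n in in_cluster.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical promoter fragments of co-expressed genes and of other genes.
cluster = ["TTGACGCGTAA", "CCACGCGTTGA", "GACGCGTCCAT"]
background = ["TTGATCCGTAA", "CCAGTTGATGA", "GGCATCCATTA"]

top_hits = enriched_kmers(cluster, background)
print(top_hits)
```

Published tools replace this difference of fractions with a proper statistical test and allow degenerate positions, but the clustering-then-counting logic is the same.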

Page 17

MARS: Multivariate Adaptive Regression Splines (J. Friedman, Annals of Stat. 1991)

Multivariate extension of one-dimensional splines

Linear splines are made of piecewise linear functions

Basis functions [$h_1(x)$]:

$$(x-\xi)_+ = \max(0,\, x-\xi) = \begin{cases} x-\xi, & x > \xi \\ 0, & \text{otherwise} \end{cases}$$

$$(\xi-x)_+ = \max(0,\, \xi-x) = \begin{cases} 0, & x > \xi \\ \xi-x, & \text{otherwise} \end{cases}$$

[Figure: the two basis functions $(x-\xi)_+$ and $(\xi-x)_+$ plotted against $x$, with the knot at $\xi$]

$$\log(E_g/E_{gc}) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\{n_{\mu g}\})$$
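The two basis functions and the additive form of the model can be written out in a few lines. This sketch is deliberately simplified: it covers a single predictor and a fixed, hypothetical set of coefficients and knots. Full MARS also multiplies basis functions together and selects terms by a forward and backward search, neither of which is shown here.

```python
def hinge_pos(x, knot):
    """(x - knot)_+ : zero up to the knot, linear above it."""
    return max(0.0, x - knot)


def hinge_neg(x, knot):
    """(knot - x)_+ : linear below the knot, zero above it."""
    return max(0.0, knot - x)


def mars_predict(x, intercept, terms):
    """Evaluate beta_0 + sum_m beta_m * h_m(x).

    terms is a list of (beta_m, basis_function, knot) tuples.
    """
    return intercept + sum(beta * h(x, knot) for beta, h, knot in terms)


# Hypothetical fitted model: beta_0 = 0.5 and one mirrored pair at knot 2.
terms = [(1.5, hinge_pos, 2.0), (-0.5, hinge_neg, 2.0)]

for x in (0.0, 2.0, 3.0):
    print(x, mars_predict(x, 0.5, terms))
```

The prediction is continuous and piecewise linear, with the slope changing from 0.5 to 1.5 at the knot.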

Page 18

Spacing, orientation

Page 19

Regulatory network dynamics and metabolic pathway reconstruction

Metabolic reprogramming inferred from global analysis of changes in yeast gene expression during the diauxic shift (from anaerobic growth to aerobic respiration upon depletion of glucose). (DeRisi et al., 1997)

A sector of an imaginary developmental gene regulatory network. (A) Three transcription factors and their target developmental genes; (B) A single relationship extracted from the network. (Davidson, 1997)

Page 23

PMNs physical module.

Novershtern N et al. Bioinformatics 2011;27:i177-i185. © The Author(s) 2011. Published by Oxford University Press.

Page 24

Systems & Synthetic Biology

Integrative technology - What we understand is determined by what we can model and build

Science: What’s in nature?
- Curiosity is the driving force of discovery (intellectual nature)

Engineering: What can be better than nature? (human extension)
- Necessity is the mother of invention (human need)

Richard Feynman:

Page 25

Molecular computing and adaptive learning: copy, mutate, recombine, etc.

Information gain -> complexity

Genomes encode survival learning experience of species evolutionary history

Page 26

Bacterial chemotaxis: a simple biological phenomenon

[Figure: run and tumble (from the Berg lab)]

The cell switches between tumble and run by comparing the current environment with an internal “memory” (knowledge) of the past.

Its motion is thereby biased towards more favorable environments (food, temperature, etc.)

Yuhai Tu
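The run-and-tumble strategy can be illustrated with a one-dimensional random walk. This is a toy model under stated assumptions, not a model from the talk: the attractant profile is linear, "memory" is just the previous concentration reading, and the two tumble probabilities are arbitrary illustrative values.

```python
import random


def concentration(x):
    """Hypothetical attractant profile: increases linearly with position."""
    return float(x)


def run_and_tumble(steps, p_tumble_improving=0.1, p_tumble_worsening=0.5, seed=0):
    """Return the final position of a 1-D run-and-tumble walker."""
    rng = random.Random(seed)
    x = 0
    direction = rng.choice((-1, 1))
    memory = concentration(x)  # remembered past environment
    for _ in range(steps):
        x += direction  # run one step in the current direction
        current = concentration(x)
        improving = current > memory
        memory = current
        # Tumble rarely when conditions improve, often when they worsen.
        p_tumble = p_tumble_improving if improving else p_tumble_worsening
        if rng.random() < p_tumble:
            direction = rng.choice((-1, 1))  # tumble: pick a random direction
    return x


biased = run_and_tumble(1000)
unbiased = run_and_tumble(1000, p_tumble_improving=0.3, p_tumble_worsening=0.3)
print(biased, unbiased)
```

The walker never senses the gradient direction. Net drift up the gradient arises only because favorable runs last longer, which is the comparison with memory that the slide describes.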

Page 27

E. coli chemotaxis: the H-atom of systems biology

[Figure labels: 2 μm; ~10 μm; Flagella; Receptors; Flagellar Motor; Nucleoid region; Ribosomes]

Reproduces every ~20 minutes under normal conditions

How cells 1) receive signal; 2) process signal and 3) react to signal (Yuhai Tu)

Page 28

Energy cost of information processing (Rolf Landauer, Nature 1988)

[Figure annotations: Where we are now; Where E. coli has been]

Perhaps we can learn something from biology about efficient computing?

• The energy dissipation sets the limit of the performance.
• The design of the network is to approach this thermodynamic limit.

Yuhai Tu

The Landauer-Von Neumann limit of computing
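For scale, the Landauer limit puts the minimum energy dissipated per erased bit at $k_B T \ln 2$. The short calculation below evaluates it at an assumed temperature of 300 K. The temperature and the function name are choices made for this example, not values from the slide.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23   # Boltzmann constant (exact SI value)
ELECTRON_VOLT_J = 1.602176634e-19  # one electronvolt in joules (exact SI value)


def landauer_limit_joules(temperature_kelvin):
    """Minimum energy dissipated when erasing one bit: k_B * T * ln 2."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


e_bit = landauer_limit_joules(300.0)  # assumed temperature of 300 K
print(f"{e_bit:.3e} J per bit")
print(f"{e_bit / ELECTRON_VOLT_J:.4f} eV per bit")
```

At 300 K this comes to about 2.87 x 10^-21 J, or roughly 0.018 eV, per bit.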

Page 29

Network Topologies for Perfect Adaptation

Comment by Artyukhin et al. 2009.

Electric vs. cellular circuit designs? Complexity?

16038

Page 30

Learning segmentation network (Francois et al. 2007)

Require for fitness:

1. Assign a number to any collection of selector genes Ci(x)

2. Max. diversity... many selector genes expressed in embryo

3. Min. diversity for given x... (unique fate)

Page 31

Evolution of two segmentation networks in a static morphogen gradient. Two different evolutionary pathways are displayed (A–C, D–G). Successive stages run from left to right and show both the network and the spatial profile of the proteins. Note that the first two stages are common to both evolutionary trajectories. The morphogen G is depicted in black, the protein E defining the segments is in blue, and the repressors R1 and R2 are in red (dashed lines represent the last to be added). Concentrations have been normalized by their maximum value for plotting purposes. See the text for details.

Convergent evolution

Page 32

Neuron (a learning machine; Brain, cluster of learning machines)

Thomas Serre

Page 33

Optical stimulation of anterior/posterior mechanosensory neurons or forward/backward command interneurons

Page 34

Social and solitary feeding in C. elegans (de Bono and Bargmann, 1998)

a | solitary N2 worms and b | social AB1 strains feeding on a lawn of Escherichia coli. N2 worms are evenly distributed throughout the bacterial lawn, whereas AB1 worms clump along the borders of the bacterial lawn and feed in groups. The difference in behaviour can be accounted for by a change in a single base pair.

Page 35

Complexity of the human brain

Josh Huang

Page 37

Josh Huang

Page 38

Cognitive Computing

Page 39

Thank you!