Machine Learning in Bioinformatics
Michael Q. Zhang
张奇伟
Tsinghua University
11.5.2011
DNA and Gene Expression (Body-factory,Cell-housing,Protein-machine,DNA-information)
Challenges: (1) Genetic code (61 to 20 mapping);
(2) Protein folding.
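The "61 to 20 mapping" can be checked directly from the standard genetic code. The sketch below is an illustration only: the names `CODON_TABLE`, `degeneracy` and `translate` are hypothetical helpers, not part of the talk. It builds the standard codon table and counts how many sense codons map to each amino acid.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in the order TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Sense codons are all codons that are not stop codons ("*").
SENSE_CODONS = {c: aa for c, aa in CODON_TABLE.items() if aa != "*"}


def degeneracy():
    """Number of sense codons that encode each amino acid."""
    return Counter(SENSE_CODONS.values())


def translate(dna):
    """Translate an in-frame DNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(SENSE_CODONS), "sense codons ->", len(degeneracy()), "amino acids")
    print(degeneracy().most_common(3))
```

Because the mapping is many-to-one, a protein sequence cannot be uniquely back-translated to DNA, which is one reason the code is listed as a challenge.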
The Human Genome Project (1990-2005) (http://www.nhgri.nih.gov/HGP/)
Mapped Human Genes
Species                Genome size (bp)   Genes
Human                  3×10^9             80,000
Mouse                  3×10^9             80,000
Fish (Fugu)            4×10^9             70,000
Fly                    1.7×10^8           20,000
Worm                   1×10^8             18,000
Yeast                  1.4×10^7           6,000
Bacterium (E. coli)    4.7×10^6           5,000
“The new paradigm now emerging, is that all the genes
will be known (in the sense of being resident in
databases available electronically), and that the starting
point of a biological investigation will be theoretical”
W. Gilbert (1991)
CF (Cystic Fibrosis) Gene Discovery (1989)
Positional cloning:
•Linkage analysis
•Physical mapping
•cDNA selection
•Sequencing
•Database search (alignment)
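The last step, database search by alignment, can be sketched with a local alignment score. The example below uses the Smith-Waterman recurrence with a linear gap penalty; this is a swapped-in textbook technique, not necessarily the tool used in the CF study, and the scoring values, the function names and the toy database are all assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Rank database entries (name -> sequence) by score against the query."""
    hits = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    toy_db = {"hit": "GGACGTGG", "miss": "CCCCCCCC"}  # made-up sequences
    for score, name in search("TTACGTTT", toy_db):
        print(name, score)
```

Production search tools use heuristics and statistical significance estimates on top of this kind of scoring; the sketch only shows the ranking idea.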
Classification of the topics where machine learning methods are applied.
(Larrañaga P et al. Brief Bioinform 2006;7:86-112. © The Author 2006. Published by Oxford University Press.)
Gene Organization and Exon Recognition
(Sequence → Structure → Function)
[Figure: a typical vertebrate gene (example: human β-globin). DNA with exons E1-E7 and introns I1-I6; splicing yields the mRNA]
Some sizes of human genes
Name                Size (kb)   mRNA (kb)   Introns
β-Globin            1.5         0.6         2
Insulin             1.7         0.4         2
Protein kinase C    11          1.4         7
Albumin             25          2.1         14
Catalase            34          1.6         12
LDL receptor        45          5.5         17
Factor VIII         186         9           25
Thyroglobulin       300         8.7         36
Dystrophin          > 2000      17          > 50
Computational gene finding methods
• Database search (similarity comparison)
   •Genomic sequences
   •mRNA sequences
   •Random ESTs (expressed sequence tags)
   •Protein sequences
• De novo prediction (pattern recognition)
   •Statistical rule-based classifications (FGENE, MZEF)
   •Linguistic expert text parsers (GeneID, GenLang)
   •Neural networks (GeneParser, GRAIL)
   •Hidden Markov models (Genie, GENSCAN)
[Diagram: Training set → Statistical or machine learning → Likelihood criteria; Test set → Prediction]
Basic ideas of statistical discrimination
•Good feature variables (biological insights): Xi
~ Likelihood( the feature Xi is found in exons vs. introns | Data)
[Figure: example feature variables. Acceptor site matrix (AG, positions -14 to 1) and donor site matrix (GT, positions -3 to 6), with rows A, C, G, T; exon size (log10); in-frame 6-mers, <log(f_ex/f_in)>, for frames fr1, fr2, fr3]
•Optimal classification surface (Bayesian inference)
Minimize: false positives and false negatives
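The in-frame k-mer feature, log(f_ex/f_in), can be sketched as a log-odds score. This is a minimal illustration under stated assumptions: it uses 3-mers instead of the 6-mers on the slide so that a toy training set is enough, the training sequences are made up, the pseudocount is an arbitrary choice, and the function names are hypothetical. A threshold of 0 corresponds to equal priors and equal costs for the two error types.

```python
import math
from collections import Counter


def kmer_counts(seqs, k):
    """Count in-frame k-mers (frame 0, step k) over a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - k + 1, k):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_table(exons, introns, k=3, pseudo=1.0):
    """Return a function kmer -> log(f_ex / f_in) with pseudocount smoothing."""
    ex, intr = kmer_counts(exons, k), kmer_counts(introns, k)
    vocab = 4 ** k
    ex_total = sum(ex.values()) + pseudo * vocab
    in_total = sum(intr.values()) + pseudo * vocab

    def log_odds(kmer):
        f_ex = (ex[kmer] + pseudo) / ex_total
        f_in = (intr[kmer] + pseudo) / in_total
        return math.log(f_ex / f_in)

    return log_odds


def score(seq, log_odds, k=3):
    """Average log-odds over the in-frame k-mers of seq; > 0 favors 'exon'."""
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    if not kmers:
        return 0.0
    return sum(log_odds(km) for km in kmers) / len(kmers)


# Made-up training data, for illustration only.
TOY_EXONS = ["ATGGCCGCCGGC", "GCCGGCGCCATG"]
TOY_INTRONS = ["TTTTATTTAATT", "AATTTTTATTTA"]

if __name__ == "__main__":
    lo = log_odds_table(TOY_EXONS, TOY_INTRONS)
    for candidate in ("GCCGGC", "TTTAAT"):
        s = score(candidate, lo)
        print(candidate, round(s, 3), "exon" if s > 0 else "intron")
```

Moving the threshold away from 0 trades false positives against false negatives, which is the minimization the slide refers to.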
Open problems in gene finding
•Terminal exons
•Very short exons
•Noncoding exons
•Exon assembly
•A+T rich genes
•Multiple genes
•Alternative exons
[Figure: example of alternative splicing of the fly sex determination gene]
HMM models
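As a minimal sketch of how an HMM can segment a sequence, the example below decodes a two-state model (E for exon-like, I for intron-like) with the Viterbi algorithm. All probabilities are invented for illustration (GC-rich emissions for E, AT-rich for I); real gene finders such as those named earlier use far richer state structures.

```python
import math

# Toy two-state HMM; every number here is an assumption for illustration.
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Most probable state path for seq, computed in log space."""
    if not seq:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("GCGCGCATATAT"))
```

Several of the open problems listed above (very short exons, exon assembly, multiple genes) correspond to cases where such a simple state structure is too coarse.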
Gene regulation and functional genomics (Sequence → Structure → Function)
DNA → mRNA → Protein → Function
Genome → mRNAs → Proteins → Function
The fundamental strategy in a functional genomics approach is to expand
the scope of biological investigation from studying single genes or
proteins to studying all genes or proteins at once in a systematic fashion.
Molecular reductionistic view:
"One gene one function"
Network integrative view:
"Gene function is distributed across a
parallel processing network"
Gene expression
Inheritance Single cell Whole body
Yeast cell cycle and promoter architecture
[Figure: the yeast cell cycle (Early G1, Late G1, S phase, G2, Mitosis, M/G1) and promoter architecture. Labels: cell size; Cln3/Cdc28 kinase; Cln/Cdc28; Clb1-4/Cdc28 kinase; MBF, SBF; Mcm1/SFF?; ECB?; Swi5, Ace2; nucleus; budding; S phase proteins; Clb proteolysis in M phase. Sequence motifs: CACGAAA, ACGCGT. Genes: CLN1,2, HO, CLB5,6, CLB1,2, SWI5, ACE2, FAR1?, CDC47, CTS1, EGT1,2, SIC1, PCL9, CLN3, SWI4, CDC6, CDC46]
Spellman et al. 1998
Identification of 800 cell cycle genes in yeast
Basic idea of finding cis-regulatory elements
1. Array: Clustering co-regulated genes by expression profiles, then searching for common motif(s)
2. ChIP-chip: Cross-linking TF to chromatin in vivo, shear + IP, intergenic chip hybridization
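The array-based idea (group genes with similar expression profiles, then look for a shared promoter motif) can be sketched as follows. Everything here is a simplified assumption: the profiles and promoter sequences are made up, co-regulation is approximated by Pearson correlation against a seed gene, and a "common motif" is any 6-mer present in every cluster promoter and in no promoter outside the cluster. Real analyses use proper clustering and statistical enrichment tests.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def coexpressed(profiles, seed, threshold=0.9):
    """Genes whose profile correlates with the seed gene above the threshold."""
    return sorted(g for g, p in profiles.items()
                  if pearson(profiles[seed], p) >= threshold)


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def common_motifs(promoters, cluster, k=6):
    """k-mers found in every cluster promoter and in no other promoter."""
    shared = set.intersection(*(kmers(promoters[g], k) for g in cluster))
    others = set()
    for g, seq in promoters.items():
        if g not in cluster:
            others |= kmers(seq, k)
    return sorted(shared - others)


# Made-up toy data for illustration.
PROFILES = {"g1": [0, 1, 2, 1, 0], "g2": [0, 2, 4, 2, 0],
            "g3": [2, 1, 0, 1, 2], "g4": [1, 0, 1, 0, 1]}
PROMOTERS = {"g1": "TTACGCGTAA", "g2": "GGACGCGTCC",
             "g3": "TTTTTTTTTT", "g4": "GATTACAGAT"}

if __name__ == "__main__":
    cluster = coexpressed(PROFILES, "g1")
    print(cluster, common_motifs(PROMOTERS, cluster))
```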
Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle. Simon et al. Cell 106:697-708 (2001).
MARS: Multivariate Adaptive Regression
Splines (J. Friedman, Annals of Stat. 1991)
Multivariate extension of one dimensional splines
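Linear splines are made of piecewise linear functions
Basis functions [h_1(x)], with knot ξ:
$(x-\xi)_+ = \max(0, x-\xi)$, which equals $x-\xi$ for $x > \xi$ and $0$ otherwise
$(\xi-x)_+ = \max(0, \xi-x)$, which equals $0$ for $x > \xi$ and $\xi-x$ otherwise
[Figure: the two basis functions (x-ξ)+ and (ξ-x)+ plotted against x, with the knot at ξ]
$\log(E_g/E_{gc}) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(\{n_{\mu g}\})$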
Spacing, orientation
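The basis functions and the additive model above can be sketched in a few lines. This is only an evaluation sketch under assumptions: n_μg is read here as the count of motif μ in the promoter of gene g, the coefficients and knots are invented, and the forward/backward term selection that MARS uses for fitting is omitted entirely. Products of basis functions on different motifs stand in for interaction terms.

```python
def hinge(x, knot, sign=1):
    """sign=+1 gives (x - knot)+ ; sign=-1 gives (knot - x)+ ."""
    return max(0.0, sign * (x - knot))


def basis(counts, factors):
    """One basis function h_m: a product of hinges over (motif, knot, sign)."""
    value = 1.0
    for motif, knot, sign in factors:
        value *= hinge(counts.get(motif, 0), knot, sign)
    return value


def predict(beta0, terms, counts):
    """Evaluate beta0 + sum_m beta_m * h_m(counts) for one gene."""
    return beta0 + sum(beta * basis(counts, factors) for beta, factors in terms)


# Hypothetical fitted model: one single-motif term and one two-motif interaction.
TOY_TERMS = [
    (0.5, [("A", 0, 1)]),
    (0.25, [("A", 0, 1), ("B", 1, 1)]),
]

if __name__ == "__main__":
    gene_counts = {"A": 2, "B": 3}  # made-up motif counts for one gene
    print(predict(0.1, TOY_TERMS, gene_counts))
```

Because each hinge is zero on one side of its knot, an interaction term contributes only when all of its motifs exceed their knots, which is how such a model can express cooperative effects.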
Regulatory network dynamics and metabolic pathway reconstruction
Metabolic reprogramming inferred from global analysis of changes in yeast gene expression under the diauxic shift (from anaerobic growth to aerobic respiration upon depletion of glucose). (DeRisi et al. 1997)
A sector of an imaginary developmental gene regulatory network. (A) Three transcription factors and their target developmental genes; (B) A single relationship extracted from the network. (Davidson, 1997)
PMNs physical module.
(Novershtern N et al. Bioinformatics 2011;27:i177-i185. © The Author(s) 2011. Published by Oxford University Press.)
Systems & Synthetic Biology
Integrative technology - What we understand is determined by what we can model and build
Science: What’s in nature?
- Curiosity is the driving force of discovery (intellectual nature)
Engineering: What can be better than nature? (human extension)
- Necessity is the mother of invention (human need)
Richard Feynman:
Molecular computing and adaptive learning: copy, mutate, recombine, etc.
Information gain -> complexity
Genomes encode the survival learning experience of a species' evolutionary history
Bacterial chemotaxis: a simple biological phenomenon
(From the Berg lab)
[Figure: run and tumble]
Cells switch between tumble and run by comparing the current environment with an internal “memory” (knowledge) of the past; their motion is thereby biased towards more favorable environments (food, temperature, etc.)
Yuhai Tu
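The run-and-tumble strategy can be sketched as a one-dimensional biased random walk. This is a toy model under assumptions, not the signaling network studied in the literature: the attractant concentration is taken to be linear in position, the "memory" is just the concentration at the previous step, and the tumble probabilities are invented. A cell tumbles (picks a new random direction) less often when conditions have improved.

```python
import random


def concentration(x):
    """Assumed attractant profile: increases linearly with position."""
    return x


def run_and_tumble(steps=500, p_tumble_better=0.1, p_tumble_worse=0.5, rng=None):
    """Final position of one cell after the given number of unit steps."""
    rng = rng or random.Random()
    x = 0
    direction = rng.choice((-1, 1))
    remembered = concentration(x)  # internal "memory" of the recent past
    for _ in range(steps):
        current = concentration(x)
        # Compare the present with memory: tumble less if things improved.
        p = p_tumble_better if current > remembered else p_tumble_worse
        if rng.random() < p:
            direction = rng.choice((-1, 1))  # tumble: random new direction
        remembered = current
        x += direction  # run: one step in the current direction
    return x


def mean_final_position(n_cells=200, seed=0, **kwargs):
    """Average final position over a population, with a fixed seed."""
    rng = random.Random(seed)
    finals = [run_and_tumble(rng=rng, **kwargs) for _ in range(n_cells)]
    return sum(finals) / n_cells


if __name__ == "__main__":
    print("with memory:", mean_final_position())
    print("no bias:    ", mean_final_position(p_tumble_better=0.3,
                                              p_tumble_worse=0.3))
```

With equal tumble probabilities the population does not drift on average; the drift up the gradient comes entirely from comparing the present with the remembered past.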
E. coli chemotaxis: the H-atom of systems biology
[Figure: E. coli cell, about 2 μm long, with labels: flagella (~10 μm), receptors, flagellar motor, nucleoid region, ribosomes]
Reproduces every ~20 minutes under normal conditions
How cells 1) receive a signal; 2) process the signal; and 3) react to the signal
Yuhai Tu
Energy cost of information processing
(Rolf Landauer, Nature 1988)
Where we are now
Where E. coli has been
Perhaps we can learn something from biology about efficient computing?
• The energy dissipation sets the limit of the performance.
• The design of the network is to approach this thermodynamic limit.
Yuhai Tu
The Landauer-Von Neumann limit of computing
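For scale, Landauer's bound says that erasing one bit of information dissipates at least k_B T ln 2 of energy. The short calculation below evaluates it; the temperature of 300 K is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math

K_B = 1.380649e-23        # Boltzmann constant in J/K (exact SI value)
EV = 1.602176634e-19      # one electronvolt in joules (exact SI value)


def landauer_limit(temperature_kelvin):
    """Minimum energy in joules dissipated when erasing one bit."""
    return K_B * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    e_bit = landauer_limit(300.0)  # assumed temperature
    print(f"{e_bit:.3e} J per bit")
    print(f"{e_bit / EV:.4f} eV per bit")
```

At 300 K this is roughly 2.9 × 10^-21 J per bit, and the bound scales linearly with temperature.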
Network Topologies for Perfect Adaptation. Comment by Artyukhin et al. 2009.
Electric vs. cellular circuit designs? Complexity?
16038
Learning segmentation network (Francois et al. 2007)
Required for fitness:
1. Assign a number to any collection of selector genes Ci(x)
2. Max. diversity... many selector genes expressed in embryo
3. Min. diversity for given x... (unique fate)
Evolution of two segmentation networks in a static morphogen gradient. Two different evolutionary pathways are displayed
(A–C, D–G). Successive stages run from left to right and show both the network and the spatial profile of the proteins. Note
that the first two stages are common to both evolutionary trajectories. The morphogen G is depicted in black, the protein E defining the segments is in blue, and the repressors R1 and R2 are in red (dashed lines represent the last to be added).
Concentrations have been normalized by their maximum value for plotting purposes. See the text for details.
Convergent evolution
Neuron (a learning machine; Brain, cluster of learning machines)
Thomas Serre
Optical stimulation of anterior/posterior mechanosensory neurons or forward/backward command interneurons
Social and solitary feeding in C. elegans. de Bono and Bargmann (1998)
a | solitary N2 worms and b | social AB1 strains feeding on a
lawn of Escherichia coli. N2 worms are evenly distributed
throughout the bacterial lawn, whereas AB1 worms clump along the borders of
the bacterial lawn and feed in groups. The difference in
behaviour can be accounted for by a change in a single
base pair.
Complexity of the human brain
Josh Huang
Cognitive Computing
Thank you!