machine learning for biological data mining · 2015-12-16 · machine learning techniques for bio...
TRANSCRIPT
1
Machine Learning for Biological Data Mining
Byoung-Tak Zhang
E-mail: [email protected]
School of Computer Science and Engineering
Seoul National University
This material is available at http://scai.snu.ac.kr/~btzhang/
2
Outline
- Basics in Molecular Biology
- Current Issues and Applications
- Machine Learning for Bioinformatics
- DNA Chip Data Mining
- Graphical Models for Gene Expression Analysis
- Summary
3
What is Bioinformatics?
- Bio: molecular biology
- Informatics: computer science
- Bioinformatics: solving problems arising from biology using methodology from computer science
- Bioinformatics vs. Computational Biology
4
Basics in Molecular Biology
5
DNA Double Helix
6
DNA Base-pairs
7
DNA
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGCTATTGTACCCGTTGCTTCGGATCTCTTGGGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCACCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGGCCGCCGGGGGCACTGTCTGAAAGCTCGGCCGCC
8
Human Genome Sequenced!
- “The most wondrous map ever produced by humankind”
- Scientists jointly announced that they had obtained a near-complete set of the biochemical instructions for human life.
- “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”
9
Some Facts
- DNA differs between humans by 0.2% (1 in 500 bases).
- Human DNA is 98% identical to that of chimpanzees.
- 97% of the DNA in the human genome has no known function.
- There are 3 × 10^9 letters in the DNA code in every cell in your body.
- There are 10^14 cells in the body.
- 12,000 letters of DNA are decoded by the Human Genome Project every second.
10
Molecular Biology: Flow of Information
DNA -> RNA -> Protein -> Function
[Figure: a DNA base sequence shown next to the protein it encodes (Phe, Cys, Lys, Cys, Asp, Cys, Arg, Ser, Ala, Leu)]
11
Using the Genome
- Redundancy in genetic information
- Single genes have multiple functions
- Genes are 1-D, gene products are 3-D
[Figure: Genetic Information, Molecular Structure, Biochemical Function, Biological Behavior; related fields: Molecular Dynamics, Biophysics, Biochemistry]
12
Gene Structure
[Figure: DNA ("gene" search), RNA, protein sequence, folded protein; the steps between them are labeled "compute", "compute", and "how?"]
13
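DNA (gene) -> RNA -> Protein
[Figure: gene structure. On the DNA: control statement, TATA box (start), gene, termination (stop). Transcription (RNA polymerase) produces the mRNA, which carries a 5' UTR, a ribosome-binding control statement, and a 3' UTR. Translation (ribosome) produces the protein.]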
14
Numbers of Genes
- Humans: 25,000 - 40,000
- C. elegans (worm): 19,000
- S. cerevisiae (yeast): 6,000
- Tuberculosis microbe: 4,000
15
Genetic Code: 3 bases = 1 amino acid

First      Second position             Third
position   T      C      A      G      position
(5' end)                               (3' end)
T          Phe    Ser    Tyr    Cys    T
           Phe    Ser    Tyr    Cys    C
           Leu    Ser    STOP   STOP   A
           Leu    Ser    STOP   Trp    G
C          Leu    Pro    His    Arg    T
           Leu    Pro    His    Arg    C
           Leu    Pro    Gln    Arg    A
           Leu    Pro    Gln    Arg    G
A          Ile    Thr    Asn    Ser    T
           Ile    Thr    Asn    Ser    C
           Ile    Thr    Lys    Arg    A
           Met    Thr    Lys    Arg    G
G          Val    Ala    Asp    Gly    T
           Val    Ala    Asp    Gly    C
           Val    Ala    Glu    Gly    A
           Val    Ala    Glu    Gly    G
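As an illustration of how the genetic code maps three bases to one amino acid, here is a minimal Python sketch of a codon lookup. The names `CODON_TABLE` and `translate` are ours, not from the slides, and the input is assumed to be a coding strand read in frame from its first base.

```python
BASES = "TCAG"
# One-letter amino acid codes in codon-table order (first, second, third
# position each running over T, C, A, G); "*" marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA strand codon by codon, stopping at a STOP codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGTTTTGCAAATGA"))  # MFCK
```

The example reads ATG (Met), TTT (Phe), TGC (Cys), AAA (Lys) and stops at TGA.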
16
Nucleotide Sequence
aacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc
gcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgcaacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc gcgggcccgc cgcttgtcggccgccggggg ggcgcctctg
cgcttgtcgg ccgccgggggccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg aacctgcggaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcg
SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other
17
Protein (Amino Acid) Sequence
CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAKKELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSALEAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPSEIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC
18
Protein Structure
19
Human Genetic Variations (Single Nucleotide Polymorphisms)
- SNPs: “genetic individuality”
- ~1/1000 bases variable (between 2 humans)
- Make us more/less susceptible to diseases
- May influence the effect of drug treatments
TTTGCTCCGTTTTCA
TTTGCTCYGTTTTCA
TTTGCTCTGTTTTCA
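The three sequences above differ at a single base, with Y being the IUPAC code for "C or T". A minimal Python sketch of locating such a difference between two already aligned sequences follows; the function names are ours and the code assumes equal-length input without gaps.

```python
# IUPAC codes used here: Y stands for C or T, R for A or G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG"}


def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch between two
    equal-length, already aligned sequences (0-based positions)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def matches_consensus(seq, consensus):
    """Check a sequence against a consensus that may contain IUPAC codes."""
    return len(seq) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(seq, consensus))


allele_1 = "TTTGCTCCGTTTTCA"
allele_2 = "TTTGCTCTGTTTTCA"
print(find_snps(allele_1, allele_2))  # [(7, 'C', 'T')]
```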
20
SNP (Single Nucleotide Polymorphism)
- Finding single nucleotide changes at specific regions of genes
- Diagnosis of hereditary diseases
- Personalized drugs
- Finding more effective drugs and treatments
21
Human Individuality
22
Flood of Data! (SWISS-PROT)
[Figure: number of sequences in SWISS-PROT (x 1000, axis from 0 to 80) by year of release, 1988 to 1996]
23
How Can We Analyze the Flood of Data?
- Data: don't just store it, analyze it! By comparing sequences, one can find out about things like
  - ancestors of organisms
  - phylogenetic trees
  - protein structures
  - protein function
24
Bioinformatics Is About:
- Elicitation of DNA sequences from genetic material
- Sequence annotation (e.g. with information from experiments)
- Understanding the control of gene expression (i.e. under what circumstances proteins are produced from DNA)
- The relationship between the amino acid sequence of proteins and their structure
25
Aim of Research in Bioinformatics
- Understand the functioning of living things, to “improve the quality of life”
  - Drug design
  - Identification of genetic risk factors
  - Gene therapy
  - Genetic modification of food crops and animals, etc.
26
Current Issues and Applications
27
The Central Dogma of Information Flow in Biology
The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by the DNA sequence: DNA is transcribed into RNA, which is translated into protein.
DNA -> RNA -> Protein -> Function
28
3 Main Classes of Problem Areas
- Central dogma related: sequence, structure, or function
- Data related: storage, retrieval, and analysis (exponential growth of knowledge in molecular biology)
- Simulation of biological processes: protein folding (molecular dynamics) or metabolic pathways
29
Topics in Bioinformatics
- Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
- Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
- Expression analysis
  - Gene expression analysis
  - Gene clustering
- Pathway analysis
  - Metabolic pathway
  - Regulatory networks
30
Sequence Analysis
Finding information and patterns in DNA and protein data
- Finding evolutionary relationships
- Finding coding regions of genomic sequences
- Translating DNA to protein
- Finding regulatory regions
- Assembling genome sequences
31
Structure Analysis
- The amino acid sequence of a protein determines its 3D conformation
MNIHRSTPITIARYGRSRNKTQDFEELSSIRSAEPSQSFSPNLGSPSPPETPNLSHCVSCIGKYLLLEPLEGDHVFRAVHLHSGEELVCKVFDISCYQESLAPCF
Sequence -> Structure -> Function
32
Gene Expression Analysis
Nature Genetics 21, 10 (1999)
33
Pathway Analysis
- One of the declarative ways of representing biological knowledge
[Figure: metabolic pathway]
34
Applications of Bioinformatics
- Drug design
- Identification of genetic risk factors
- Gene therapy
- Genetic modification of food crops and animals
- Biological warfare, crime, etc.
- Personal medicine?
- E-Doctor?
35
Bioinformatics as Information Technology
[Figure: bioinformatics and related areas of information technology: information retrieval, databases (GenBank, SWISS-PROT), hardware (supercomputing), agents (information filtering, monitoring agent), machine learning (pattern recognition, clustering, rule discovery), algorithms (sequence alignment), and biomedical text analysis]
36
Bioinformatics on the Web
[Figure: the experimental process (sample, array, hybridization, scanner); data management (relational database, web interface); data analysis and interpretation (image analysis, results and summaries, links to other information resources, download of data to other applications)]
37
Bioinformatics and Artificial Intelligence
- A new application domain of AI and machine learning
  - Data mining and knowledge discovery
  - Information filtering for scientists
  - Intelligent agents for customized data service
- A new basis for developing new AI technologies
  - “Biointelligence”
  - Biomolecular (DNA) computing
  - Molecular evolutionary algorithms
38
Machine Learning for Bioinformatics
39
Machine Learning Techniques for BioData Mining
- Sequence Alignment
  - Simulated Annealing
  - Genetic Algorithms
- Structure and Function Prediction
  - Hidden Markov Models
  - Multilayer Perceptrons
  - Decision Trees
- Molecular Clustering and Classification
  - Support Vector Machines
  - Nearest Neighbor Algorithms
- Expression (DNA Chip Data) Analysis
  - Self-Organizing Maps
  - Bayesian Networks
40
Problems in Biological Science and Machine Learning Methods

Sequence alignment (homology search)
- Problems: pairwise sequence alignment, database search for similar sequences, multiple sequence alignment, phylogenetic tree reconstruction, protein 3D structure alignment
- Methods: optimization algorithms
  - Dynamic programming
  - Simulated annealing
  - Genetic algorithms
  - Neural networks
  - Hidden Markov models

Structure/function prediction
- Problems: RNA secondary structure prediction, RNA 3D structure prediction, protein 3D structure prediction
- Problems: motif extraction, functional site prediction, cellular localization prediction, coding region prediction, transmembrane segment prediction, protein secondary structure prediction, protein 3D structure prediction
- Methods: pattern recognition and learning algorithms
  - Discriminant analysis
  - Hierarchical neural networks
  - Hidden Markov models
  - Formal grammar
41
Problems in Biological Science and Machine Learning Methods (continued)

Molecular Clustering/Classification
- Problems: superfamily classification, ortholog/paralog grouping of genes, 3D fold classification
- Methods: clustering algorithms
  - Hierarchical cluster analysis
  - Kohonen neural networks
- Methods: classification algorithms
  - Bayesian Networks
  - Neural Networks
  - Support Vector Machines
  - Decision Trees

Expression (DNA Chip Data) Analysis
- Methods
  - Support Vector Machines
  - Bayesian Networks
  - Latent Variable Models
  - Generative Topographic Mapping
42
Sequence Alignment
43
Sequence Alignment (Similarity Search)
- Basic operation
  - Comparison against each of the known examples stored in a primary database to detect any similarity that can be used for further reasoning
- Example
ATTGGCCA
| | | |
A - GG - A
4+2*10=24

ATTGGCCA
| |
AGG - - A
6+1*10=16

ATTGGCCA
| |
AG - - - A
6+1*10=16

ATTGGCCA
A - - - GA
6+1*10=16
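Pairwise alignment is commonly scored with dynamic programming. The sketch below is a minimal Needleman-Wunsch global alignment score in Python. Its scoring scheme (match +1, mismatch -1, linear gap penalty -2) is an assumption chosen for illustration and is not the scoring used in the example above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best global alignment score by dynamic programming.

    Only the previous row of the score matrix is kept, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("AC", "AGC"))  # 0: two matches and one gap
```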
44
Simulated Annealing for Multiple Sequence Alignment
- The Metropolis Monte Carlo procedure is repeated at gradually decreasing temperature for energy minimization.
[Figure: energy E plotted over x (e.g. all possible alignments)]
$$\Delta E = E(x') - E(x)$$
$$p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{otherwise} \end{cases}$$
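The Metropolis acceptance rule and the cooling loop can be sketched as follows. This is a generic, minimal Python version under our own assumptions: a geometric cooling schedule (`cooling=0.95`), user-supplied `energy` and `neighbor` functions, and a toy integer problem in place of an alignment search space.

```python
import math
import random


def acceptance_probability(delta_e, temperature):
    """Metropolis criterion: always accept improvements, otherwise accept
    with probability exp(-delta_e / temperature)."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)


def simulated_annealing(x0, energy, neighbor, t0=1.0, cooling=0.95,
                        steps=500, seed=0):
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(x, rng)
        delta_e = energy(candidate) - e
        if rng.random() < acceptance_probability(delta_e, temperature):
            x, e = candidate, e + delta_e
            if e < best_e:
                best_x, best_e = x, e
        temperature *= cooling  # gradually decrease the temperature
    return best_x, best_e


# Toy usage: minimize (x - 3)^2 over the integers, starting from x = 20.
best_x, best_e = simulated_annealing(
    x0=20,
    energy=lambda x: (x - 3) ** 2,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
)
```

For multiple sequence alignment, `x` would be an alignment, `neighbor` a small random edit of it, and `energy` an alignment cost.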
45
Genetic Algorithms: Representation
- For sequence assembly
  - The sorted order representation
- Operators
  - A simple swap operation as the mutation operator
  - Permutation crossover
  - Transposition operator
  - Inversion operator

Individual: 1110|0010|1001|1011|0011|0011
Decimal number: 14 2 9 6 11 3
Sort order: 5 1 3 2 4
Intermediate layout: 2 4 3 5 1
Final layout: 3 5 1 2 4
[Figure labels: starting position; 1 2 3 4 5]
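A minimal Python sketch of one way to read a sorted-order individual: the bit string is cut into fixed-width keys, and the layout order is obtained by sorting the keys. The 4-bit key width, the function names, and this decoding rule itself are assumptions made for illustration; the slide does not spell out the decoding.

```python
import random


def decode_sorted_order(bits, key_width=4):
    """Split a bit string into fixed-width keys and return (keys, order),
    where order lists the key indices sorted by key value."""
    keys = [int(bits[i:i + key_width], 2)
            for i in range(0, len(bits), key_width)]
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return keys, order


def swap_mutation(bits, key_width, rng):
    """Mutation operator: swap two randomly chosen keys of the individual."""
    keys = [bits[i:i + key_width] for i in range(0, len(bits), key_width)]
    i, j = rng.sample(range(len(keys)), 2)
    keys[i], keys[j] = keys[j], keys[i]
    return "".join(keys)


keys, order = decode_sorted_order("111000101001")
print(keys, order)  # [14, 2, 9] [1, 2, 0]
```

Because any bit string decodes to a valid ordering, operators such as the swap above never produce an illegal layout.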
46
Structure and Function Prediction
- Hidden Markov Models for Protein Modeling
- Multilayer Perceptrons for Internal Exon Prediction: GRAIL
- Decision Trees for Gene Finding
47
Structure and Function Prediction
- Protein structure prediction
- Gene finding and gene prediction
- Protein modeling
48
Hidden Markov Models for Protein Modeling
- Alphabet of 20 letters (the 20 amino acids)
- m0: start state, m5: end state, mk: match states
- ik: insertion states, dk: deletion states
- T(s2|s1): transition probabilities
- P(x|mk): letter-generating probabilities (x: a letter, i.e. an amino acid)
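To show how transition probabilities T(s2|s1) and letter-generating probabilities P(x|s) combine into a sequence probability, here is a minimal Python sketch of the forward algorithm. It is a simplified, generic HMM in which every state emits a letter, so it does not model the silent deletion states of a full profile HMM. The two-state model and all of its numbers are hypothetical.

```python
def forward_probability(sequence, states, start, trans, emit):
    """Forward algorithm: total probability of a non-empty `sequence` under
    an HMM whose states all emit one letter."""
    alpha = {s: start[s] * emit[s].get(sequence[0], 0.0) for s in states}
    for letter in sequence[1:]:
        alpha = {
            s2: emit[s2].get(letter, 0.0)
                * sum(alpha[s1] * trans[s1].get(s2, 0.0) for s1 in states)
            for s2 in states
        }
    return sum(alpha.values())


# Hypothetical toy model: one match state and one insertion state over a
# two-letter alphabet (a real protein model would use 20 letters).
states = ("m1", "i1")
start = {"m1": 1.0, "i1": 0.0}
trans = {"m1": {"m1": 0.5, "i1": 0.5}, "i1": {"m1": 0.5, "i1": 0.5}}
emit = {"m1": {"A": 0.9, "G": 0.1}, "i1": {"A": 0.5, "G": 0.5}}
print(forward_probability("AG", states, start, trans, emit))
```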
49
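Multilayer Perceptrons for Internal Exon Prediction: GRAIL
[Figure: a multilayer perceptron that scores a sequence. Inputs: coding potential value, GC composition, length, donor, acceptor, intron vocabulary. Output: a discrete exon score between 0 and 1 along the bases of the sequence.]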
50
Coding and Non-coding Regions
[Figure: a gene on the DNA, made up of a regulatory region and a protein coding region that runs from the start codon (labeled AUG) to the stop codon (labeled TAA), flanked by non-coding regions; DNA -> RNA -> Protein]
51
Decision Trees for Gene Finding
- MORGAN: a decision tree system for gene finding (finding coding and non-coding regions, exon finding)
- Features tested in the tree
  - donor: donor site score
  - d+a: donor and acceptor site score
  - hex: in-frame hexamer frequency
  - asym: Fickett's position asymmetry statistic
[Figure: decision tree with yes/no branches. Tests: d+a<3.4?, d+a<1.3?, hex<16.3?, donor<0.0?, d+a<5.3?, hex<0.1?, hex<-5.6?, asym<4.6?. Leaf counts: (6,560), (18,160), (5,21), (23,16), (9,49), (142,73), (24,13), (1,5), (737,50). Additional label: "by Markov Chains".]
52
Molecular Clustering and Classification
53
Molecular Clustering and Classification
- Clustering (unsupervised learning)
  - Hierarchical cluster analysis
  - Kohonen neural networks
- Classification (supervised learning)
  - Hidden Markov models
  - Neural networks
  - Bayesian networks
  - Support vector machines
  - Nearest neighbor algorithm
  - Decision trees
54
Support Vector Machines for Functional Classification of Genes (1)
- Classification of microarray gene expression data [M. Brown, et al., PNAS, 97(1):262-267]
  - Classifying gene functional class using gene expression data from DNA microarray hybridization experiments
  - Dataset: 2467 genes, 79 experiments (2467 x 79 matrix)
- Functional classes defined from MYGD
  1. Tricarboxylic-acid pathway
  2. Respiration chain complexes
  3. Cytoplasmic ribosomal proteins
  4. Proteasome
  5. Histones
  6. Helix-turn-helix
[Figure: expression profiles of the 121 cytoplasmic ribosomal proteins. (Similarity can be found!)]
55
Support Vector Machines for Functional Classification of Genes (2)
[Table: comparison of error rates for various classification methods on 4 classes]
- Cost = FP + 2FN
- FLD: Fisher's linear discriminant
- C4.5 and MOC1: decision trees
- Parzen: Parzen windows (a similar nonparametric density estimation technique)
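The cost used in the comparison, Cost = FP + 2FN, counts a missed class member twice as heavily as a false alarm. A minimal Python sketch, assuming binary labels where 1 marks membership in the functional class (the function name and the example labels are ours):

```python
def classification_cost(y_true, y_pred, fn_weight=2):
    """Cost = FP + fn_weight * FN for binary labels (1 = class member)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp + fn_weight * fn


# One false positive and one false negative: 1 + 2 * 1 = 3.
print(classification_cost([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))  # 3
```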
56
Nearest Neighbor Algorithms for 3D Protein Classification
- 3D shape similarity model by shape histograms [Ankerst, 1999]
- d(i,j): distance of the cells that correspond to the bins i, j. The cell distance is calculated from the difference of the shell radii and the angles between the sectors.
57
DNA Microarray Data Mining
58
Gene Expression Analysis
- DNA Microarray
  - Hybridize thousands of DNA samples of each gene on a glass slide with special cDNA samples (made under two different conditions: background condition, experimental condition)
  - Ratio of a gene: ratio of the two expression levels of a gene
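A small Python sketch of the expression ratio of a gene. Taking the base-2 logarithm of the ratio is a common convention that we assume here; the slide itself only defines the ratio. The intensity values are made up.

```python
import math


def log2_ratio(experimental, background):
    """log2 of the expression ratio; a positive value means higher
    expression under the experimental condition."""
    if experimental <= 0 or background <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(experimental / background)


print(log2_ratio(800.0, 200.0))  # 2.0  (four-fold increase)
print(log2_ratio(100.0, 400.0))  # -2.0 (four-fold decrease)
```

On the log scale, an n-fold increase and an n-fold decrease are symmetric around zero.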
59
Spotted Microarray Chip
Nature Genetics 21, 15 (1999)
60
DNA Chip Technology
- Pin microarray methods
- Inkjet methods
- Photolithography methods
- Electronic array methods
61
Application of DNA Microarrays
- Applications
  - Gene discovery: gene/mutated gene
    • Growth, behavior, homeostasis …
  - Disease diagnosis
  - Drug discovery: pharmacogenomics
  - Toxicological research: toxicogenomics
62
Computational Tools for DNA Microarrays
- Major components
  - LIMS (laboratory information management system)
  - Image processing
  - Data mining
  - Experiment design
- Major trends for the data analysis
  - Statistical methods
  - Machine learning
  - Reverse engineering
63
Diversity of Gene Expression
- Tissues
  - muscle, skin, liver, brain, …
- Developmental stages
  - embryonic, stem, adult cells, …
- Clinical symptoms
  - liver cell, hepatoma, hepatitis, regeneration, …
- Environmental factors
  - synthetic/natural chemicals, virus, …
64
Analysis of DNA Microarray Data: Previous Work
- Characteristics of data
  - Analysis of expression ratio based on each sample
  - Analysis of time-variant data
- Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Orly Alter et al., 2000]
- Classification
  - Support vector machines [Brown et al., 2000]
- Gene identification
  - Information theory [Stefanie et al., 2000]
- Gene modeling
  - Bayesian networks [Friedman et al., 2000]
CAMDA-2000 Data Sets
66
CAMDA-2000 Data Sets
- CAMDA
  - Critical Assessment of Techniques for Microarray Data Mining
  - Purpose: evaluate the data-mining techniques available to the microarray community.
- Data Set 1
  - Identification of cell cycle-regulated genes
  - Yeast Saccharomyces cerevisiae by microarray hybridization
  - Gene expression data with 6,278 genes
- Data Set 2
  - Cancer class discovery and prediction by gene expression monitoring
  - Two types of cancers: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
  - Gene expression data with 7,129 genes
67
CAMDA-2000 Data Set 1: Identification of Cell Cycle-regulated Genes of the Yeast by Microarray Hybridization
- Data given: expression levels of 6,278 genes measured over time
  - α factor-based synchronization: every 7 minutes from 0 to 119 (18)
  - Cdc15-based synchronization: every 10 minutes from 10 to 290 (24)
  - Cdc28-based synchronization: every 10 minutes from 0 to 160 (17)
  - Elutriation (size-based synchronization): every 30 minutes from 0 to 390 (14)
- Among the 6,278 genes
  - 104 genes are known to be cell-cycle regulated
    • classified into: M/G1 boundary (19), late G1 SCB regulated (14), late G1 MCB regulated (39), S phase (8), S/G2 phase (9), G2/M phase (15)
  - 250 cell cycle-regulated genes might exist
68
CAMDA-2000 Data Set 1: Characteristics of Data (α Factor-based Synchronization)
[Figure: one panel of expression profiles per class of cell cycle-regulated genes]
- M/G1 boundary
- Late G1 SCB regulated
- Late G1 MCB regulated
- S phase
- S/G2 phase
- G2/M phase
69
CAMDA-2000 Data Set 2: Cancer Class Discovery and Prediction by Gene Expression Monitoring
- Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - The datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
- Graphical models used:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping
70
[Figure: <Protocol> and <Experimental Method>]
71
Graphical Models for DNA Chip Data Mining
72
Classes of Graphical Models
Graphical Models
- Undirected
  - Boltzmann Machines
  - Markov Random Fields
- Directed
  - Bayesian Networks
  - Latent Variable Models
  - Hidden Markov Models
  - Generative Topographic Mapping
  - Non-negative Matrix Factorization
73
Gene Expression Analysis
- Latent Variable Models
- Bayesian Networks
- Non-negative Matrix Factorization
- Generative Topographic Mapping
74
Latent Variable Models Probabilistic Clustering - Model
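- g_i: ith gene
- z_k: kth cluster
- t_j: jth time
- p(g_i|z_k): generating probability of the ith gene given the kth cluster
- v_k = p(t|z_k): prototype of the kth cluster

$$p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}, \qquad \text{similarity}(x_i, v_k) = \sum_j x_{ij} v_{kj}$$

$$x_{ij} \leftarrow \frac{x_{ij}}{\sum_{j'} x_{ij'}}$$

Objective function (maximized by EM):
$$f(g, t, z) = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}} \log\Big( \sum_k p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k) \Big)$$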
75
Latent Variable Models Probabilistic Clustering – Learning
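initialize p(z_k), p(g_i|z_k), p(t_j|z_k) for i = 1..N, j = 1..M, k = 1..K such that
$$\sum_{k=1}^{K} p(z_k) = 1, \qquad \sum_{j=1}^{M} p(t_j \mid z_k) = 1, \qquad \sum_{i=1}^{N} p(g_i \mid z_k) = 1$$
while (max_iteration not reached) do EM adaptation
// E-step
$$p(z_k \mid g_i, t_j) = \frac{p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)}{\sum_{k'} p(z_{k'})\, p(g_i \mid z_{k'})\, p(t_j \mid z_{k'})}$$
// M-step
$$p(g_i \mid z_k) = \frac{\sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_j \frac{g_{i'j}}{\sum_{j'} g_{i'j'}}\, p(z_k \mid g_{i'}, t_j)}$$
$$p(t_j \mid z_k) = \frac{\sum_i \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j)}{\sum_i \sum_{j''} \frac{g_{ij''}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_{j''})}$$
$$p(z_k) = \frac{1}{R} \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}\, p(z_k \mid g_i, t_j), \qquad R = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}}$$
end while
// prototypes
p(t|z_k), k = 1..K, are the prototypes for each cluster
// clustering
given a gene g_i, the cluster of g_i is the k for which p(g_i ∈ z_k) is the biggest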
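The EM procedure for probabilistic clustering can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions: positive expression values, random initialization, a fixed iteration count, and cluster membership taken as the k maximizing p(z_k) p(g_i|z_k). The function names and the toy data are ours.

```python
import math
import random


def em_cluster(G, K, iters=100, seed=0):
    """EM for p(g_i, t_j) = sum_k p(z_k) p(g_i|z_k) p(t_j|z_k).
    G is an N x M list of lists with positive entries."""
    rng = random.Random(seed)
    N, M = len(G), len(G[0])
    w = [[g / sum(row) for g in row] for row in G]  # g_ij / sum_j' g_ij'

    def random_distribution(n):
        v = [rng.random() + 0.1 for _ in range(n)]
        return [x / sum(v) for x in v]

    pz = random_distribution(K)
    pg = [random_distribution(N) for _ in range(K)]  # pg[k][i] = p(g_i|z_k)
    pt = [random_distribution(M) for _ in range(K)]  # pt[k][j] = p(t_j|z_k)
    history = []
    for _ in range(iters):
        # E-step: post[i][j][k] = p(z_k | g_i, t_j); also record the objective.
        post, objective = [], 0.0
        for i in range(N):
            post.append([])
            for j in range(M):
                joint = [pz[k] * pg[k][i] * pt[k][j] for k in range(K)]
                total = sum(joint)
                objective += w[i][j] * math.log(total)
                post[i].append([x / total for x in joint])
        history.append(objective)
        # M-step: re-estimate the distributions from the weighted posteriors.
        for k in range(K):
            by_gene = [sum(w[i][j] * post[i][j][k] for j in range(M))
                       for i in range(N)]
            by_time = [sum(w[i][j] * post[i][j][k] for i in range(N))
                       for j in range(M)]
            mass = sum(by_gene)
            pg[k] = [x / mass for x in by_gene]
            pt[k] = [x / mass for x in by_time]
            pz[k] = mass / N  # R = sum_ij w[i][j] = N
    return pz, pg, pt, history


def cluster_of(i, pz, pg):
    """Assumed membership rule: the k maximizing p(z_k) p(g_i|z_k)."""
    return max(range(len(pz)), key=lambda k: pz[k] * pg[k][i])


G = [[9.0, 1.0, 1.0], [8.0, 1.0, 1.0], [1.0, 1.0, 9.0], [1.0, 2.0, 8.0]]
pz, pg, pt, history = em_cluster(G, K=2, iters=50)
```

After training, `pt[k]` plays the role of the prototype of cluster k and `history` holds the objective value recorded at each iteration.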
76
Latent Variable Models Probabilistic Clustering – Learning Curve
[Figure: learning curve. x-axis: number of iterations (1 to about 1,684); y-axis: objective function value (about -784.5 to -781.5)]
77
Latent Variable Models Probabilistic Clustering – Result
- Prototypes
[Figure: six bar charts, one per cluster prototype, over the 18 time points (x-axis: 1 to 18; y-axis: 0 to 0.35)]
- Clustering: given a gene g_i, the cluster of g_i is k, where k = argmax_m p(g_i ∈ z_m)
78
Bayesian Networks
79
Bayesian Networks for Gene Expression Analysis (1) Feature Selection
- Part of the data
  - There are 7,129 integer-valued attributes in all.
- Attribute selection
  - The 10 attributes with the highest P values are selected.
$$P = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}$$
- Attribute value categorization
$$\text{categorized\_value} = \left\lfloor \frac{\text{value}}{\max(\text{attribute}) - \min(\text{attribute})} \times 10 \right\rfloor$$
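The attribute selection and categorization steps can be sketched in Python as follows. The sketch assumes two classes labeled 0 and 1, uses the population standard deviation (the slide does not say which estimator was used), and reads the categorization as dividing each value by one tenth of the attribute's range. All function names are ours.

```python
import math
import statistics


def p_metric(class1_values, class2_values):
    """P = |mu1 - mu2| / (sigma1 + sigma2) for one attribute (gene)."""
    mu1, mu2 = statistics.mean(class1_values), statistics.mean(class2_values)
    sigma1 = statistics.pstdev(class1_values)  # population std: an assumption
    sigma2 = statistics.pstdev(class2_values)
    return abs(mu1 - mu2) / (sigma1 + sigma2)


def select_top_attributes(columns, labels, n=10):
    """Rank attributes by P and return the indices of the n best.
    columns[a] holds the values of attribute a for every sample."""
    scores = []
    for a, values in enumerate(columns):
        c1 = [v for v, y in zip(values, labels) if y == 0]
        c2 = [v for v, y in zip(values, labels) if y == 1]
        scores.append((p_metric(c1, c2), a))
    return [a for _, a in sorted(scores, reverse=True)[:n]]


def categorize(value, attribute_values, bins=10):
    """Discretize a value using the attribute's range (max - min)."""
    span = max(attribute_values) - min(attribute_values)
    return math.floor(bins * value / span)


columns = [[1, 2, 3, 7, 8, 9], [5, 4, 6, 5, 6, 4]]
labels = [0, 0, 0, 1, 1, 1]
print(select_top_attributes(columns, labels, n=1))  # [0]
```

Attributes with zero spread in both classes, or with a zero range, would need special handling that this sketch omits.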
80
Bayesian Networks for Gene Expression Analysis (2)
- Learning
[Figure: data -> preprocessing -> processed data -> learning algorithm -> a Bayesian network over Gene A, Gene B, Gene C, Gene D, and Target]
- Inference
[Figure: the values of Gene C and Gene B are given; belief propagation computes the probability for the target.]
81
Bayesian Networks for Gene Expression Analysis (3) Structure Learning
[Figure: learned network structure over the nodes FAH, Leukotriene, Zyxin, C-myb, LYN, CD33, DF, LEPR, GB DEF, Liver, and Target]
82
Bayesian Networks for Gene Expression Analysis (4) Learning Procedure
Input: gene expression data, D
Output: network structure G, local probability tables Θ
Objective function: BDe score(G, Θ)
$$\text{BDe score}(G, \Theta) = p(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$
$$\alpha_{ij} = \sum_k \alpha_{ijk} \ \text{for all } i, j, \qquad N_{ij} = \sum_k N_{ijk} \ \text{for all } i, j$$
- p(G): prior probability for the structure G
- n: the number of nodes (attributes)
- q_i: the number of states of the parents of the ith node
- r_i: the number of states of node i
- α_ijk: Dirichlet prior for node i at the jth parent state and kth state
- N_ijk: the frequency of node i at the jth parent state and kth state in data D
Procedure
1. The network structure without edge directions is learned with Chow and Liu's algorithm (using mutual information).
2. A greedy hill-climbing search looks for the maximum BDe score within the structure learned in step 1.
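The BDe score is a product of Gamma-function ratios, so it is usually evaluated in log space. The Python sketch below computes the log of one node's factor from its count table. Setting every Dirichlet prior to the same constant `alpha` is our simplifying assumption, as are the function names.

```python
import math


def log_bde_family(counts, alpha=1.0):
    """Log of one node's factor in the BDe score.

    counts[j][k] = N_ijk: how often the node takes state k while its parents
    are in configuration j. Every alpha_ijk equals `alpha` (an assumption).
    """
    score = 0.0
    for row in counts:                   # parent configurations j
        alpha_ij = alpha * len(row)      # alpha_ij = sum_k alpha_ijk
        n_ij = sum(row)                  # N_ij = sum_k N_ijk
        score += math.lgamma(alpha_ij) - math.lgamma(alpha_ij + n_ij)
        for n_ijk in row:                # node states k
            score += math.lgamma(alpha + n_ijk) - math.lgamma(alpha)
    return score


def log_bde_score(families, log_prior=0.0):
    """log BDe = log p(G) + sum over nodes of their family scores."""
    return log_prior + sum(log_bde_family(counts) for counts in families)


# Binary node with one binary parent, perfectly predicted by the parent.
print(log_bde_family([[1, 0], [0, 1]]))  # log(0.25)
```

A greedy hill-climbing search would call `log_bde_score` on each candidate structure and keep the change that increases it most.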
83
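Bayesian Networks for Gene Expression Analysis (5) Inference
- The Bayesian network constructed (partial)
[Figure: network over FAH, Leukotriene, C-myb, and Target]
- Given the values of C-myb and Leukotriene, the value of the Target can be inferred by
$$P(T \mid C, L) = \sum_F P(T, F \mid C, L) = \sum_F P(T \mid F, C, L)\, P(F \mid C, L) = \sum_F P(T \mid F)\, P(F \mid C, L)$$
where T is the Target, F is FAH, C is C-myb, and L is Leukotriene.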
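Inferring the Target from C-myb and Leukotriene amounts to summing out FAH. A minimal Python sketch with hypothetical conditional probability tables over binary states follows; the numbers are invented for illustration and do not come from the learned network.

```python
def infer_target(p_t_given_f, p_f_given_cl, c, l):
    """P(T | C=c, L=l) = sum_F P(T | F) P(F | C=c, L=l)."""
    result = {}
    for f, p_f in p_f_given_cl[(c, l)].items():
        for t, p_t in p_t_given_f[f].items():
            result[t] = result.get(t, 0.0) + p_t * p_f
    return result


# Hypothetical tables: p_t_given_f[f][t] and p_f_given_cl[(c, l)][f].
p_t_given_f = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}
p_f_given_cl = {(1, 0): {0: 0.5, 1: 0.5}}
posterior = infer_target(p_t_given_f, p_f_given_cl, c=1, l=0)
print(posterior)  # P(T=1 | C=1, L=0) = 0.2 * 0.5 + 0.9 * 0.5 = 0.55
```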
84
Bayesian Networks for Gene Expression Analysis (6) Classification Results
- Prediction error of this Bayesian network (given all attribute values)

                 Bayesian network   Weighted voting
Training data    1/38               1/38
Test data        8/34               4/34

- The result can be improved by more appropriate data preprocessing.
Non-negative Matrix Factorization
86
Non-negative Matrix Factorization (1)
- Method
  - Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
- NMF (non-negative matrix factorization)
$$G \approx WH, \qquad G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad G_{i\mu}, W_{ia}, H_{a\mu} \ge 0$$
  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)
- NMF as a latent variable model
[Figure: latent nodes h1, h2, …, hr connected through W to observed nodes g1, g2, …, gn]
$$g \approx Wh$$
87
NMF (2) Clustering Gene Expression Data
[Figure: the data matrix G (38 samples, 7,129 genes) factorized as W(?) x H(?) with 2 factors; a second panel shows the factors H1 and H2 connected through W to the genes g1, g2, g3, g4, …, g7,129]
- Factors can capture the correlations between the genes using the values of expression level.
- Cluster training samples into 2 groups by NMF
  - Assign each sample to the factor (class) which has the higher encoding value.
  - Accuracy: 0 to 1 errors on the training data set
88
NMF (3) Learning Procedure
Input: gene expression data matrix, G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: data size, m: number of genes, k: number of latent variables
Objective function:
$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right], \qquad W_{ia} \ge 0, \quad \sum_i W_{ia} = 1, \quad H_{a\mu} \ge 0$$
Procedure
1. Initialize W, H with random numbers.
2. Update W, H iteratively until max_iteration or some criterion is met.
$$W_{ia} \leftarrow W_{ia} \sum_\mu \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \sum_i W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$
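The multiplicative update rules can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions: random positive initialization, a fixed iteration count, a small `eps` to avoid division by zero, and W updated before H within each iteration. The toy matrix is invented.

```python
import random


def nmf(G, k, iters=200, seed=0, eps=1e-12):
    """Multiplicative updates for G ~ W H with non-negative entries.
    G is an n x m list of lists; returns W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(G), len(G[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]

    def ratio():
        # G_iu / (WH)_iu, with eps guarding against division by zero.
        return [[G[i][u] / (sum(W[i][a] * H[a][u] for a in range(k)) + eps)
                 for u in range(m)] for i in range(n)]

    for _ in range(iters):
        q = ratio()
        # W_ia <- W_ia * sum_u (G_iu / (WH)_iu) * H_au
        W = [[W[i][a] * sum(q[i][u] * H[a][u] for u in range(m))
              for a in range(k)] for i in range(n)]
        # Normalize each column of W so that sum_i W_ia = 1.
        col = [sum(W[i][a] for i in range(n)) + eps for a in range(k)]
        W = [[W[i][a] / col[a] for a in range(k)] for i in range(n)]
        q = ratio()
        # H_au <- H_au * sum_i W_ia * (G_iu / (WH)_iu)
        H = [[H[a][u] * sum(W[i][a] * q[i][u] for i in range(n))
              for u in range(m)] for a in range(k)]
    return W, H


G = [[5.0, 4.0, 1.0, 0.0], [4.0, 5.0, 0.0, 1.0],
     [0.0, 1.0, 5.0, 4.0], [1.0, 0.0, 4.0, 5.0]]
W, H = nmf(G, k=2, iters=50)
```

Because the updates only multiply by non-negative factors, W and H stay non-negative throughout.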
89
NMF (4) Learning Curve
[Figure: learning curve. x-axis: number of iterations (1 to 101); y-axis: log likelihood]
90
NMF (5) Clustering Result
[Figure: scatter plot of the ALL and AML samples; horizontal axis from 0 to 6, vertical axis from 0 to 7]
91
NMF (6) Diagnosis
- For each test sample g, estimate the encoding vector h that best approximates the sample.
  - W is the basis matrix computed during training (fixed).
  - As in training, assign each sample to the factor (class) which has the highest encoding value.
- Accuracy: 1 to 2 errors on the test data set
[Figure: a test sample g (7,129 genes) approximated as W x h(?) with 2 factors; a second panel shows h1 and h2 connected through W to the genes g1, g2, g3, g4, …, g7,129]
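Estimating the encoding of a test sample with W held fixed can be sketched by iterating only the encoding update. This minimal Python version assumes each column of W sums to 1, as in the training constraint. The function names, the starting point h = (1, ..., 1), and the tiny identity basis are ours.

```python
def estimate_encoding(g, W, iters=100, eps=1e-12):
    """Estimate h >= 0 with g ~ W h while W stays fixed.
    Assumes each column of W sums to 1."""
    k = len(W[0])
    h = [1.0] * k
    for _ in range(iters):
        wh = [sum(W[i][a] * h[a] for a in range(k)) + eps
              for i in range(len(g))]
        # h_a <- h_a * sum_i W_ia * g_i / (Wh)_i
        h = [h[a] * sum(W[i][a] * g[i] / wh[i] for i in range(len(g)))
             for a in range(k)]
    return h


def diagnose(g, W):
    """Assign the sample to the factor (class) with the highest encoding."""
    h = estimate_encoding(g, W)
    return max(range(len(h)), key=lambda a: h[a])


W_toy = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-gene, 2-factor basis
print(diagnose([3.0, 1.0], W_toy))  # 0
```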
92
Generative Topographic Mapping
93
Generative Topographic Mapping (1)
- GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.
[Figure: a grid of points in the latent space (axes x1, x2) mapped by y(x;W) into the data space (axes t1, t2, t3)]
94
GTM (2) Learning Algorithm (EM)
- Generate the grid of latent points.
- Generate the grid of basis function centers.
- Compute the matrix of basis function activations Φ.
- Initialize the weights W in Y = ΦW and the noise variance β.
- Compute Δ_{n,k} = ||t_n - Φ_k W||^2 = ||t_n - y_k(x,W)||^2 for each n, k.
- Repeat
  - Compute the responsibility matrix R using Δ and β. [E-step]
  - Compute G=RTR.
  - Update W by Φ^T G Φ W = Φ^T R T. [M-step]
  - Compute Δ = ||t - ΦW||^2.
  - Update β.
- Until convergence
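The E-step of the loop above turns squared distances into responsibilities. A minimal Python sketch follows, assuming that β acts as an inverse noise variance (the usual GTM convention, not stated on the slide) and that the prior over latent points is uniform. The posterior-mean helper uses the standard responsibility-weighted average of latent points, which is also our assumption.

```python
import math


def responsibilities(delta, beta):
    """E-step: R[k][n] is proportional to exp(-beta / 2 * delta[n][k]),
    normalized over the latent points k for each data point n."""
    n_points, n_latent = len(delta), len(delta[0])
    R = [[0.0] * n_points for _ in range(n_latent)]
    for n, row in enumerate(delta):
        smallest = min(row)  # subtract the minimum for numerical stability
        weights = [math.exp(-0.5 * beta * (d - smallest)) for d in row]
        total = sum(weights)
        for k, weight in enumerate(weights):
            R[k][n] = weight / total
    return R


def posterior_mean(R, latent_points, n):
    """Responsibility-weighted mean of the latent points for data point n."""
    dim = len(latent_points[0])
    return [sum(R[k][n] * latent_points[k][d] for k in range(len(R)))
            for d in range(dim)]


# One data point, two latent points at squared distances 0 and 2*ln(3).
R = responsibilities([[0.0, 2.0 * math.log(3.0)]], beta=1.0)
print(R)  # approximately [[0.75], [0.25]]
```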
95
GTM (3) Visualization
- Posterior distribution in latent space given a data point t:
X(t) ~
- For a whole set of data: for each t, plot in the latent space
  - Posterior mode:
  - Posterior mean:
96
GTM (4) Clustering Experiment
- Gene selection
  - Select about 50 genes out of 7,129 based on the three test scores of cancer diagnosis.
    • Correlation metric (similar to the t-test)
    • Wilcoxon test scores (a nonparametric t-test)
    • Median test scores (a nonparametric t-test)
- Clustering & visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.
97
- List of Genes Selected
98
GTM (5) Learning Curve
99
GTM (6) Clustering Result
Genes with high expression levels in case of ALL (large P-metric value)
Genes with high expression levels in case of AML (large negative P-metric value)
100
Summary
- Challenges of Machine Learning Applied to Biosciences
  - Huge data size
  - Noise and data sparseness
  - Unlabeled and imbalanced data
- Biosciences for Machine Learning
  - New application
  - Biosystems are existence proofs for ideal AI systems
  - Provide interesting metaphors and algorithms!
  - Synergy effects between biosciences and artificial intelligence