machine learning for biological data mining · 2015-12-16 · machine learning techniques for bio...

1

Machine Learning for Biological Data Mining

Byoung-Tak Zhang

School of Computer Science and Engineering

Seoul National University

E-mail: [email protected]

This material is available at http://scai.snu.ac.kr/~btzhang/

2

Outline

• Basics in Molecular Biology
• Current Issues and Applications
• Machine Learning for Bioinformatics
• DNA Chip Data Mining
• Graphical Models for Gene Expression Analysis
• Summary

2

3

What is Bioinformatics?

• Bio: molecular biology
• Informatics: computer science
• Bioinformatics: solving problems arising from biology using methodology from computer science
• Bioinformatics vs. Computational Biology

4

Basics in Molecular Biology

3

5

DNA Double Helix

6

DNA Base-pairs

4

7

DNA

AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGCTATTGTACCCGTTGCTTCGGATCTCTTGGGGATCTCTTGGTTCCGGCATGCAATCAGTCCCGTTGCTTCGGCACTGTCTGAAAGCGCCTTTGGGCCCAACCTCCCACCGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGGCCGCCGGGGGCACTGTCTGAAAGCTCGGCCGCC

8

Human Genome Sequenced!

• “The most wondrous map ever produced by human kind”
• Scientists jointly announced that they had obtained a near complete set of the biochemical instructions for human life.
• “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

5

9

Some Facts

• DNA differs between humans by 0.2% (1 in 500 bases).
• Human DNA is 98% identical to that of chimpanzees.
• 97% of the DNA in the human genome has no known function.
• 3 × 10^9 letters in the DNA code in every cell in your body.
• 10^14 cells in the body.
• 12,000 letters of DNA decoded by the Human Genome Project every second.

10

Molecular Biology: Flow of Information

DNA → RNA → Protein → Function

[Figure: a DNA sequence (e.g. AGGTGTGC) and the protein it encodes (Phe-Cys-Lys-Cys-Asp-Cys-Arg-Ser-Ala-Leu)]

6

11

Using the Genome

• Redundancy in genetic information
• Single genes have multiple functions
• Genes are 1-D, gene products are 3-D

Genetic Information

Molecular Structure

Biochemical Function

Biological Behavior

Molecular Dynamics

Biophysics

Biochemistry

12

Gene Structure

DNA “gene” search

RNA

Protein sequence

Folded protein

compute

compute

?how?

7

13

DNA (gene) → RNA → Protein

Gene layout on the DNA: control statement (TATA box, start), ribosome binding site, gene, termination (stop), control statement

Transcription (RNA polymerase): DNA → mRNA, flanked by the 5' UTR and the 3' UTR

Translation (ribosome): mRNA → protein

14

Numbers of Genes

• Humans: 25,000 - 40,000
• C. elegans (worm): 19,000
• S. cerevisiae (yeast): 6,000
• Tuberculosis microbe: 4,000

8

15

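Genetic Code: 3 bases = 1 amino acid

Rows: first position (5' end). Columns: second position. Within each cell: third position (3' end) in the order T, C, A, G.

First |      T       |      C       |      A       |      G       | Third
  T   |  TTT Phe     |  TCT Ser     |  TAT Tyr     |  TGT Cys     |   T
      |  TTC Phe     |  TCC Ser     |  TAC Tyr     |  TGC Cys     |   C
      |  TTA Leu     |  TCA Ser     |  TAA STOP    |  TGA STOP    |   A
      |  TTG Leu     |  TCG Ser     |  TAG STOP    |  TGG Trp     |   G
  C   |  CTT Leu     |  CCT Pro     |  CAT His     |  CGT Arg     |   T
      |  CTC Leu     |  CCC Pro     |  CAC His     |  CGC Arg     |   C
      |  CTA Leu     |  CCA Pro     |  CAA Gln     |  CGA Arg     |   A
      |  CTG Leu     |  CCG Pro     |  CAG Gln     |  CGG Arg     |   G
  A   |  ATT Ile     |  ACT Thr     |  AAT Asn     |  AGT Ser     |   T
      |  ATC Ile     |  ACC Thr     |  AAC Asn     |  AGC Ser     |   C
      |  ATA Ile     |  ACA Thr     |  AAA Lys     |  AGA Arg     |   A
      |  ATG Met     |  ACG Thr     |  AAG Lys     |  AGG Arg     |   G
  G   |  GTT Val     |  GCT Ala     |  GAT Asp     |  GGT Gly     |   T
      |  GTC Val     |  GCC Ala     |  GAC Asp     |  GGC Gly     |   C
      |  GTA Val     |  GCA Ala     |  GAA Glu     |  GGA Gly     |   A
      |  GTG Val     |  GCG Ala     |  GAG Glu     |  GGG Gly     |   G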
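The codon table lends itself to a short worked example. The sketch below is an illustration rather than part of the original slides: it builds the standard genetic code in T, C, A, G order and translates a DNA string codon by codon until a STOP codon. The function name `translate` and the use of one-letter amino acid codes are assumptions made for brevity.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes listed in T, C, A, G order for the first,
# second and third codon position; '*' marks a STOP codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons (DNA alphabet) to its amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, stopping at the first STOP codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("ATGTTTTGCAAATAA")` reads ATG, TTT, TGC, AAA and then the STOP codon TAA, giving Met-Phe-Cys-Lys.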

16

Nucleotide Sequence

aacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc

gcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgcaacctgcgga aggatcattaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg cggagacccc gcgggcccgc cgcttgtcggccgccggggg ggcgcctctg

cgcttgtcgg ccgccgggggccccccgggc ccgtgcccgccggagacccc aacacgaacactgtctgaaa gcgtgcagtctgagttgatt gaatgcaatcagttaaaact ttcaacaatggatctcttgg aacctgcggaccgagtgcgg gtcctttgggcccaacctcc catccgtgtctattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgagttaaaact ttcaacaatggatctcttgg ttccggctgc tattgtaccc tgttgcttcggcgggcccgc cgcttgtcggccgccggggg ggcgcctctgccccccgggc ccgtgcccgccggagacccc tgttgcttcg

SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other

9

17

Protein (Amino Acid) Sequence

CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 .. 1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAKKELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSALEAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPSEIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC

18

Protein Structure

10

19

Human Genetic Variations (Single Nucleotide Polymorphisms)

• SNPs: “genetic individuality”
• ~1/1000 bases variable (between 2 humans)
• Make us more/less susceptible to diseases
• May influence the effect of drug treatments

TTTGCTCCGTTTTCA

TTTGCTCYGTTTTCA

TTTGCTCTGTTTTCA
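The three sequences above differ at a single position, and the middle one writes that position with the IUPAC ambiguity code Y (C or T). A minimal sketch of how such a site could be located, assuming two aligned sequences of equal length; the function names are hypothetical.

```python
# IUPAC ambiguity codes for the six possible pairs of different bases.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def snp_positions(seq_a: str, seq_b: str) -> list:
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def consensus(seq_a: str, seq_b: str) -> str:
    """Merge two aligned sequences, writing variable sites as IUPAC codes."""
    return "".join(
        a if a == b else IUPAC_PAIRS[frozenset((a, b))]
        for a, b in zip(seq_a, seq_b)
    )
```

Applied to the first and third sequences, `snp_positions` reports position 7 and `consensus` reproduces the middle sequence.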

20

SNP (Single Nucleotide Polymorphism)

• Finding single nucleotide changes at specific regions of genes
  - Diagnosis of hereditary diseases
  - Personalized drugs
  - Finding more effective drugs and treatments

11

21

Human Individuality

22

Flood of Data! (SWISS-PROT)

[Figure: number of sequences in SWISS-PROT (× 1000, from 0 to 80) by year of release, 1988 to 1996]

12

23

How Can We Analyze the Flood of Data?

• Data: don’t just store it, analyze it! By comparing sequences, one can find out about things like
  - ancestors of organisms
  - phylogenetic trees
  - protein structures
  - protein function

24

Bioinformatics Is About:

• Elicitation of DNA sequences from genetic material
• Sequence annotation (e.g. with information from experiments)
• Understanding the control of gene expression (i.e. under what circumstances proteins are produced from the genes in DNA)
• The relationship between the amino acid sequence of proteins and their structure

13

25

Aim of Research in Bioinformatics

• Understand the functioning of living things, to “improve the quality of life”
  - Drug design
  - Identification of genetic risk factors
  - Gene therapy
  - Genetic modification of food crops and animals, etc.

26

Current Issues and Applications

14

27

The Central Dogma of Information Flow in Biology

The sequence of amino acids making up a protein, and hence its structure (folded state) and thus its function, is determined by the DNA sequence, which is transcribed into RNA and then translated into protein.

DNA → RNA → Protein → Function

28

3 Main Classes of Problem Areas

• Central dogma related: sequence, structure or function
• Data related: storage, retrieval and analysis (exponential growth of knowledge in molecular biology)
• Simulation of biological processes: protein folding (molecular dynamics) or metabolic pathways

15

29

Topics in Bioinformatics

• Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
• Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
• Expression analysis
  - Gene expression analysis
  - Gene clustering
• Pathway analysis
  - Metabolic pathway
  - Regulatory networks

30

Sequence Analysis

• Finding evolutionary relationships
• Finding coding regions of genomic sequences
• Translating DNA to protein
• Finding regulatory regions
• Assembling genome sequences

Finding information and patterns in DNA and protein data

16

31

Structure Analysis

• The amino acid sequence of a protein determines its 3D conformation

MNIHRSTPITIARYGRSRNKTQDFEELSSIRSAEPSQSFSPNLGSPSPPETPNLSHCVSCIGKYLLLEPLEGDHVFRAVHLHSGEELVCKVFDISCYQESLAPCF

Sequence → Structure → Function

32

Gene Expression Analysis

Nature Genetics 21, 10 (1999)

17

33

Pathway Analysis

• One of the declarative ways of representing biological knowledge

Metabolic pathway

34

Applications of Bioinformatics

• Drug design
• Identification of genetic risk factors
• Gene therapy
• Genetic modification of food crops and animals
• Biological warfare, crime etc.
• Personal Medicine?
• E-Doctor?

18

35

Bioinformatics as Information Technology

Information Retrieval

GenBank

SWISS-PROT

Hardware

Agent

Machine Learning

Algorithm

Supercomputing

Information filtering

Monitoring agent

Pattern recognition

Clustering

Rule discovery

Sequence alignment

Biomedical text analysis

Database

Bioinformatics

36

Bioinformatics on the Web

sample

array

hybridization

scanner

relational

database

Data management

The experimental process

web interface

image analysis results and summaries

links to other information resources

download data to other applications

Data analysis and interpretation

19

37

Bioinformatics and Artificial Intelligence

• A new application domain of AI and machine learning
  - Data mining and knowledge discovery
  - Information filtering for scientists
  - Intelligent agents for customized data service
• A new basis for developing new AI technologies
  - “Biointelligence”
  - Biomolecular (DNA) computing
  - Molecular evolutionary algorithms

38

Machine Learning for Bioinformatics

20

39

Machine Learning Techniques for BioData Mining

• Sequence Alignment
  - Simulated Annealing
  - Genetic Algorithms
• Structure and Function Prediction
  - Hidden Markov Models
  - Multilayer Perceptrons
  - Decision Trees
• Molecular Clustering and Classification
  - Support Vector Machines
  - Nearest Neighbor Algorithms
• Expression (DNA Chip Data) Analysis
  - Self-Organizing Maps
  - Bayesian Networks

40

Problems in Biological Science and Machine Learning Methods

Sequence alignment (homology search)
  Problems: pairwise sequence alignment; database search for similar sequences; multiple sequence alignment; phylogenetic tree reconstruction; protein 3D structure alignment
  Methods: optimization algorithms
    - Dynamic programming
    - Simulated annealing
    - Genetic algorithms
    - Neural networks
    - Hidden Markov models

Structure/function prediction
  Problems: RNA secondary structure prediction; RNA 3D structure prediction; protein 3D structure prediction; motif extraction; functional site prediction; cellular localization prediction; coding region prediction; transmembrane segment prediction; protein secondary structure prediction
  Methods: pattern recognition and learning algorithms
    - Discriminant analysis
    - Hierarchical neural networks
    - Hidden Markov models
    - Formal grammar

21

41

Problems in Biological Science and Machine Learning Methods

Molecular Clustering/Classification
  Problems: superfamily classification; ortholog/paralog grouping of genes; 3D fold classification
  Methods: clustering algorithms
    - Hierarchical cluster analysis
    - Kohonen neural networks
  Methods: classification algorithms
    - Bayesian Networks
    - Neural Networks
    - Support Vector Machines
    - Decision Trees

Expression (DNA Chip Data) Analysis
  Methods:
    - Support Vector Machines
    - Bayesian Networks
    - Latent Variable Models
    - Generative Topographic Mapping

42

Sequence Alignment

22

43

Sequence Alignment (Similarity Search)

• Basic operation
  - Comparison against each of the known examples stored in a primary database to detect any similarity that can be used for further reasoning

? Example

ATTGGCCA

| | | |A— GG— A

4+2*10=24

ATTGGCCA

| |AGG — — A

6+1*10=16

ATTGGCCA

| |AG — — — A

6+1*10=16

ATTGGCCA

A— — — GA

6+1*10=16
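Candidate alignments such as those above are usually not enumerated by hand; dynamic programming finds the best one. The sketch below computes a global alignment score in the style of Needleman and Wunsch. Its scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for illustration and is not the scoring used in the slide's example.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

Under these assumed scores, aligning ATTGGCCA with AGGA gives 0: four matches and four gap positions, as in the alignment A--GG--A.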

44

Simulated Annealing for Multiple Sequence Alignment

• The Metropolis Monte Carlo procedure is repeated at gradually decreasing temperature for energy minimization

[Figure: energy E plotted over x (e.g. all possible alignments)]

$$\Delta E = E(x') - E(x_n)$$

$$p = \begin{cases} 1 & \text{when } \Delta E \le 0 \\ \exp(-\Delta E / T_n) & \text{otherwise} \end{cases}$$
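The Metropolis acceptance rule and the cooling loop can be sketched generically. In the sketch below the geometric cooling schedule, the parameter defaults and the function names are assumptions for illustration; for multiple sequence alignment, `energy` would score an alignment and `neighbor` would propose a small change such as shifting a gap.

```python
import math
import random


def acceptance_probability(delta_e: float, temperature: float) -> float:
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    return 1.0 if delta_e <= 0 else math.exp(-delta_e / temperature)


def simulated_annealing(energy, neighbor, x0, t0=10.0, cooling=0.95,
                        steps=500, seed=0):
    """Minimise energy(x), returning the best state seen and its energy."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(x, rng)
        candidate_e = energy(candidate)
        if rng.random() < acceptance_probability(candidate_e - e, temperature):
            x, e = candidate, candidate_e
            if e < best_e:
                best_x, best_e = x, e
        temperature *= cooling  # assumed geometric cooling schedule
    return best_x, best_e
```

A toy call such as `simulated_annealing(lambda x: (x - 3) ** 2, lambda x, rng: x + rng.choice((-1, 1)), 20)` walks an integer towards the minimum at 3.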

23

45

Genetic Algorithms: Representation

• For sequence assembly
  - The sorted order representation
• Operators
  - A simple swap operation as the mutation operator
  - Permutation crossover
  - Transposition operator
  - Inversion operator

Fragment:              1      2      3      4      5
Individual:          1110 | 0010 | 1001 | 0110 | 1011 | 0011
Decimal number:       14      2      9      6     11      3 (starting position)
Sort order:            5      1      3      2      4
Intermediate layout:   2 4 3 5 1
Final layout:          3 5 1 2 4
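The decoding of a sorted order individual can be traced in code. The sketch below rests on an interpretation of the slide's example: five 4-bit sort keys (14, 2, 9, 6, 11) followed by a 4-bit starting position (3), with the layout rotated to begin at that position. The function name and the rotation rule are assumptions.

```python
def decode_individual(bits: str, n_fragments: int, bits_per_key: int = 4) -> list:
    """Decode a sorted order bit string into a fragment layout (fragments from 1)."""
    keys = [
        int(bits[i * bits_per_key:(i + 1) * bits_per_key], 2)
        for i in range(n_fragments + 1)
    ]
    *sort_keys, start = keys  # the last number is read as the starting position
    # Intermediate layout: fragments ordered by their sort key.
    # sorted() is stable, so equal keys keep their original fragment order.
    intermediate = sorted(range(1, n_fragments + 1),
                          key=lambda fragment: sort_keys[fragment - 1])
    # Final layout: rotate so the layout begins at the (1-based) start position.
    shift = (start - 1) % n_fragments
    return intermediate[shift:] + intermediate[:shift]
```

With the bit string 1110 0010 1001 0110 1011 0011 the intermediate layout is 2 4 3 5 1 and the final layout is 3 5 1 2 4. Because any bit string decodes to a valid permutation, standard crossover and mutation can be applied without repair steps.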

46

Structure and Function Prediction

• Hidden Markov Models for Protein Modeling
• Multilayer Perceptrons for Internal Exon Prediction: GRAIL
• Decision Trees for Gene Finding

24

47

Structure and Function Prediction

• Protein structure prediction
• Gene finding and gene prediction
• Protein modeling

48

Hidden Markov Models for Protein Modeling

• Alphabet of 20 letters (the 20 amino acids)
• m0: start state, m5: end state, mk: match states
• ik: insertion states, dk: deletion states
• T(s2|s1): transition probabilities
• P(x|mk): letter-generating (emission) probabilities, where x is a letter (an amino acid)
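The probability that such a model generates a given sequence is computed with the forward algorithm. The sketch below shows only the core recursion for a plain HMM; a full profile HMM would add the silent deletion states and explicit start and end states. The two-state model and the two-letter alphabet are hypothetical values chosen to keep the example small.

```python
from itertools import product


def forward_probability(sequence, initial, transition, emission) -> float:
    """P(sequence) under an HMM, summed over all hidden state paths."""
    states = list(initial)
    # alpha[s]: probability of the prefix read so far, ending in state s.
    alpha = {s: initial[s] * emission[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emission[s][symbol] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: two states emitting letters from the alphabet {A, G}.
initial = {"m1": 0.6, "m2": 0.4}
transition = {"m1": {"m1": 0.7, "m2": 0.3}, "m2": {"m1": 0.4, "m2": 0.6}}
emission = {"m1": {"A": 0.9, "G": 0.1}, "m2": {"A": 0.2, "G": 0.8}}

# Sanity check: the probabilities of all sequences of one length sum to 1.
total = sum(
    forward_probability("".join(letters), initial, transition, emission)
    for letters in product("AG", repeat=3)
)
```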

25

49

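Multilayer Perceptrons for Internal Exon Prediction: GRAIL

[Figure: GRAIL neural network. Inputs: coding potential value, GC composition, length, donor, acceptor, intron vocabulary. Output: a discrete exon score between 0 and 1, plotted as score against the bases of the sequence.]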

50

Coding and Non-coding Regions

DNA

Regulatory region Protein coding region

DNA -> RNA -> Protein

GENE

DNA

Non-coding region

Non-coding region

AUG TAA

26

51

Decision Trees for Gene Finding

• MORGAN: a decision tree system for gene finding (finding coding and non-coding regions, exon finding)

Features used in the tree:
  - donor: donor site score
  - d+a: donor and acceptor site score
  - hex: in-frame hexamer frequency
  - asym: Fickett’s position asymmetry statistic

[Figure: decision tree with yes/no branches on the tests d+a<3.4?, d+a<1.3?, hex<16.3?, donor<0.0?, d+a<5.3?, hex<0.1?, hex<-5.6? and asym<4.6?; leaf counts (6,560), (18,160), (5,21), (23,16), (9,49), (142,73), (24,13), (1,5), (737,50); annotation: by Markov chains]

52

Molecular Clustering and Classification

27

53

Molecular Clustering and Classification

• Clustering (unsupervised learning)
  - Hierarchical cluster analysis
  - Kohonen neural networks
• Classification (supervised learning)
  - Hidden Markov Model
  - Neural networks
  - Bayesian networks
  - Support vector machines
  - Nearest Neighbor Algorithm
  - Decision trees

54

Support Vector Machines for Functional Classification of Genes (1)

• Classification of microarray gene expression data [M. Brown, et al., PNAS, 97(1):262-267]
  - Classifying the functional class of genes using gene expression data from DNA microarray hybridization experiments
  - Dataset: 2467 genes, 79 experiments (2467 × 79 matrix)

Functional classes defined from MYGD:
1. Tricarboxylic-acid pathway
2. Respiration chain complexes
3. Cytoplasmic ribosomal proteins
4. Proteasome
5. Histones
6. Helix-turn-helix

[Figure: expression profiles of the 121 cytoplasmic ribosomal proteins. (Similarity can be found!)]

28

55

Support Vector Machines for Functional Classification of Genes (2)

[Table: comparison of error rates for various classification methods on 4 classes]

Cost = FP + 2FN

FLD: Fisher’s linear discriminant

C4.5 and MOC1: decision trees

Parzen: Parzen windows (a similar nonparametric density estimation technique)
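The cost measure Cost = FP + 2FN counts a missed member of a functional class (false negative) twice as heavily as a false alarm (false positive). A minimal sketch of the computation, assuming boolean labels where true means the gene belongs to the class; the function name is hypothetical.

```python
def misclassification_cost(y_true, y_pred, fn_weight: int = 2) -> int:
    """Cost = FP + fn_weight * FN for boolean true labels and predictions."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return false_positives + fn_weight * false_negatives
```

With true labels `[1, 1, 0, 0, 1]` and predictions `[1, 0, 1, 0, 0]` there is one false positive and two false negatives, so the cost is 1 + 2 × 2 = 5.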

56

Nearest Neighbor Algorithms for 3D Protein Classification

• 3D shape similarity model by shape histograms [Ankerst, 1999]

d(i,j): distance of the cells that correspond to the bins i, j.

The cell distance is calculated from the difference of the shell radii and the angles between the sectors.

29

57

DNA Microarray Data Mining

58


• DNA Microarray
  - Hybridize thousands of DNA samples of each gene on a glass with special cDNA samples (made under two different conditions: background condition, experimental condition)
  - Ratio of a gene: ratio of two expression levels of a gene
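The ratio of a gene is often analysed on a logarithmic scale so that doubling and halving are symmetric around zero. The sketch below assumes base-2 logarithms and the experimental level in the numerator; both are common conventions rather than something stated on the slide.

```python
import math


def log_ratio(experimental: float, background: float) -> float:
    """log2 of the expression ratio between the two conditions."""
    if experimental <= 0 or background <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(experimental / background)
```

A gene measured at 200 under the experimental condition and 100 under the background condition has a ratio of 2 and a log ratio of 1; the reverse gives -1.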

30

59

Spotted Microarray Chip

Nature Genetics 21, 15 (1999)

60

DNA Chip Technology

• Pin microarray methods
• Inkjet methods
• Photolithography methods
• Electronic array methods

31

61

Application of DNA Microarrays

• Applications
  - Gene discovery: gene/mutated gene (growth, behavior, homeostasis …)
  - Disease diagnosis
  - Drug discovery: pharmacogenomics
  - Toxicological research: toxicogenomics

62

Computational Tools for DNA Microarrays

• Major components
  - LIMS (laboratory information management system)
  - Image processing
  - Data mining
  - Experiment design
• Major trends for the data analysis
  - Statistical methods
  - Machine learning
  - Reverse engineering

32

63

Diversity of Gene Expression

• Tissues
  - muscle, skin, liver, brain, …
• Developmental stages
  - embryonic, stem, adult cells, …
• Clinical symptoms
  - liver cell, hepatoma, hepatitis, regeneration, …
• Environmental factors
  - synthetic/natural chemicals, virus, …

64

Analysis of DNA Microarray Data: Previous Work

• Characteristics of data
  - Analysis of expression ratio based on each sample
  - Analysis of time-variant data
• Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Orly Alter et al., 2000]
• Classification
  - Support vector machines [Brown et al., 2000]
• Gene identification
  - Information theory [Stefanie et al., 2000]
• Gene modeling
  - Bayesian networks [Friedman et al., 2000]

33

CAMDA-2000 Data Sets

66

CAMDA-2000 Data Sets

• CAMDA
  - Critical Assessment of Techniques for Microarray Data Mining
  - Purpose: evaluate the data-mining techniques available to the microarray community.
• Data Set 1
  - Identification of cell cycle-regulated genes
  - Yeast Saccharomyces cerevisiae by microarray hybridization
  - Gene expression data with 6,278 genes
• Data Set 2
  - Cancer class discovery and prediction by gene expression monitoring
  - Two types of cancers: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
  - Gene expression data with 7,129 genes

34

67

CAMDA-2000 Data Set 1: Identification of Cell Cycle-regulated Genes of the Yeast by Microarray Hybridization

• Data given: gene expression levels of 6,278 genes over time
  - α factor-based synchronization: every 7 minutes from 0 to 119 (18 time points)
  - Cdc15-based synchronization: every 10 minutes from 10 to 290 (24 time points)
  - Cdc28-based synchronization: every 10 minutes from 0 to 160 (17 time points)
  - Elutriation (size-based synchronization): every 30 minutes from 0 to 390 (14 time points)
• Among the 6,278 genes
  - 104 genes are known to be cell-cycle regulated, classified into: M/G1 boundary (19), late G1 SCB regulated (14), late G1 MCB regulated (39), S phase (8), S/G2 phase (9), G2/M phase (15)
  - 250 cell cycle-regulated genes might exist

68

CAMDA-2000 Data Set 1: Characteristics of Data (α Factor-based Synchronization)

[Figure: expression profiles for each class of cell cycle-regulated genes]
• M/G1 boundary
• Late G1 SCB regulated
• Late G1 MCB regulated
• S phase
• S/G2 phase
• G2/M phase

35

69

CAMDA-2000 Data Set 2: Cancer Class Discovery and Prediction by Gene Expression Monitoring

• Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - Datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
• Graphical models used:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping

70

[Figure: experimental workflow with the panels <Protocol> and <Experimental Method>]

36

71

Graphical Models for DNA Chip Data Mining

72

Classes of Graphical Models

Graphical Models

• Undirected
  - Boltzmann Machines
  - Markov Random Fields
• Directed
  - Bayesian Networks
  - Latent Variable Models
  - Hidden Markov Models
  - Generative Topographic Mapping
  - Non-negative Matrix Factorization

37

73


• Latent Variable Models
• Bayesian Networks
• Non-negative Matrix Factorization
• Generative Topographic Mapping

74

Latent Variable Models Probabilistic Clustering - Model

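g_i: ith gene
z_k: kth cluster
t_j: jth time point
p(g_i|z_k): generating probability of the ith gene given the kth cluster
v_k = p(t|z_k): prototype of the kth cluster

Each expression profile is normalized over time:

$$x_{ij} \leftarrow \frac{x_{ij}}{\sum_{j'} x_{ij'}}$$

Cluster posterior and similarity between a gene profile and a prototype:

$$p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}, \qquad \mathrm{similarity}(\mathbf{x}_i, \mathbf{v}_k) = \sum_j x_{ij}\, v_{kj}$$

Objective function (maximized by EM):

$$f(\mathbf{g}, \mathbf{t}, \mathbf{z}) = \sum_i \sum_j \frac{g_{ij}}{\sum_{j'} g_{ij'}} \log\Big(\sum_k p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)\Big)$$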

38

75

Latent Variable Models Probabilistic Clustering – Learning

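initialize p(z_k), p(g_i|z_k), p(t_j|z_k) for i = 1..N, j = 1..M, k = 1..K such that

$$\sum_{k=1}^{K} p(z_k) = 1, \qquad \sum_{j=1}^{M} p(t_j \mid z_k) = 1, \qquad \sum_{i=1}^{N} p(g_i \mid z_k) = 1$$

while (max_iteration not reached) do EM adaptation

  // E-step

$$p(z_k \mid g_i, t_j) = \frac{p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)}{\sum_{k'} p(z_{k'})\, p(g_i \mid z_{k'})\, p(t_j \mid z_{k'})}$$

  // M-step, with the normalized expression $\tilde{g}_{ij} = g_{ij} / \sum_{j'} g_{ij'}$

$$p(g_i \mid z_k) = \frac{\sum_j \tilde{g}_{ij}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_{j'} \tilde{g}_{i'j'}\, p(z_k \mid g_{i'}, t_{j'})}$$

$$p(t_j \mid z_k) = \frac{\sum_i \tilde{g}_{ij}\, p(z_k \mid g_i, t_j)}{\sum_{i'} \sum_{j'} \tilde{g}_{i'j'}\, p(z_k \mid g_{i'}, t_{j'})}$$

$$p(z_k) = \frac{1}{R} \sum_i \sum_j \tilde{g}_{ij}\, p(z_k \mid g_i, t_j), \qquad R = \sum_i \sum_j \tilde{g}_{ij}$$

end while

// prototypes
p(t|z_k), k = 1..K, are the prototypes for each cluster

// clustering
given a gene g_i, the cluster of g_i is the k for which p(g_i, z_k) is the biggest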
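The EM procedure for this probabilistic clustering model can be sketched in plain Python. The sketch assumes a strictly positive expression matrix `g` (genes by time points), random initialisation and a fixed number of iterations; the function names are hypothetical, and the recorded objective is the row-normalised log-likelihood evaluated before each update.

```python
import math
import random


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def fit_latent_clusters(g, n_clusters, n_iter=50, seed=0):
    """EM for p(z_k), p(g_i|z_k), p(t_j|z_k) on a positive matrix g."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    w = [normalize(row) for row in g]  # each gene profile sums to 1 over time
    p_z = normalize([rng.random() + 0.1 for _ in range(n_clusters)])
    p_g = [normalize([rng.random() + 0.1 for _ in range(n)]) for _ in range(n_clusters)]
    p_t = [normalize([rng.random() + 0.1 for _ in range(m)]) for _ in range(n_clusters)]
    history = []
    for _ in range(n_iter):
        new_g = [[0.0] * n for _ in range(n_clusters)]
        new_t = [[0.0] * m for _ in range(n_clusters)]
        new_z = [0.0] * n_clusters
        objective = 0.0
        for i in range(n):
            for j in range(m):
                joint = [p_z[k] * p_g[k][i] * p_t[k][j] for k in range(n_clusters)]
                total = sum(joint)
                objective += w[i][j] * math.log(total)
                for k in range(n_clusters):
                    # E-step posterior p(z_k | g_i, t_j), weighted by the data.
                    r = w[i][j] * joint[k] / total
                    new_g[k][i] += r
                    new_t[k][j] += r
                    new_z[k] += r
        history.append(objective)
        # M-step: renormalise the accumulated expected counts.
        p_g = [normalize(row) for row in new_g]
        p_t = [normalize(row) for row in new_t]
        p_z = normalize(new_z)
    return p_z, p_g, p_t, history


def assign_cluster(i, p_z, p_g):
    """Cluster of gene i: the k that maximises p(g_i, z_k) = p(g_i|z_k) p(z_k)."""
    return max(range(len(p_z)), key=lambda k: p_z[k] * p_g[k][i])
```

The rows of `p_t` play the role of the cluster prototypes, and the values in `history` should not decrease from one iteration to the next, which is a convenient check on an implementation.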

76

Latent Variable Models Probabilistic Clustering – Learning Curve

[Figure: learning curve. Objective function value (from about -784.5 to -781.5) against the number of iterations (1 to about 1,684)]

39

77

Latent Variable Models Probabilistic Clustering – Result

• Prototypes

[Figure: six cluster prototype plots, each over time points 1 to 18 with values between 0 and 0.35]

• Clustering: given a gene g_i, the cluster of g_i is k, where k = argmax_m p(g_i, z_m)

78

Bayesian Networks

40

79

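Bayesian Networks for Gene Expression Analysis (1): Feature Selection

• Part of the data
  - There are 7,129 integer-valued attributes in all.
• Attribute selection
  - The 10 attributes with the highest P values are selected, where μ_c and σ_c are the mean and standard deviation of the attribute in class c.

$$P = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}$$

• Attribute value categorization

$$\mathrm{categorized\_value} = \left\lfloor \frac{\mathrm{value}}{\max(\mathrm{attribute}) - \min(\mathrm{attribute})} \times 10 \right\rfloor$$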
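The attribute selection and categorization steps can be sketched as follows. The sketch reads the selection score as P = |μ1 − μ2| / (σ1 + σ2) and the categorization as a floor of the value scaled by the attribute's range; the use of the population standard deviation, the data layout (a dict from gene name to per-sample values) and all function names are assumptions.

```python
import math
from statistics import mean, pstdev


def p_metric(class1, class2) -> float:
    """|mu1 - mu2| / (sigma1 + sigma2) for one attribute split by class."""
    return abs(mean(class1) - mean(class2)) / (pstdev(class1) + pstdev(class2))


def rank_genes(expression, labels, n_top=10):
    """Return the n_top gene names with the highest P value (two classes)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    scored = []
    for gene, values in expression.items():
        first = [v for v, y in zip(values, labels) if y == classes[0]]
        second = [v for v, y in zip(values, labels) if y == classes[1]]
        scored.append((p_metric(first, second), gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:n_top]]


def categorize(value, attribute_values, n_bins=10) -> int:
    """Discretise a value by the attribute's range, as read from the slide."""
    span = max(attribute_values) - min(attribute_values)
    return math.floor(value / span * n_bins)
```

An attribute that is constant within both classes makes the denominator zero, so a real pipeline would need to filter such attributes first.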

80

Bayesian Networks for Gene Expression Analysis (2)

• Learning
  - Data → Preprocessing → Processed data → Learning algorithm → Bayesian network over Gene A, Gene B, Gene C, Gene D and the Target
• Inference
  - The values of Gene C and Gene B are given.
  - Belief propagation: the probability for the target is computed.

[Figure: the learned network (nodes Gene A, Gene B, Gene C, Gene D, Target) shown at successive steps of inference]

41

81

Bayesian Networks for Gene Expression Analysis (3): Structure Learning

[Figure: learned network structure over the nodes FAH, Leukotriene, Zyxin, C-myb, LYN, CD33, DF, LEPR, GB DEF, Liver and Target]

82

Bayesian Networks for Gene Expression Analysis (4): Learning Procedure

Input: gene expression data, D

Output: network structure G, local probability tables Θ

Objective function: BDe score(G, Θ)

$$\mathrm{BDe\ score}(G, \Theta) = p(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

$$\alpha_{ij} = \sum_k \alpha_{ijk}, \qquad N_{ij} = \sum_k N_{ijk} \qquad \text{for all } i, j$$

p(G): prior probability for the structure G

n: the number of nodes (attributes)

q_i: the number of states of the parents of the ith node

r_i: the number of states of node i

α_ijk: Dirichlet prior for node i at the jth parent state and kth state

N_ijk: the frequency of node i at the jth parent state and kth state in data D

Procedure

1. The network structure without edge directions is learned by Chow and Liu’s algorithm (using mutual information).

2. Greedy hill-climbing search for the maximum BDe score within the structure learned in step 1.

42

83

Bayesian Networks for Gene Expression Analysis (5): Inference

• The Bayesian network constructed (partial)

[Figure: partial network over the nodes C-myb, Leukotriene, FAH and Target]

• Given the values of C-myb (C) and Leukotriene (L), the value of the Target (T) can be inferred by summing over FAH (F):

$$P(T \mid C, L) = \sum_F P(T, F \mid C, L) = \sum_F P(T \mid F, C, L)\, P(F \mid C, L) = \sum_F P(T \mid F)\, P(F \mid C, L)$$
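The inference step amounts to marginalising over the unobserved FAH node: P(T | C, L) is the sum over F of P(T | F) P(F | C, L). A minimal sketch follows; the state names and the probability values used in the example are hypothetical and only illustrate the arithmetic.

```python
def target_posterior(p_f_given_cl, p_t_given_f):
    """P(T | C, L) = sum over F of P(T | F) * P(F | C, L).

    p_f_given_cl: dict mapping each FAH state to P(F | C, L) for the
                  observed values of C-myb and Leukotriene.
    p_t_given_f:  dict mapping each FAH state to a dict of P(T | F).
    """
    target_states = next(iter(p_t_given_f.values())).keys()
    return {
        t: sum(p_t_given_f[f][t] * p_f for f, p_f in p_f_given_cl.items())
        for t in target_states
    }


# Hypothetical conditional probability tables.
posterior = target_posterior(
    {"high": 0.8, "low": 0.2},
    {"high": {"ALL": 0.9, "AML": 0.1}, "low": {"ALL": 0.2, "AML": 0.8}},
)
```

With these numbers the posterior for ALL is 0.8 × 0.9 + 0.2 × 0.2 = 0.76 and for AML is 0.24.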

84

Bayesian Networks for Gene Expression Analysis (6): Classification Results

• Prediction error of this Bayesian network (given all attribute values)

                 Bayesian network    Weighted voting
Training data    1/38                1/38
Test data        8/34                4/34

• The result can be improved by more appropriate data preprocessing.

43

85

Non-negative Matrix Factorization

86

Non-negative Matrix Factorization (1)

• Method
  - Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
• NMF (non-negative matrix factorization)

$$G \approx WH, \qquad (G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$

  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)

$$G_{i\mu},\ W_{ia},\ H_{a\mu} \ge 0$$

• NMF as a latent variable model

$$\mathbf{g} \approx W\mathbf{h}$$

[Figure: latent variables h1, h2, ..., hr connected through W to the observed genes g1, g2, ..., gn]

44

87

NMF (2): Clustering Gene Expression Data

G (7,129 genes × 38 samples) ≈ W (7,129 genes × 2 factors) × H (2 factors × 38 samples, the encoding)

• Factors can capture the correlations between the genes using the values of expression level.
• Cluster training samples into 2 groups by NMF
  - Assign each sample to the factor (class) which has the higher encoding value.
  - Accuracy: 0 to 1 errors for the training data set

[Figure: two factors H1 and H2 connected through W to the genes g1, g2, g3, g4, ..., g7,129]

88

NMF (3): Learning Procedure

Input: gene expression data matrix G (n × m)

Output: base matrix W (n × k), encoding matrix H (k × m)

n: data size, m: number of genes, k: number of latent variables

Objective function:

$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$$

subject to

$$W_{ia} \ge 0, \qquad \sum_i W_{ia} = 1, \qquad H_{a\mu} \ge 0$$

Procedure

1. Initialize W, H with random numbers.

2. Update W, H iteratively until max_iteration or some criterion is met.

$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}, \qquad H_{a\mu} \leftarrow H_{a\mu} \sum_i W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$
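The multiplicative update rules can be sketched in plain Python for small matrices. The sketch assumes a strictly positive data matrix `g`, random positive initialisation and a fixed iteration count; the function names are hypothetical, and a practical implementation for 7,129 genes would use an array library instead of nested lists.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def nmf(g, r, n_iter=100, seed=0):
    """Factorise a positive matrix g (n x m) as W (n x r) times H (r x m)."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(n_iter):
        wh = matmul(w, h)
        # W_ia <- W_ia * sum_mu (G_imu / (WH)_imu) * H_amu
        w = [[w[i][a] * sum(g[i][u] / wh[i][u] * h[a][u] for u in range(m))
              for a in range(r)] for i in range(n)]
        # W_ia <- W_ia / sum_j W_ja, so every column of W sums to 1
        column_sums = [sum(w[j][a] for j in range(n)) for a in range(r)]
        w = [[w[i][a] / column_sums[a] for a in range(r)] for i in range(n)]
        wh = matmul(w, h)
        # H_amu <- H_amu * sum_i W_ia * G_imu / (WH)_imu
        h = [[h[a][u] * sum(w[i][a] * g[i][u] / wh[i][u] for i in range(n))
              for u in range(m)] for a in range(r)]
    return w, h


def assign_factor(h, u):
    """Assign column (sample) u to the factor with the highest encoding value."""
    return max(range(len(h)), key=lambda a: h[a][u])
```

Because each update only multiplies by non-negative quantities, W and H stay non-negative throughout, which is what makes the factors readable as additive parts.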

45

89

NMF (4) Learning Curve

[Figure: learning curve. Log likelihood against the number of iterations (1 to 101)]

90

NMF (5) Clustering Result

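[Figure: scatter plot of the two encoding values of each training sample (axes from 0 to about 7), with ALL and AML samples marked separately]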

46

91

NMF (6): Diagnosis

• For each test sample g, estimate the encoding vector h that best approximates the sample.
  - W is the basis matrix computed during training (fixed).
  - As in training, assign each sample to the factor (class) which has the highest encoding value.
• Accuracy: 1 to 2 error(s) for the test data set

g (7,129 genes) ≈ W (7,129 genes × 2 factors) × h (2 factors)

[Figure: two factors h1 and h2 connected through W to the genes g1, g2, g3, g4, ..., g7,129]

92

Generative Topographic Mapping

47

93

Generative Topographic Mapping (1)

• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Figure: a grid of points in the latent space (axes x1, x2) mapped by y(x;W) onto a curved surface in the data space (axes t1, t2, t3)]

94

GTM (2) Learning Algorithm (EM)

• Generate the grid of latent points.
• Generate the grid of basis function centers.
• Compute the matrix of basis function activations Φ.
• Initialize the weights W in Y = ΦW and the noise variance β.
• Compute Δ_{n,k} = ||t_n − φ_k W||² = ||t_n − y_k(x,W)||² for each n, k.
• Repeat
  - Compute the responsibility matrix R using Δ and β. [E-step]
  - Compute the diagonal matrix G with G_kk = Σ_n R_kn.
  - Update W by solving Φ^T G Φ W = Φ^T R T. [M-step]
  - Compute Δ = ||t − ΦW||².
  - Update β.
• Until convergence

48

95

GTM (3): Visualization

• Posterior distribution in latent space given a data point t:

X(t) ~

• For a whole set of data: for each t, plot in the latent space
  - Posterior mode:
  - Posterior mean:

96

GTM (4) Clustering Experiment

• Gene Selection
  - Select about 50 genes out of 7,129 based on three test scores of cancer diagnosis.
    • Correlation metric (similar to a t-test)
    • Wilcoxon test scores (a nonparametric t-test)
    • Median test scores (a nonparametric t-test)
• Clustering & Visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.

49

97

List of Genes Selected

98

GTM (5): Learning Curve

50

99

GTM (6): Clustering Result

Genes with high expression levels in the case of ALL (large positive P-metric value)

Genes with high expression levels in the case of AML (large negative P-metric value)

100

Summary

• Challenges of Machine Learning Applied to Biosciences
  - Huge data size
  - Noise and data sparseness
  - Unlabeled and imbalanced data
• Biosciences for Machine Learning
  - New application
  - Biosystems are existence proofs for ideal AI systems
  - Provides interesting metaphors and algorithms!
  - Synergy effects between biosciences and artificial intelligence
