
Machine Learning in Bioinformatics

2008. 2. 21.

Rhee, Je-Keun

© 2008 SNU CSE Biointelligence Lab, http://bi.snu.ac.kr/


Biological Background: Central Dogma

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/central_dogma.html


DNA, RNA, Protein

E.g., DNA sequence:

AGGATTTAGAACAAAATCCGAAAAGGAGTGACATAACATTACAACATTAGGAATAAAGTAGATAAAACATTGATCAAAGGAAATTTAGTTATAGTTGAAAATTTTTATTATAAAAAGGGAACGAAGGGAGATTTTTTCAAGGGCATTTTGGTCCACCCTCTTGAGTTTTCCAGTTGTTGTAGCAGGAGCAAACTTGTTTGTTCCCATAGTAACCCGGAGGCACACAGAGACACTTCCTGCAGCATTTGTTGCAGAACGTAATGCAAGCCTTGTGGTACTGTGTCTTTTTACACCTCCTATCACATTCCGATGGGCATTGGGTACGTTTCCAGGCTTCCTGGGTCCATAACCGTTTCTGGCTCCAACTTCAACATTAGATCCACCTTGGAGGCCATAACCATGGGTTTGAAAGCATGAAGAGGGGCAATGAAGGGCCAAGAGGNAGATAGNCCCATATGGCCTANNCATTTCCAGGTTTGGGGNATTGGTATCCAAAGACCAACAACCCCCCAAACCCCCCAAACAGGTTTAGCCCCTTGGGG
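As a small illustration of working with such a sequence, the sketch below transcribes a DNA fragment to RNA and computes its GC content. The function names are illustrative only, and treating the fragment as the coding strand and ignoring the ambiguous base N are assumptions made for this example.

```python
from collections import Counter

def transcribe(dna: str) -> str:
    """Return the RNA form of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")

def gc_content(dna: str) -> float:
    """Fraction of G and C among unambiguous bases; ambiguous N is ignored."""
    counts = Counter(dna.upper())
    known = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / known if known else 0.0

# First 20 bases of the example sequence on this slide.
fragment = "AGGATTTAGAACAAAATCCG"
print(transcribe(fragment))
print(round(gc_content(fragment), 2))
```

Features of this kind (base composition, k-mer counts) are typical numerical inputs for the learning methods listed later in the slides.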


Main topics in Bioinformatics

Biological data analysis
• Sequence analysis (DNA, RNA, protein, ...)
• Structural bioinformatics (protein structure, RNA structure, ...)
• Gene expression
• Systems biology
• Text mining
• Etc.


Machine learning methods for biological data analysis

• Feature selection
• Classification: decision trees, artificial neural networks, Bayesian networks, support vector machines, k-NN, etc.
• Clustering: k-means, hierarchical clustering, PCA, SOM, etc.
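To make the classification entry concrete, here is a minimal sketch of k-NN, one of the classifiers listed above. The two-value expression profiles and the class labels "A" and "B" are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression profiles with made-up class labels.
train = [
    ((1.0, 0.9), "A"), ((1.2, 1.1), "A"), ((0.8, 1.0), "A"),
    ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.9, 3.0), "B"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (3.0, 3.0)))
```

k-NN needs no training step, which makes it a convenient baseline before trying the other classifiers in the list.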


Gene finding

GRAIL


Protein structure prediction


Gene expression analysis

microarray


Data preparation

[Figure: microarray image samples (Sample 1, Sample 2, ..., Sample i, ..., Sample k, ..., Sample n) are converted by image analysis into numerical data (samples by genes) for data mining.]


An Example: Hierarchical Clustering
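A minimal sketch of agglomerative hierarchical clustering follows, assuming average linkage and Euclidean distance; the slide does not say which linkage or distance was used. The expression profiles are made up.

```python
import math
from itertools import combinations

def average_linkage(c1, c2, data):
    """Mean Euclidean distance over all cross-cluster pairs of profiles."""
    total = sum(math.dist(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], data),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical expression profiles: one tuple per gene, one value per sample.
profiles = [
    (0.1, 0.2, 0.1),
    (0.2, 0.1, 0.2),
    (2.0, 2.1, 1.9),
    (2.1, 2.0, 2.2),
    (5.0, 5.1, 4.9),
]
print(agglomerate(profiles, 3))
```

Recording the order of merges, rather than stopping at a fixed cluster count, yields the dendrogram usually drawn next to an expression heat map.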


Example: Bayesian network classifier

<Network structure>
Nodes: Zyxin, Leukemia, MB-1, C-myb, LTC4S

• Zyxin: Zyxin
• Leukemia: ALL or AML
• LTC4S: Leukotriene C4 synthase (LTC4S) gene
• C-myb: C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds
• MB-1: MB-1 gene

$$P(X) = \prod_{i=1}^{n} P(X_i \mid Pa_i)$$

A Bayesian network classifier for acute leukemias [Hwang et al. 2001]
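The factorization of the joint probability into per-node conditionals can be sketched as follows. The three-node structure (Class as the parent of GeneA and GeneB) and every probability value are hypothetical; they are not the network or the parameters of Hwang et al.

```python
from itertools import product

# Hypothetical structure: Class -> GeneA, Class -> GeneB (all variables binary).
parents = {"Class": (), "GeneA": ("Class",), "GeneB": ("Class",)}

# cpt[node][parent values] = P(node = 1 | parents); the numbers are made up.
cpt = {
    "Class": {(): 0.4},
    "GeneA": {(0,): 0.2, (1,): 0.9},
    "GeneB": {(0,): 0.7, (1,): 0.1},
}

def joint(assignment):
    """P(X) as the product over nodes of P(X_i | Pa_i)."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

def posterior_class(gene_a, gene_b):
    """P(Class = 1 | GeneA, GeneB), obtained by normalizing the joint."""
    scores = [joint({"Class": c, "GeneA": gene_a, "GeneB": gene_b}) for c in (0, 1)]
    return scores[1] / sum(scores)

total = sum(joint(dict(zip(parents, values))) for values in product((0, 1), repeat=3))
print(round(total, 6))
print(round(posterior_class(1, 0), 3))
```

Classification then amounts to picking the class value with the larger posterior given the observed gene states.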


Promoter Prediction


microRNA prediction using a probabilistic learning model

Probabilistic co-learning model

• The first miRNA prediction based on a machine learning method: a general-purpose miRNA prediction algorithm
• Predicts miRNAs for 8 species, including human, mouse, and fly, and provides an annotated genome browser

[Figure, panels (a)-(c): hidden Markov model over the paired bases of a miRNA hairpin, showing a start state S and an end state E, two copies of the structural states (M+, N+, I+, D+ and M-, N-, I-, D-) grouped as true states and false states, the emission symbols (base pairs such as G|U, -C, G|C, U|U, U-), and an example state sequence π with its transition probabilities (T0M, TMD, TDM, TMN, TMI) and emission probabilities (EM(GU), ED(-C), EM(GC), EN(UU), EM(GU), EI(U-)).]

<Structural states>
• M: match state
• N: mismatch state
• D: deletion state
• I: insertion state

<Hidden states>
• T: true state (mature miRNA)
• F: false state

Human microRNA prediction through a probabilistic co-learning model of sequence and structure, Jin-Wu Nam, Ki-Roo Shin, Jinju Han, Yoontae Lee, V. Narry Kim, Byoung-Tak Zhang, Nucleic Acids Research, 33(11):3570-3581, 2005.

ProMiR II: A web server for the probabilistic prediction of clustered, nonclustered, conserved and nonconserved microRNAs, Jin-Wu Nam, Jin-Han Kim, Sung-Kyu Kim, Byoung-Tak Zhang, Nucleic Acids Research, 34:W455-W458, 2006.
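The slide scores a state sequence π by combining transition and emission probabilities. A minimal sketch of that computation is given below, using the structural states M, D, N, I and the emitted base pairs named in the figure. All probability values are made up, and the true/false hidden layer is left out for brevity, so this is not the published model.

```python
import math

# Hypothetical transition probabilities between structural states ("S" = start).
transition = {
    ("S", "M"): 0.7, ("S", "N"): 0.3,
    ("M", "M"): 0.6, ("M", "N"): 0.2, ("M", "D"): 0.1, ("M", "I"): 0.1,
    ("N", "M"): 0.5, ("N", "N"): 0.5,
    ("D", "M"): 1.0,
    ("I", "M"): 1.0,
}

# Hypothetical emission probabilities of base pairs ("-" marks a gap).
emission = {
    ("M", "GU"): 0.2, ("M", "GC"): 0.4,
    ("N", "UU"): 0.3,
    ("D", "-C"): 0.25,
    ("I", "U-"): 0.25,
}

def path_log_prob(states, symbols):
    """log P(symbols, states): sum of log transition and log emission terms."""
    log_p = 0.0
    prev = "S"
    for state, symbol in zip(states, symbols):
        log_p += math.log(transition[(prev, state)])
        log_p += math.log(emission[(state, symbol)])
        prev = state
    return log_p

# State sequence and emissions matching the labels shown in the figure.
states = ["M", "D", "M", "N", "M", "I"]
symbols = ["GU", "-C", "GC", "UU", "GU", "U-"]
print(path_log_prob(states, symbols))
```

Working in log space avoids numerical underflow when the hairpin is long; a Viterbi search would maximize this same quantity over all state sequences.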


Prediction for cardiovascular disease

• Using aptamer chip data
• Disease prediction by decision tree, artificial neural networks, Bayesian networks, and support vector machines


microRNA target prediction

Feature set for target prediction

• Learning from validated data: machine learning applied to experimentally validated data
• Prediction of a many-to-many miRNA:target network

Target prediction by support vector machine
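For orientation, the sketch below evaluates the decision function of an already-trained kernel SVM, f(x) = Σ αᵢ yᵢ K(svᵢ, x) + b. The RBF kernel, the two unnamed numeric features, and the support vectors with their weights are all assumptions made for illustration; the slide does not specify the kernel or the feature set.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def decision_value(x, support_vectors, bias=0.0):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    return bias + sum(alpha * y * rbf_kernel(sv, x) for sv, y, alpha in support_vectors)

def predict(x, support_vectors, bias=0.0):
    """Return +1 (predicted target) or -1 (predicted non-target)."""
    return 1 if decision_value(x, support_vectors, bias) >= 0 else -1

# Hypothetical support vectors: (feature vector, label, alpha). Values are made up.
support_vectors = [
    ((0.9, 0.8), +1, 1.0),
    ((0.8, 0.9), +1, 0.5),
    ((0.1, 0.2), -1, 1.0),
    ((0.2, 0.1), -1, 0.5),
]
print(predict((0.85, 0.85), support_vectors))
print(predict((0.15, 0.15), support_vectors))
```

In practice the support vectors, weights, and bias come from training on the experimentally validated miRNA:target pairs rather than being written by hand.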


Multiobjective Optimization-Based Oligonucleotide Optimization

[Figure: Multi-Objective Evolutionary Algorithm. Parent_t and Offspring_t are combined (+) and ranked by non-dominated sorting to give Parent_t+1; genetic operations then produce Offspring_t+1. Output: resulting probes.]

• Uses the user's multiple requirements as multiple objective functions
• Objective functions can be changed, and new ones added, depending on the application
• Usefulness demonstrated by applying the method to probe design for Biomedlab's HPV genotyping chip

Multiobjective evolutionary optimization of DNA sequences for reliable DNA computing, S.-Y. Shin, I.-H. Lee, D. Kim, and B.-T. Zhang, IEEE Trans. Evo. Comp., 2005.
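The non-dominated sorting step named in the figure can be sketched as below. This is a simple quadratic-time version written for clarity, assuming every objective is minimized; the candidate scores are made up and the two objectives are left unnamed.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Split point indices into Pareto fronts, best front first."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Hypothetical probe candidates scored on two objectives (lower is better).
candidates = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5), (2, 6)]
print(non_dominated_sort(candidates))
```

The next parent population is filled front by front, so candidates that are not beaten on all objectives at once survive even if they are weak on one of them.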


Phylogenetic Tree Construction based on kernel methods

Construction of phylogenetic trees by kernel-based comparative analysis of metabolic networks, S. J. Oh, J.-G. Joung, J.-H. Chang and B.-T. Zhang, BMC Bioinformatics, 7:284, 2006.

• Phylogenetic analysis based on metabolic pathways
• Proposes a new methodology that compares the similarity between species using a graph kernel
• Contributes to the discovery of candidate targets of interest through biological pathway analysis
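As a rough illustration of kernel-based comparison of metabolic networks, the sketch below uses a deliberately simple stand-in kernel, the number of shared edges, and converts it into a distance. This is not the graph kernel of the cited paper; the species names, metabolite names, and edges are hypothetical.

```python
import math
from itertools import combinations

def edge_kernel(g1, g2):
    """Count shared edges: a simple set-intersection kernel between networks."""
    return len(g1 & g2)

def kernel_distance(g1, g2):
    """Distance induced by the normalized kernel: sqrt(2 - 2 * k_norm)."""
    k_norm = edge_kernel(g1, g2) / math.sqrt(edge_kernel(g1, g1) * edge_kernel(g2, g2))
    return math.sqrt(max(0.0, 2.0 - 2.0 * k_norm))

# Hypothetical metabolic networks as sets of (substrate, product) edges.
networks = {
    "species_A": {("glc", "g6p"), ("g6p", "f6p"), ("f6p", "fbp")},
    "species_B": {("glc", "g6p"), ("g6p", "f6p"), ("f6p", "pyr")},
    "species_C": {("ac", "accoa"), ("accoa", "cit"), ("glc", "g6p")},
}
distances = {
    (a, b): kernel_distance(networks[a], networks[b])
    for a, b in combinations(networks, 2)
}
closest = min(distances, key=distances.get)
print(closest)
```

A distance matrix of this kind can be passed to a standard tree-building procedure such as hierarchical clustering or neighbor joining, where the closest pair is joined first.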


Tree-Based Biochemical Network Identification

[Figure: Genetic Programming loop. Initialization, population, selection, genetic operators, offspring, survival of the fittest.]

• Proposes the S-tree representation for representing biochemical networks efficiently
• Develops a genetic programming technique that learns S-trees from time-series data
• Enables system-level analysis when identifying new network structures

Identification of biochemical networks by S-tree based genetic programming, D.-Y. Cho, K.-H. Cho, and B.-T. Zhang, Bioinformatics, 22(13):1631-1640, 2006.

[Figure: time-series profiles, S-tree representation, biochemical network modeling.]


Co-clustering by probabilistic evolutionary learning for module detection

Population-based Probabilistic Learning

• Proposes a co-clustering analysis technique that uses diverse genome-wide data
• Searches for mRNA-miRNA modules with a co-clustering evolutionary algorithm
• Contributes to identifying the functional relationships between mRNAs and miRNAs

[Figure: heterogeneous datasets (miRNA expression and mRNA expression over Array1-Array6, and target scores of miRNAs, for miRNA1-miRNA8 and mRNA1-mRNA13) are combined to find coherent transcriptional modules (miRNA modules).]

Discovery of microRNA-mRNA modules via population-based probabilistic learning, J.-G. Joung, K.-B. Hwang, J.-W. Nam, S.-J. Kim and B.-T. Zhang, Bioinformatics, 23(9):1141-1147, 2007.


Co-clustering by probabilistic latent variable model with Heterogeneous Datasets

[Figure: position weight matrices (PWMs), stem cell subpopulations, transcription factors, gene expression datasets.]

• Proposes a co-clustering analysis technique that uses heterogeneous data
• Searches for regulatory modules with a co-clustering latent variable model
• Contributes to identifying systematic regulatory mechanisms

Identification of regulatory modules by co-clustering latent variable models: stem cell differentiation, J.-G. Joung, D. Shin, R.-H. Seong and B.-T. Zhang, Bioinformatics, 22(16):2005-2011, 2006.


Modeling for temporal gene expression profiles

[Figure: workflow with the stages SMD, gene expression profiles, gene pair selection, update pattern prototype, update latent grid and build interaction site map, visualize gene pairs in the interaction site map, clustering, extract interactive gene pairs.]

Self-organizing latent lattice models for temporal gene expression profiling, B.-T. Zhang, J. Yang, and S. W. Chi, Machine Learning, 52(1/2):67-89, 2003.

• Enables visualization of the expression patterns of multiple related genes
• Generates, from hidden nodes, the expression values that correspond to specific dynamic expression patterns
• Searches for functionally related gene pairs


Hierarchical Bayesian Networks for Large-scale Network Construction

• Hierarchical probabilistic graphical models for large-scale data analysis, Hwang, Kyu-Baek, Ph.D. Thesis, School of Computer Science and Engineering, Seoul National University, August 2005.

• Learning hierarchical Bayesian networks for large-scale data analysis, K.-B. Hwang, B.-H. Kim, and B.-T. Zhang, Lecture Notes in Computer Science, 4232:670-679, 2006.

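• Compresses the information in large-scale data so that summary Bayesian networks of arbitrary size can be built
• Stepwise visualization of large-scale Bayesian networks
• Reflects the characteristics of complex networks found in nature: the properties of Bayesian networks (a probabilistic and statistical representation of the relationships among factors) combined with the properties of clustering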


Text Mining

• Development of an integrated text mining platform that performs recognition, extraction, inference, and visualization of gene/protein interactions in the biomedical literature
• Development of a tree kernel-based interaction sentence classifier for efficient text mining
• Development of a protein-protein interaction prediction model that predicts new interactions by applying data mining to known interaction data

A tree kernel-based method for protein-protein interaction mining from biomedical literature, Jae-Hong Eom, Sun Kim, Seong-Hwan Kim, Byoung-Tak Zhang, Lecture Notes in Bioinformatics, 3886:42-52, 2006.

PubMiner: machine learning-based text mining system for biomedical information mining, J.-H. Eom and B.-T. Zhang, Genomics & Informatics, 2(2):99-106, 2004.

Prediction of implicit protein-protein interaction by optimal associative feature mining, J.-H. Eom, J.-H. Chang and B.-T. Zhang, Lecture Notes in Computer Science, 3177:85-91, 2004.