
5/31/2014

Cukurova University

University of Aveiro

MACHINE LEARNING for

COMPUTATIONAL BIOLOGY Associate Prof. Dr. Turgay Ibrikci

[email protected]

ÇUKUROVA UNIVERSITY

Adana, TURKEY

www.cu.edu.tr

Outline

Who we are

Introduction

Researches

– Experiments

– Results and Conclusion

Questions

Our Team

İrem ERSOZ KAYA, PhD. Mersin University, Technical Technics Faculty, Software Engineering Dept., Mersin, Turkey

Ayça ÇAKMAK PEHLIVANLI, PhD. Mimar Sinan Fine Arts University, Statistics Dept., Istanbul, Turkey

Mustafa KARABULUT, PhD. Gaziantep University, Gaziantep, Turkey

Doctorate Students

Esra Mahreseci KARABULUT

Jale BEKTAS

Collaborators

Prof. Dr. Jessica Kissinger, University of Georgia, USA

Prof. Dr. Okan ERSOY, Purdue University, USA

Prof. Dr. Seyhan TUKEL, Cukurova University, Turkey

Supporters

Çukurova University Research Fund

The Scientific and Technological Research Council of Turkey (TUBITAK)

– Subdivisions: EEEAG, TBAG

Introduction

Machine Learning Methods

– Protein secondary structures,

– Disordered protein structures,

– Drug design,

– Motif finding on DNA.

Prediction of the ordered/disordered regions of a protein is one of the main problems in drug design.

Ordered/disordered regions affect a protein's functions.

Druglike selection, potential drug candidates

Statistical and computational learning methods help to predict the disordered regions.


What is Bioinformatics?

Bioinformatics is the organization and the analysis of biological data.

[Diagram: Biological Data, Computational Tool, Biological Information]

Biologists collect molecular data: DNA and protein sequences, gene expression, etc.

Computer scientists (+ mathematicians, statisticians, etc.) develop tools, software, and algorithms to store and analyze the data.

Bioinformaticians study biological questions by analyzing biological data.

What is Bioinformatics?

[Diagram: Bioinformatics at the intersection of informatics (computer science, computer engineering, information science), biology and other natural sciences, mathematics and statistics, and ethical, legal, and social implications]

Bioinformatics is more than biology plus computer science: it is a field in which multiple sciences and information technology merge into a single discipline.

Bioinformatics concerns...

Prediction

Comparison

Pattern Recognition

Data Modeling

Data Mining

Optimization

Rendering and Display

Doing it all on a computer….

Biological Data

Genes

– DNA sequences of A, T, C, G

– Annotated with function, features

Proteins

– Amino acid sequences: sequences of 20 letters

– Annotated with structure, function etc.

Proteins

A protein consists of a linear sequence of the twenty naturally occurring amino acids

Protein structures are described through four main hierarchical levels: Primary, Secondary, Tertiary, Quaternary.

[Figure: structure of an amino acid, showing the amino group, the carboxyl group, and the side chain (R)]


Levels of Protein Structure

Prediction of Disordered Regions of Proteins with New N-Pieces Naive Bayes Algorithm*

13250 amino acid-chains

The SVM with Gaussian-RBF kernel function

Three classes for secondary structures

Sliding windows

The window size 17

Leave-one-out validation

* U. Orhan, I. Ersoz, T. Ibrikci, (2007), Intelligent Engineering Systems Through Artificial Neural Networks: Smart System Engineering (ANNIE'07), 17: 43-48, 11-14 Nov. 2007, St. Louis, Missouri, USA.

Dataset

The dataset R80*

The dataset includes 80 proteins taken from the study of Yang et al.

The dataset has two categories: train and test datasets

* Z. R. Yang, R. Thomson, P. McNeil and R. M. Esnouf. RONN: The Bio-Basis Function Neural Network Technique Applied To The Detection of Natively Disordered. Bioinformatics, 21, 3369-3376, 2005

Number of chains: 80

Number of ordered regions: 151

Number of disordered regions: 183

Number of residues in the ordered regions: 29909

Number of residues in the disordered regions: 3649

Total residues in the dataset: 33558

N-Pieces Naïve Algorithm

The N-Pieces step acts as a pre-classifier.

The data are divided into N pieces by threshold values.

After the N-Pieces step, classical Naïve Bayes is applied for classification.

The N-Pieces step is unsupervised (it clusters the data into N pieces), whereas Naïve Bayes is a supervised classification method.
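The two stages described above can be sketched roughly as follows. This is only an illustration, not the published NP-NBL algorithm: the equal-width thresholds, the Laplace smoothing, and all names (`fit_thresholds`, `to_piece`, `NPiecesNaiveBayes`) are assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_thresholds(values, n_pieces):
    """Equal-width cut points that split [min, max] into n_pieces (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_pieces or 1.0
    return [lo + width * k for k in range(1, n_pieces)]

def to_piece(value, thresholds):
    """Index of the piece a value falls into (number of thresholds it reaches)."""
    return sum(value >= t for t in thresholds)

class NPiecesNaiveBayes:
    """Unsupervised split into N pieces, then a categorical Naive Bayes on piece indices."""

    def __init__(self, n_pieces):
        self.n_pieces = n_pieces

    def fit(self, X, y):
        n_features = len(X[0])
        # Stage 1 (unsupervised): thresholds are chosen without looking at labels.
        self.thresholds = [fit_thresholds([row[f] for row in X], self.n_pieces)
                           for f in range(n_features)]
        # Stage 2 (supervised): count pieces per class.
        self.class_counts = defaultdict(int)
        self.counts = defaultdict(int)  # (class, feature, piece) -> count
        for row, label in zip(X, y):
            self.class_counts[label] += 1
            for f, v in enumerate(row):
                self.counts[(label, f, to_piece(v, self.thresholds[f]))] += 1
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n_c in self.class_counts.items():
            score = math.log(n_c / total)
            for f, v in enumerate(row):
                k = self.counts.get((label, f, to_piece(v, self.thresholds[f])), 0)
                score += math.log((k + 1) / (n_c + self.n_pieces))  # Laplace smoothing
            if score > best_score:
                best, best_score = label, score
        return best

# Toy data: one feature, label 1 = disorder, 0 = order (made-up values).
X = [[0.10], [0.20], [0.15], [0.90], [0.80], [0.95]]
y = [0, 0, 0, 1, 1, 1]
model = NPiecesNaiveBayes(n_pieces=4).fit(X, y)
print(model.predict([0.12]), model.predict([0.85]))
```

In this sketch N only controls the resolution of the discretization, which is in line with the remark on the slides that a suitable N depends on the application.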

Experiments

Two classes: ordered/disordered regions

Normalization of the dataset

Sliding windows

Outputs evaluated with five different measures

Eight different N values

The search range for N is 101 to 201.

The results of the “N”s - II

The N values were found to be 101, 133, and 201 for the N-pieces algorithm.

Conclusions

We presented a new Naive Bayesian Learning algorithm called NP-NBL.

The algorithm was applied to the prediction of disordered regions of proteins.

Partition counts of 101, 133, and 201 gave the best values of specificity and correct classification.

However, the optimal N value depends on the individual application.

COMPUTATIONAL PREDICTION OF DISORDERED REGIONS IN PROTEINS - PhD Thesis

Structure-Function Paradigm:

– The 3D structure of a protein is a prerequisite for its biological function.

– The tenet arose more than 100 years ago with Fischer’s “lock and key” model.

– Generally, loss of function is associated with the lack of a specific 3D structure.

DNA → mRNA → AA sequence → 3D structure → Function

Objective

Purpose?

– To develop an accurate computational method that assigns proteins to the ordered or disordered structural class based on different biochemical and physical features of amino acids.

Why?

– Accurate prediction of disorder is a demanding problem, and it matters for the structural and functional identification of proteins.

Protein Folding Problem

Structural Prediction

– The 3D structure (‘the fold’) is uniquely determined by the sequence.

– This breakthrough brought Anfinsen the 1972 Nobel Prize.

– Experimental and computational methods have been applied to this challenge, widely known as “The Protein Folding Problem”.

Protein Non-Folding Problem

Intrinsically Disordered Proteins (IDP)

– Studies in structural genomics indicate that numerous protein segments remain unfolded in their native states.

– Contrary to the structure-function paradigm, regions that fail to fold into a fixed 3D structure can still exhibit function.

– Such proteins are generally referred to as “natively unfolded” or “intrinsically disordered”.

– Intrinsically disordered proteins are also involved in several diseases, such as Alzheimer's disease, Parkinson's disease, and certain types of cancer.

– The structural identification of these proteins has been named “The Protein Non-Folding Problem”.

An Example: Calcineurin


Computational Prediction

Alternative to experimental methods

Disorder prediction based on amino acid sequence

Structural properties of amino acids can be used as discriminators for characterizing disorder

– Flexibility, amino acid frequency, complexity, charge, secondary structure, etc.

The most commonly preferred machine learning techniques are Artificial Neural Networks and Support Vector Machines

Computational Methods for Disorder Prediction

NAME       YEAR  ACCURACY  COMPUTATIONAL METHOD                  BASED ON
PONDR      2001  70%       Feed Forward Neural Networks          Physicochemical Properties, Frequency
DisEMBL    2003  64%       Feed Forward Neural Networks          Parameters on Different Definitions of Disorder
DISOPRED2  2004  93%       Linear Support Vector Machines        Position Specific Score Matrices
FoldIndex  2005  83%       Low Hydrophobicity/High Net Charge    Hydrophobicity, Net Charge
RONN       2005  85%       Regional Order Neural Network         Homology Alignment Scores
PSSMP      2006  80%       Radial Basis Function Neural Network  Condensed Position Specific Score Matrices

Data and Representation

Main Data Set (Training and Testing)

– 80 completely ordered proteins

– 79 completely disordered proteins

Blind Testing Set: 80 partially disordered proteins

Windowing

– Sliding windows technique

– Window length of 21 residues

– Each window is an input pattern

– The input pattern is labelled with the class of the central amino acid within the window (‘1’ for disorder, ‘0’ for order)

Knowledge Presentation

– Average information within the window

– Each feature is represented by only one attribute in a pattern

>1B8Z

MNKKELIDRVAKKAGAKKKDVKLILDTILETITEALAKGEKVQIVGFGSF

EVRKAAARKGVNPQTRKPITIPERKVPKFKPGKALKEKVK

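PSI-BLAST → PSSM for the sequence (one row per residue of the sequence, one column for each of the 20 amino acids)

Pattern for the ith residue: the averaged sum of the row values within window w, for each column

[Figure: PSSM rows with the window w around the ith residue; excerpt of the matrix rows:]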

-4 -4 -5 -6 -4 -3 -5 -6 -5 -1 2 -4 10 -3 -5 -4 -3 -4 -4 0

-3 -2 6 -2 -4 -2 -3 -4 -3 0 -3 -2 -4 -5 -4 -1 5 -6 -5 -3

-3 2 -1 -3 -6 -1 -2 -4 -3 -5 -4 7 -3 -6 -4 -3 -3 -6 -5 -4

3 -2 0 -3 -4 0 -2 -2 -3 -3 -3 1 -1 -5 -2 4 1 -5 -4 -3

-3 -3 -2 5 -6 2 6 -4 -1 -6 -5 -1 -5 -6 -4 -3 -3 -6 -5 -5

.

.

.

4 4 -3 -2 -4 1 -1 -2 -3 -3 -2 2 -2 -5 -4 0 -2 -5 -3 -1

-2 -5 -5 -6 -3 -5 -5 -6 -5 3 1 -5 0 -1 -5 -4 -2 -5 -3 6

-3 2 -2 -2 -5 3 3 -4 -2 -5 -4 5 -3 -5 -3 -2 -2 -5 -4 -4

-2 -2 -0 -2 -5 -1 -1 -4 -3 -3 -3 0 -1 -5 -4 -1 -1 -5 -5 -3
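The windowing and averaging described above can be sketched as follows. It is a minimal illustration under stated assumptions: the function name `window_features` is hypothetical, and truncating the window at the sequence ends is an assumption, since the slides do not say how the ends are handled.

```python
def window_features(pssm, labels, window=21):
    """Build one (pattern, label) pair per residue.

    pattern: column-wise mean of the PSSM rows inside the window centred on the residue,
             so each PSSM column contributes exactly one attribute.
    label:   class of the central residue (1 = disorder, 0 = order).
    """
    half = window // 2
    n = len(pssm)
    patterns = []
    for i in range(n):
        # Assumption: windows are truncated at the sequence ends rather than padded.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        rows = pssm[lo:hi]
        pattern = [sum(col) / len(rows) for col in zip(*rows)]
        patterns.append((pattern, labels[i]))
    return patterns

# Tiny made-up example with 2 columns instead of 20 and a window of 3.
toy_pssm = [[1, 2], [3, 4], [5, 6]]
toy_labels = [0, 1, 0]
for pattern, label in window_features(toy_pssm, toy_labels, window=3):
    print(pattern, label)
```

With a real PSSM each pattern would have 20 attributes, one per amino acid column.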

Training

The main data set was divided into 6 roughly equal subsets

Each subset includes a balanced number of disordered and ordered residues

1 subset was used for validation

5 subsets were used for training via 5-fold cross validation

5 repetitions of the training/testing procedure

Performance Assessment

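Accuracy (Acc): $\mathrm{Acc} = \dfrac{TP + TN}{TP + FP + TN + FN}$

Sensitivity (Sens): $\mathrm{Sens} = \dfrac{TP}{TP + FN}$

Specificity (Spec): $\mathrm{Spec} = \dfrac{TN}{TN + FP}$

Matthews Correlation Coefficient (Mcc): $\mathrm{Mcc} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(FP + TN)(TN + FN)(FN + TP)}}$

Probability Excess (ProbEx): $\mathrm{ProbEx} = \dfrac{TP \times TN - FP \times FN}{(TP + FN) \times (TN + FP)}$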
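The five measures reported in the results tables (Sens, Spec, Acc, Mcc, ProbEx) can be computed from the confusion-matrix counts as in the sketch below. The function name `disorder_metrics` and the example counts are made up for illustration; the formulas are the standard definitions.

```python
import math

def disorder_metrics(tp, fp, tn, fn):
    """Return Sens, Spec, Acc, Mcc, and ProbEx from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc_den = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    # Probability excess is algebraically equal to Sens + Spec - 1.
    prob_ex = (tp * tn - fp * fn) / ((tp + fn) * (tn + fp))
    return {"Sens": sens, "Spec": spec, "Acc": acc, "Mcc": mcc, "ProbEx": prob_ex}

# Made-up counts, only to show the call.
print(disorder_metrics(tp=80, fp=30, tn=70, fn=20))
```

Because ProbEx equals Sens + Spec - 1, it penalizes both under-prediction and over-prediction of disorder.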


Methods

Border Vectors Detection and Adaptation

Border Vectors Detection and Extended

Adaptation

Border Vectors Detection and Adaptation (BVDA)*

Based on the detection of so-called border vectors, the adaptation of these vectors by adjusting their positions for wrongly classified input patterns, and the addition of new vectors during training

– Detection:

* IEEE TRAN. ON GEOSCIENCE AND REMOTE SENSING, VOL. 45, NO. 12, 3880-3893 DEC. 2007

[Equations: detection of the border vectors by minimum-distance (arg min) search over the training vectors, and selection of the winning border vector $b_w$ for an input $x_j$ as $w = \arg\min_k D(x_j, b_k)$, followed by a test on whether the class of $b_w$ matches $y_j$]

Border Vectors Detection and Adaptation (BVDA)

– Adaptation:

[Equations: update rules for the winning border vector $b_w$ and the border vector $b_l$, each computed from its previous value $b(t)$ and a step $\eta(t) \cdot (x_j - b(t))$; matching updates of the class means $m_w$ and $m_l$; and the addition of new border vectors and class means to the sets $B$ and $M$ during training]

Border Vectors Detection and Extended Adaptation (BVDEA)*

Based on the detection of so-called border vectors and the adaptation of these vectors by adjusting their positions for all input patterns

– Detection:

Applied in the way offered by BVDA

– Adaptation:

At least one border vector is ensured to be adapted in each repetition

* İ. Ersöz Kaya, T. Ibrikci, O. K. Ersoy, (2011), “Prediction of Disorder with New Computational Tool: BVDEA”, Expert Systems with Applications. 38(12): 14451-14459. (ISI), DOI: 10.1016/j.eswa.2011.04.160

[Equations: the winning border vector $b_w$ is updated from its previous value $b_w(t)$ by a step $\eta(t) \cdot (x_j - b_w(t))$, with separate cases for $y_j = y_{b_w}$ and $y_j \neq y_{b_w}$]

Evaluations on Parameters

BVDEA

– Learning rate and decay constant

– Stopping levels for rate/decay-constant pairs

BVDA

– Learning rate and decay constant

– Stopping levels for rate/decay-constant pairs

GRNN

– Sigma (σ)

LVQ

– Learning rate

– Number of codebook vectors (nC)

Main Testing Results

Methods Sens Spec Acc Mcc ProbEx

BVDEA 0.7964 0.7850 0.7907 0.5858 0.5814

LVQ 0.6981 0.8707 0.7844 0.5818 0.5688

GRNN 0.7263 0.8345 0.7804 0.5714 0.5608

BVDA 0.7309 0.7506 0.7408 0.4878 0.4815


Comparison on Main Testing

[Figure: three panels comparing BVDEA with DisPro, DISOPRED2, and DisPSSMP]

Conclusions

Many intrinsically disordered proteins play a key role in vital functions and also in some diseases.

Identification of the disordered regions is a demanding process for structure prediction and functional characterization of proteins.

BVDEA provides more accurate, faster, and more robust learning than the other methods, GRNN, LVQ, and BVDA.

As is evident from the comparison with existing tools, BVDEA can be suggested as an effective method for predicting disordered regions of proteins accurately, without either under-predicting or over-predicting the disorder.

The new method makes a significant contribution to predicting disordered and ordered regions of proteins.

Support Vector Machine

SVMs attempt to find a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized (Vapnik, 1995).

[Figure: two separating hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22]

› Find the hyperplane that maximizes the margin => B1 is better than B2

Non-Linear SVMs

The original input space can always be mapped to some higher-dimensional feature space where the training set is separable

Φ: x → φ(x)

Quadratic Optimization Problem

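Minimize

$\dfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i$

subject to

$w \cdot x_i + b \ge 1 - \xi_i$ if $y_i = 1$

$w \cdot x_i + b \le -1 + \xi_i$ if $y_i = -1$

$\xi_i \ge 0$

Solution

$L(w, b, \xi, \alpha, \mu) = \dfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \mu_i \xi_i$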
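To make the optimization problem concrete, the sketch below evaluates the standard soft-margin objective for a given hyperplane, taking each slack as the smallest value that satisfies its constraint. The function name `soft_margin_objective` and the toy data are assumptions for illustration; this evaluates the objective only and is not a solver.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for the soft-margin SVM primal problem.

    objective = 0.5 * ||w||^2 + C * sum(xi_i)
    xi_i      = max(0, 1 - y_i * (w . x_i + b)), the smallest feasible slack
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks

# Made-up 2D points: two lie outside the margin, one lies inside it.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
print(soft_margin_objective(w=[1.0, 0.0], b=0.0, X=X, y=y, C=1.0))
```

A larger C makes margin violations more expensive, trading a narrower margin for fewer violations.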

Performance of Feature Selection

         SVM_COD159  SVM_ERCOD159
Sens     0.819       0.831
Spec     0.638       0.662
Acc      0.641       0.664
Mcc      0.854       0.862
ProbEx   0.786       0.800

37 attributes were selected by ERGS

Prediction success increased with the reduced data set, ERCOD159


Conclusion

SVM was used with the modeled dataset that was constructed according to several physicochemical properties, evolutionary knowledge, and compositions of amino acids.

SVM_COD159 provides more accurate and robust learning as compared to eleven common tools without either under-predicting or over-predicting the disorder.

The most informative features for separating the disordered/ordered regions in proteins were determined by using the ERGS method.

SVM_ERCOD159 makes a significant contribution to predicting disordered and ordered regions of proteins.

CONSENSUAL CLASSIFICATION OF DRUG/NONDRUG COMPOUNDS FOR DRUG DESIGN - PhD Thesis

• Druglike selection

• Potential drug candidates

• To reduce time and cost

In Vivo → In Vitro → In Silico

Chemoinformatics & Bioinformatics

Chemoinformatics | Bioinformatics
Chemical data (small molecules) | Enzymes, genes, proteins, etc.
1960’s | 1990’s
User-pay, limited public access | Web-based, open access model
Funded by large companies (MDL, Beilstein, Sigma, CAS) | Funded by large government agencies (NCBI, EBI, NIH, GC)

Molecular Informatics

Druglike Concept

• Contains functional groups

• Similar physical properties to the known drugs

• ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity)

• Fail fast, fail cheap

• “Filters”, i.e. “criteria” or “rules”

Related Works

1998

– Ajay et al., 80% accuracy

– Sadowski & Kubinyi, 77% drug, 83% nondrug

2000

– Wagener et al., 17.4% error

– Frimurer et al., 88.0%

2001

– Muegge et al., 83.7% drug, 75% nondrug

2003

– Murcia-Soler et al., 76.36% drug, 70.15% nondrug

– Byvatov et al., 82% SVM, 80% ANN

2007

– Li et al., 92.73%

– Hutter et al., 71%

Data set

Therapeutic Categories  Murcia-Soler, 2003  Cherkasov, 2007
Drugs                   416                 1482
Nondrugs                225                 1202
Total                   641                 2684


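Adaptive General Regression Neural Networks (adGRNN)

A separate $\sigma_j$ is used for each descriptor:

$\hat{Y}(X) = \dfrac{\sum_{i=1}^{n} Y^i \exp\left(-\dfrac{1}{2} \sum_{j=1}^{d} \left(\dfrac{x_j - x_j^i}{\sigma_j}\right)^2\right)}{\sum_{i=1}^{n} \exp\left(-\dfrac{1}{2} \sum_{j=1}^{d} \left(\dfrac{x_j - x_j^i}{\sigma_j}\right)^2\right)}$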
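A GRNN prediction with a separate smoothing parameter per descriptor can be sketched as a kernel-weighted average of the training targets. The function name `adgrnn_predict` and the toy data are assumptions; how the per-descriptor sigmas are adapted is not described on the slides, so they are simply passed in here.

```python
import math

def adgrnn_predict(x, train_X, train_Y, sigmas):
    """Kernel-weighted average of training targets with one sigma per descriptor."""
    weights = []
    for t in train_X:
        # Squared distance, each descriptor scaled by its own sigma.
        d2 = sum(((x_j - t_j) / s_j) ** 2 for x_j, t_j, s_j in zip(x, t, sigmas))
        weights.append(math.exp(-0.5 * d2))
    return sum(w * y for w, y in zip(weights, train_Y)) / sum(weights)

# Made-up example: two training compounds with two descriptors each.
train_X = [[0.0, 0.0], [1.0, 1.0]]
train_Y = [0.0, 1.0]  # 0 = nondrug, 1 = drug
print(adgrnn_predict([0.5, 0.5], train_X, train_Y, sigmas=[1.0, 1.0]))
```

Thresholding the output (for example at 0.5) would turn the regression estimate into a drug/nondrug decision; that threshold is likewise an assumption.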

Experiment’s Outline

Murcia-Soler

– Model (a): 61 descriptors

– Model (b): all 2D MOE descriptors

– Model (c): Principal Components

60 PCs (full)

103 PCs (within groups)

Cherkasov

– Model with 2 classes: all 2D MOE descriptors

– Model with 3 classes: all 2D MOE descriptors

[Diagram: consensual classification pipeline]

Pre-preprocessing unit: raw data → normalization (z-scores) → normalized data

Preprocessing unit

1. Transform descriptor vectors with random uniform matrix

2. 1D median filtering

The preprocessing unit produces K transformed data sets, one for each of Classifier1, Classifier2, Classifier3, ..., ClassifierK, which give output1, output2, output3, ..., outputK.

Postprocessing: the outputs are combined into a consensual result by

– Genetic Algorithm

– Pseudoinverse

– Equal Weights
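Two of the simpler pieces of this pipeline, z-score normalization and consensus by equal weights, can be sketched as below. The function names and the toy scores are assumptions; the genetic-algorithm and pseudoinverse weightings are not shown.

```python
from statistics import mean, pstdev

def z_scores(column):
    """Normalize one descriptor column to zero mean and unit (population) variance."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd if sd else 0.0 for v in column]

def equal_weight_consensus(outputs):
    """Average the K classifier outputs per sample.

    outputs[k][i] is the score of classifier k for sample i.
    """
    return [mean(sample_scores) for sample_scores in zip(*outputs)]

# Made-up scores from K = 3 classifiers for 2 compounds.
outputs = [[0.9, 0.2],
           [0.7, 0.4],
           [0.8, 0.0]]
print(z_scores([1.0, 2.0, 3.0]))
print(equal_weight_consensus(outputs))
```

The other two combiners would replace the plain average with learned weights for the K outputs.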

Results of Murcia-Soler’s Dataset

• Original study of Murcia-Soler et al.: feedforward network, 76% drugs and 70% nondrugs

• The best individual results: GRNN 82.37%, adGRNN 82.99%, and SOGR 81.90%

[Figures: results on Murcia-Soler’s dataset; results on Cherkasov’s dataset]


Individual SOGR Results of 3-Class Cherkasov

[Figure: accuracy (%) from 60 to 80 against classifier numbers 1 to 20; series: antimicrobials, drugs, drug-likes]

Results of Cherkasov’s Dataset

Model with 3 classes: Antimicrobials = 1, Drugs = 2, Druglikes = 3

Conclusions

• Contributes automation tools for drug designers at the preprocessing stage of the design process

• A novel consensual classification approach

• A Genetic Algorithm was used for the first time to combine SOGR classifiers, and the consensus approach was applied for the first time to chemical data

PART II

DNA MOTIFS - PhD

DNA Motif Discovery (Pattern Recognition in DNA Sequences)

Outline

Introduction

Biological background of DNA motifs

Motif finding

– Existing algorithms

– Motif representation

Machine learning / Clustering for motif-finding

First clustering implementation impressions

Further ideas

– Improving first implementation

– Utilizing optimization techniques to support motif-finding: Particle Swarm Optimization (PSO), Genetic Algorithm (GA)

Conclusion

Introduction & Motivation

Details of the transcription process have received considerable attention since the process is an important step in protein synthesis which is vital for life.

Identification of transcription binding sites, which are essentially short DNA subsequences with variability, has been one of the major and challenging tasks for bioinformaticians.

Introduction & Motivation

These genomic patterns (motifs) reside on very long DNA sequences, which makes the task intractable for traditional computational methods.

Motifs are very short (generally 6-20 nucleotides long) and may have variations as a result of mutations, insertions, and deletions.


Introduction & Motivation

Many algorithms have been proposed in the literature, but none has reached perfect accuracy yet!

Thus, research on motif discovery is still an important task.

Biological background

The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS).

Regulation involves two types of fundamental components:

– short DNA regulatory elements (motifs)

– gene regulatory proteins that recognize and bind to them.

Synonyms: gene regulatory element, TF binding site, TF binding motif, cis-regulatory motif (element)

Regulation of Genes

[Figures: three panels labelled with DNA, promoter, gene, regulatory element, transcription factor (TF, a protein), RNA polymerase (a protein), and new protein]

WHAT IS A MOTIF ?

A subsequence (substring) that occurs in multiple sequences with a biological importance.

Motifs can be totally constant or they have variable elements.

DNA Motifs (regulatory elements)

– Binding sites for proteins

– Short sequences (5-25)

– Up to 1000 bp (or farther) from gene

– Inexactly repeating (overrepresented) patterns


GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA

TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA

CTGGGAGGTCCTCGGTTCAGAGTCACAGAGCAGATAATCA

TTAGAGGCACAATTGCTTGGGTGGTGCACAAAAAAACAAG

AACAGCCTTGGATTAGCTGCTGGGGGGGTGAGTGGTCCAC

ATCAGAATGGGTGGTCCATATATCCCAAAGAAGAGGGTAG

Transcription Factor Binding Sites (TFBS)

[Figure: the TF binding site located in each of the six sequences above]

Position: 123456789

TGGGTGGTC

TGGGTGGTA

TGGGAGGTC

TGGGTGGTG

TGAGTGGTC

TGGGTGGTC

DNA motif representation with consensus sequence:

TGGGTGGTN (Consensus sequence)

Biological background

Biological background

DNA motif representation with Position Weight Matrix (PWM):

$PWM = \log \dfrac{P(S \mid PFM)}{P(S \mid B)}$

Background DNA (B): A: 0.25, T: 0.25, G: 0.25, C: 0.25

PFM

A   .2   .7   .2   .2   .1   .3
C   .2   .1   .2   .4   .5   .4
G   .5   .1   .2   .2   .2   .2
T   .1   .1   .4   .1   .2   .1

Position Weight Matrix (PWM)

A  -0.3   1.4  -0.3  -0.3  -1.3   0.3
C  -0.3  -1.3  -0.3   0.6   1     0.6
G   1    -1.3  -0.3   0.3  -0.3  -0.3
T  -1.3  -1.3   0.6  -1.3  -0.3  -1.3
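As an illustration of the PFM-to-PWM step, the sketch below builds a position frequency matrix from the six TFBS instances listed earlier and converts it to log-odds scores against the uniform background. The base-2 logarithm is an assumption (it is consistent with the slide, where a frequency of .5 maps to a weight of 1), and the pseudocount and function names are also assumptions.

```python
import math

SITES = ["TGGGTGGTC", "TGGGTGGTA", "TGGGAGGTC",
         "TGGGTGGTG", "TGAGTGGTC", "TGGGTGGTC"]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def position_frequency_matrix(sites, pseudocount=0.0):
    """One dict of base frequencies per motif position."""
    total = len(sites) + 4 * pseudocount
    pfm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        pfm.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return pfm

def position_weight_matrix(pfm, background=BACKGROUND):
    """Log-odds of each base frequency against the background (base 2 assumed)."""
    return [{b: math.log2(col[b] / background[b]) for b in "ACGT"} for col in pfm]

# A pseudocount avoids log(0) for bases never seen at a position.
pfm = position_frequency_matrix(SITES, pseudocount=0.5)
pwm = position_weight_matrix(pfm)
for position, weights in enumerate(pwm, start=1):
    print(position, {b: round(w, 2) for b, w in weights.items()})
```

Positions where all six sites agree get a high weight for the conserved base, while the variable last position gets weights close to zero for several bases.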

Biological background

DNA motif representation with sequence logo:

Represents both base frequency and conservation at each position

Height of letter is proportional to the frequency of the base at that position

Height of stack is proportional to the conservation at that position

Motif finding

The Problem

– Given a collection of genes with common expression (co-expressed)

– Find the TF-binding sites (motifs)

Difficulties

– Motif pattern is unknown

– Motif locations are unknown

– Motifs can differ slightly from one gene to another

– How to discern it from “random” motifs?

Motif finding

Existing algorithms generally fall into two categories

– Probabilistic

MEME (1995), AlignACE (1999)

Try to optimize a Position Weight Matrix (PWM)

Fast

– Word enumerative exhaustive search

YMF (2000), Weeder (2004)

Slower but more reliable

Motif finding

Challenges that cause low prediction performance

– Low signal/noise (background) ratio

– Very long DNA sequences

– Low signal strength (weak conservation)

Solutions

– Using more biological information (e.g. phylogenetic footprints)

– Hybrid combinations of existing algorithms

– Ensembling existing algorithms

– Developing new algorithms (using machine learning methods, evolutionary algorithms, Genetic Algorithm, PSO, etc.)


Machine Learning/Clustering for motif-finding

Machine learning

– A relatively new, promising direction

Mahony et al. 2006: Self Organizing Maps of Position Weight Matrices

– Clustering as a local alignment tool

Self-Organizing Map

Fuzzy C-Means

K-Means

Gaussian Mixture Models

Machine Learning/Clustering for motif-finding

Clustering as a motif finding tool (The Steps of the process)

– Extracting inputs with sliding-windows

– Scoring inputs against Position weight matrices

– Associating inputs with clusters (PWMs) and

generating several local alignments

– Selecting statistically most significant alignments as

motif candidates

Machine Learning/Clustering for motif-finding

Extracting inputs with sliding-windows

A sequence of length N yields N − l + 1 windows of length l, i.e., inputs to the algorithm

Machine Learning/Clustering for motif-finding

Scoring inputs against Position weight matrices (1)

– PWMs are initialized randomly at the beginning

– PWMs are recalculated at each iteration of the algorithm

Machine Learning/Clustering for motif-finding

Scoring inputs against Position weight matrices (2)

     1    2    3    4    5    6    7    8    9
A  -10  -10  -14  -12  -10    5   -2  -10   -6
C    5  -10  -13  -13   -7  -15  -13    3   -4
G   -3  -14  -13  -11    5  -12  -13    2   -7
T   -5    5    5    5  -10   -9    5  -11    5

INPUT SEQ 1: C T T T G A T C T

5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43

INPUT SEQ 2: A C G T A C G T A

-10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83
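The window extraction and the scoring example above can be reproduced with a few lines. The names `windows` and `score` are hypothetical; the matrix is the one shown on the slide.

```python
# PWM from the slide: one row per base, one column per motif position.
PWM = {
    "A": [-10, -10, -14, -12, -10, 5, -2, -10, -6],
    "C": [5, -10, -13, -13, -7, -15, -13, 3, -4],
    "G": [-3, -14, -13, -11, 5, -12, -13, 2, -7],
    "T": [-5, 5, 5, 5, -10, -9, 5, -11, 5],
}

def windows(sequence, length):
    """All full-length sliding windows: N - length + 1 of them for a sequence of length N."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def score(window, pwm):
    """Sum the PWM entry of the observed base at every position."""
    return sum(pwm[base][i] for i, base in enumerate(window))

print(score("CTTTGATCT", PWM))   # input sequence 1 from the slide
print(score("ACGTACGTA", PWM))   # input sequence 2 from the slide

# Score every window of a longer, made-up sequence and keep the best one.
best = max(windows("GGACTTTGATCTAA", 9), key=lambda w: score(w, PWM))
print(best)
```

In the clustering setting each window would be scored against every PWM (cluster) and associated with the best-scoring ones.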

Machine Learning/Clustering for motif-finding

Associating inputs with clusters (PWMs) and

generating several local alignments

(a) a set of associated sequences, (b) the

probability matrix, (c) the background

model and (d) the resultant PWM.


Machine Learning/Clustering for motif-finding

Selecting statistically most significant alignments as motif candidates

– Rank all PWMs with Z-Score values

$z\text{-}score = \dfrac{O - E}{\sigma}$

O = the number of associated subsequences

E = the number of subsequences which coincide with the node by chance

σ = the standard deviation of the coincidence

First clustering implementation impressions

– Karabulut, M., İbrikci, T., “Fuzzy C-Means Based DNA Motif Discovery”, Lecture Notes in Computer Science, Vol. 5226, pp 189-195 (2008)

– İbrikci, T., Karabulut, M., “Employing Fuzzy C-Means For DNA Transcription Factor Binding Site Identification”, Journal of Circuits, Systems and Computers, Vol 19-1, pp 15-30 (2010)

First clustering implementation impressions

Some modifications are required to utilize FCM (Fuzzy C-Means) for the task

First, the classical FCM algorithm has two main steps:

– Membership calculation

$u_{ij} = \dfrac{\left(\dfrac{1}{d^2(x_j, c_i)}\right)^{1/(q-1)}}{\sum_{k=1}^{M} \left(\dfrac{1}{d^2(x_j, c_k)}\right)^{1/(q-1)}}$

– Cluster update

$c_i = \dfrac{\sum_{j=1}^{N} (u_{ij})^q \, x_j}{\sum_{j=1}^{N} (u_{ij})^q}$
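The two steps of classical FCM can be sketched for one-dimensional points as below. This is the textbook algorithm with squared Euclidean distance, not the DNA-specific variant discussed next; the function names, the small distance floor, and the toy data are assumptions.

```python
def memberships(points, centers, q=2.0):
    """u[i][j]: membership of point j in cluster i; each column sums to 1."""
    u = [[0.0] * len(points) for _ in centers]
    for j, x in enumerate(points):
        # Inverse squared distances, floored to avoid division by zero.
        inv = [(1.0 / max((x - c) ** 2, 1e-12)) ** (1.0 / (q - 1.0)) for c in centers]
        total = sum(inv)
        for i, value in enumerate(inv):
            u[i][j] = value / total
    return u

def update_centers(points, u, q=2.0):
    """Each center becomes the mean of all points weighted by membership^q."""
    return [sum((u_ij ** q) * x for u_ij, x in zip(row, points)) /
            sum(u_ij ** q for u_ij in row)
            for row in u]

# Made-up data: two obvious groups and two rough initial centers.
points = [0.0, 1.0, 9.0, 10.0]
centers = [2.0, 7.0]
for _ in range(10):
    u = memberships(points, centers)
    centers = update_centers(points, u)
print([round(c, 3) for c in centers])
```

The fuzzifier q controls how soft the assignment is; values closer to 1 approach hard K-Means assignments.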

First clustering implementation impressions

Classical FCM generally uses Euclidean distance

Euclidean distance does not fit DNA sequence-PWM scoring

A scoring function similar to the previously mentioned scoring scheme should be used

$D(x, m) = 1 \Big/ \sum_{i=0}^{l} \sum_{c \in \{A, C, G, T\}} e(x_i, c) \, m_{i,c}$

$e(x_i, c) = \begin{cases} 1 & x_i = c \\ 0 & x_i \neq c \end{cases}$

x = DNA sequence, m = PWM, l = length

First clustering implementation impressions

Additionally, updating clusters with the inputs is not a straightforward task, either.

The x terms in the cluster update formula should be replaced with the following R(x,c) function:

$R(x, c) = \sum_{i=0}^{l} \sum_{c \in \{A, C, G, T\}} eq(x_i, c)$

$eq(c_1, c_2) = \begin{cases} 1 & c_1 = c_2 \\ 0 & c_1 \neq c_2 \end{cases}$

x = DNA sequence, c = PWM (cluster)

First clustering implementation impressions

A sample dataset: genomic sequences from the organism Saccharomyces cerevisiae


First clustering implementation impressions

Performance Measures

First clustering implementation impressions

Comparison of the performances of FCM, MEME, and MDScan in terms of Matthews Correlation Coefficient (MCC)

First clustering implementation impressions

Visual results, and Sequence logo format

First clustering implementation impressions

FCM implementation conclusions

– Machine learning, especially clustering techniques, presents a promising direction for DNA motif discovery

– FCM has the potential to outperform the well-known MEME and MDScan

– Results are good for lower organisms, but the algorithm is not proven for higher organisms, for which many motif discovery algorithms also fail

– No perfect accuracy yet

Further ideas

How to improve the first clustering based

implementation (FCM) ?

– Utilizing some other clustering algorithms instead of

FCM

Self-Organizing Map (SOM)

K-Means

Gaussian Mixture Models / Expectation Maximization

– To take advantage of different clustering algorithms

for different datasets

Overall performances of the algorithms


Sequence logos of the known motifs and the predicted ones

Further ideas

How to improve the first clustering based

implementation (FCM) ?

– Another alternative approach:

Post-processing clustering results with an optimizer

– Methods

Particle Swarm Optimization (PSO)

Genetic Algorithm (GA)

– Problems:

Proper fitness function

Is “Information-content” sufficient ?
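As a concrete reference point for the fitness question, the information content of a PWM can be computed as below. This sketch uses the relative-entropy form against a uniform background, which is one common definition and an assumption here; the PWM layout (one dictionary of base probabilities per position) is also chosen only for illustration.

```python
import math


def information_content(pwm, background=None):
    """Sum over positions and bases of p * log2(p / background), in bits."""
    if background is None:
        # Assumed uniform background over the four bases.
        background = {base: 0.25 for base in "ACGT"}
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # the limit of p * log2(p) as p -> 0 is 0
                total += p * math.log2(p / background[base])
    return total
```

A fully conserved position contributes 2 bits and a uniform position contributes 0, so the measure rewards conservation but says nothing about how many sequences actually contain the motif, which is one reason it may not be sufficient on its own.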

Further ideas

How to improve the first clustering based implementation (FCM) ?

– An alternative approach:

Developing ensemble methods from existing clustering algorithms

Ensemble methods have been reported to be effective for the motif-finding task. Out of many relevant papers:

– Yanover et al, M are better than one: an ensemble-based motif finder and its application to regulatory element prediction (2009)

– Yang and Kihara, EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences (2006)

– Romer et al, WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches (2007)

Further ideas

How to improve the first clustering based

implementation (FCM) ?

– Another alternative approach:

Post-processing clustering results with an optimizer

– Methods

Particle Swarm Optimization (PSO)

Genetic Algorithm (GA)

Combining learning with an optimizer

Particle Swarm Optimization (PSO)

PSO is a robust stochastic optimization technique based on the movement and intelligence of swarms.

PSO applies the concept of social interaction to problem solving.

It was developed in 1995 by James Kennedy (social-psychologist) and Russell Eberhart (electrical engineer).

It uses a number of agents (particles) that constitute a swarm moving around in the search space looking for the best solution.

Each particle is treated as a point in an N-dimensional space which adjusts its “flying” according to its own flying experience as well as the flying experience of other particles.

• Each particle keeps track of its coordinates in the solution space which are associated with the best solution (fitness) that it has achieved so far. This value is called personal best, pbest.

• Another best value that is tracked by the PSO is the best value obtained so far by any particle in the neighborhood of that particle. This value is called gbest.

• The basic concept of PSO lies in accelerating each particle toward its pbest and the gbest locations, with a random weighted acceleration at each time step as shown in Fig.1
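The update described above can be sketched as a small minimiser in plain Python. This is an illustrative version only: it treats the whole swarm as one neighborhood, and the inertia weight `w`, the acceleration constants `c1` and `c2`, the bounds, and the function name are assumed values rather than settings taken from these slides.

```python
import random


def pso(objective, dim, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Minimise objective over a dim-dimensional space with a basic PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # New velocity: inertia + pull toward pbest + pull toward gbest.
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            val = objective(pos[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = pos[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[k][:], val
    return gbest, gbest_val
```

For motif finding, `objective` would be a (negated) fitness over PWM entries or motif start positions; choosing that function is the open problem noted under "Problems" earlier.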

Particle Swarm Optimization (PSO)


Fig.1 Concept of modification of a searching point by PSO (axes: x, y)

s^k: current searching point.

s^(k+1): modified searching point.

v^k: current velocity.

v^(k+1): modified velocity.

v_pbest: velocity based on pbest.

v_gbest: velocity based on gbest.

Particle Swarm Optimization (PSO)

PSO-Motif finding

In the literature, PSO is used for motif finding

– Optimization as a post-processing operation

– Standalone application

– Sample papers

Particle Swarm Optimisation for Protein Motif Discovery. In: Genetic Programming and Evolvable Machines, 5(2):203-214 [2004]

Improved Hidden Markov Model training for multiple sequence alignment by a particle swarm optimization-evolutionary algorithm hybrid. In: Biosystems, 72(1-2):5-17 [2003]

Identification of Transcription Factor Binding Sites Using Hybrid Particle Swarm Optimization. In: Proc. 10th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing (RSFDGrC 2005). Volume 3642 of LNCS, pp. 438-445 [2005]

Datasets                                   Methods
Motif width  Conservation  Dataset length  PSO (Best)*  PSO (B.Ring)  GAME  MEME  BioPros.
Short        High          Small           0.80         0.80          0.75  0.85  0.78
Short        Low           Small           0.54         0.50          0.30  0.39  0.39
Short        High          Large           0.84         0.84          0.83  0.83  0.76
Short        Low           Large           0.46         0.46          0.36  0.42  0.45
Long         High          Small           0.96         0.96          0.97  0.98  0.97
Long         Low           Small           0.85         0.85          0.82  0.88  0.83
Long         High          Large           0.98         0.98          0.98  0.98  0.96
Long         Low           Large           0.90         0.90          0.90  0.90  0.80
AVERAGE                                    0.82         0.79          0.79  0.78  0.74

Performance comparison of motif-finding tools for synthetic datasets

Performance of PSO per number of particles

Conclusion

Machine learning methods, specifically clustering algorithms, are efficient means for DNA motif discovery.

Clustering algorithms should be further studied to improve standalone DNA motif-finding performance.

Ensemble methods have been reported to be efficient for the DNA motif-finding task.

Optimizers such as PSO and GA may help machine learning methods

– As a post-processor

– Combined for learning

THANKS

Çukurova University Research Fund

TUBITAK – EEEAG- TBAG

for supporting the collaborations

ERASMUS

ICGEB - International Centre for Genetic Engineering and Biotechnology
