
Post on 31-Jan-2016

Page 1: An HV-SVM Classifier to Infer  TF-TF Interactions Using Protein Domains and GO Annotations

An HV-SVM Classifier to Infer TF-TF Interactions Using Protein Domains and GO Annotations

Xiao-Li Li
Institute for Infocomm Research, Singapore
Nanyang Technological University, Singapore

Joint work with Junxiang Lee (National University of Singapore), Bharadwaj Veeravalli (National University of Singapore) and See-Kiong Ng (Institute for Infocomm Research)


Outline

• 1. Introduction

• 2. Existing work

• 3. The proposed techniques

• 4. Experiments

• 5. Conclusions


1. Introduction

3 Key Bio-molecules in a cell

• DNA: The blueprint. Encodes all the information to produce an organism.

• RNA: The messenger. Transmits DNA information to the protein synthesis machinery.

• Proteins: The workers. Targets for most drugs.


The Central Dogma

DNA → (transcription) → RNA → (translation) → Protein

• Genes are transcribed to RNA: genetic information is transferred from DNA into RNA.

• RNA is translated to protein: cells build proteins through protein biosynthesis.

• Protein is the machinery of life: proteins function by interacting with other proteins (PPI) and with other biomolecules.


Genetic Information Flow in a cell

Transcription: DNA to RNA

Translation: RNA to Protein


Transcriptional Regulation

This slide is from Li Shen: CSB Tutorial, UCSD


Combinatorial Nature of TFs

• Recruiting RNA polymerase requires the cooperation of several different transcription factors (TFs):
– TFIID is first positioned at the TATA box of the DNA.
– TFIIA, TFIIB, etc. then need to attach to the DNA.
– Next, RNA polymerase binds.
– Other TFs complete the maturation of the transcription complex.


2. Existing Work: Computationally Inferring TF-TF Interactions

• These methods exploit the abundance of gene expression data to infer synergistic relationships between TFs.

• A pair of TFs (A, B) is considered to be interacting if the expression correlation score (ECS) of the genes bound by both A and B is significantly greater than that of any set of genes bound by either TF A or TF B alone:

ECS(A, B, G) >> max(ECS(A, G), ECS(B, G))

Pilpel (2001), Nature Genet; Bussemaker (2001), Nature Genet; Yu (2003), Trends Genet; Banerjee (2003), NAR


Existing Work (Cont.)

• These methods exploit the growing availability of whole genome sequences. Sophisticated sequence analysis methods are employed to discover the so-called interacting motif pairs in the DNA sequences.

• Two motifs are deemed interacting if their co-occurrence in the input DNA promoters is over-represented and the distance between the two motifs is significantly different from random expectation.

• The TFs binding to these motifs are then predicted to be interacting with each other.

Conlon (2003), PNAS; Das (2004), PNAS; Yu (2006), NAR


Existing Work (Cont.)

• These methods exploit the growing availability of large-scale protein-protein interaction (PPI) networks.

• The working hypothesis here is that proteins that are close to each other in the PPI networks are more likely to be co-regulated by the same set of TFs.

• TFs A and B are considered to be interacting when the median distance between protein pairs controlled by both A and B is significantly shorter than that between protein pairs controlled by A only or B only.

Nagamine (2005), NAR


Existing Work (Cont.)

• Machine learning algorithms have been used to exploit the abundant PPI data now available, building models that classify novel protein interactions.
– Feature representation
– SVM, NB, etc.

• Most of these works have focused on predicting generic PPIs. It would be interesting if we could also employ a machine learning approach for predicting the more biologically specific TF-TF interactions.

Bock (2001), Bioinformatics; Wu (2005), NAR; Ben-Hur (2005), Bioinformatics


3. The Proposed Technique

• 3.1 TF characterization

• 3.2 SVM classification

• 3.3 Kernel design
– Vertical Pairwise Kernel
– Horizontal Pairwise Kernel
– Combined kernel

• 3.4 Genetic algorithm (GA)


3.1 TF characterization

• We characterize TFs according to their protein domains and biological attributes, which include molecular functions, biological processes, as well as cellular components.


Protein Domains

• They are evolutionarily conserved modules of amino acid sub-sequences, i.e. “building blocks” for constructing different proteins.

• The existence of certain domains in the TFs could indicate the propensity for the TFs to interact (DDI).


GO Annotations

• GO provides a common vocabulary that describes the biological processes, molecular functions and cellular components for many bio-molecules.

• Physical interactions between TFs require that they exist in close proximity in a cell.

• Biologically, TFs that have the same molecular functions, are involved in the same biological processes, and are located in the same cellular components are more likely to interact.
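As a minimal sketch of this characterization, each TF can be encoded as four binary vectors, one per feature type (protein domains and the three GO aspects). The vocabularies, the annotation assigned to the example TF and the `encode_tf` helper are hypothetical illustrations, not data or code from this work.

```python
# Hypothetical vocabularies; a real one would list every Pfam domain
# and GO term observed across the TFs in the dataset.
VOCAB = {
    "d": ["PF00010", "PF00170", "PF00046"],  # protein domains
    "p": ["GO:0006355", "GO:0006366"],       # biological processes
    "f": ["GO:0003677", "GO:0003700"],       # molecular functions
    "c": ["GO:0005634", "GO:0005737"],       # cellular components
}

def encode_tf(annotations):
    """Turn {feature type: set of terms} into {feature type: 0/1 vector}."""
    return {
        kind: [1 if term in annotations.get(kind, set()) else 0 for term in terms]
        for kind, terms in VOCAB.items()
    }

# Illustrative TF with made-up annotations.
tf_a = encode_tf({
    "d": {"PF00010"},
    "p": {"GO:0006355"},
    "f": {"GO:0003677", "GO:0003700"},
    "c": {"GO:0005634"},
})
print(tf_a)
```

With this encoding, the inner product of two TFs' vectors for one feature type simply counts the terms they share.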


3.2 Support Vector Machines

• SVM is related to statistical learning theory, and it was first introduced in 1992.

V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

• Consider a binary classification problem (Class 1 vs. Class 2)

• Many decision boundaries!

• Are all decision boundaries equally good?


Large-margin Decision Boundary

• The decision boundary should be as far away as possible from the data of both classes

• We should maximize the margin, m

[Figure: Class 1 and Class 2 points separated by a decision boundary with margin m.]


Finding the Decision Boundary

• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i

• The decision boundary should classify all points correctly

• The decision boundary can be found by solving the following constrained optimization problem
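In its standard hard-margin form, this problem can be written as follows. The symbols w (normal vector of the boundary) and b (its offset) are assumed here, since they are not defined elsewhere in these slides.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under this scaling the margin equals 2/‖w‖, so minimizing ‖w‖ is the same as maximizing the margin.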


The Dual Problem

• This is a quadratic programming (QP) problem

• Many approaches, such as LOQO and CPLEX, have been proposed, and a global maximum of α_i can always be found
– see http://www.numerical.rl.ac.uk/qp/qp.html

• Inner product: a similarity measure between the objects.

• Different kernels can be designed, e.g. using domain knowledge to better address the problem!
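For reference, the textbook dual of the hard-margin SVM, written with Lagrange multipliers α_i, is sketched below. It is assumed, not confirmed, to be the form these bullets refer to.

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```

The data enter only through the inner products x_i^T x_j, which is why they can be replaced by a kernel K(x_i, x_j) designed for the problem at hand.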


A Geometrical Interpretation

[Figure: Class 1 and Class 2 points labelled with their multipliers: α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0.]


3.3 Kernel design: SVM Default Linear Kernel Kd

• Need to define a pairwise kernel (over TF pairs, not single TFs)

• A simplistic approach is to arrange the TFs in an alphabetically ordered list. Hence, given TF pairs (A, B) and (C, D), where A<B and C<D, the pairwise kernel is defined as:

$$K_d((A,B),(C,D)) = \Phi_{A,C} + \Phi_{B,D}$$

$$\Phi_{X,Y} = w_d\, d_X^{T} d_Y + w_p\, p_X^{T} p_Y + w_f\, f_X^{T} f_Y + w_c\, c_X^{T} c_Y$$
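A minimal sketch of the default kernel on ordered pairs, assuming each TF is a dictionary of four binary vectors (d, p, f, c) and that the two single-TF similarities are added, as a linear kernel on concatenated vectors would do. The helper names and the toy TFs are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x, y, w):
    """Weighted similarity between two single TFs over the four feature types."""
    return sum(w[k] * dot(x[k], y[k]) for k in ("d", "p", "f", "c"))

def k_default(pair1, pair2, w):
    """Default kernel: first TF against first TF, second against second."""
    (a, b), (c, d) = pair1, pair2
    return phi(a, c, w) + phi(b, d, w)

# Toy data (hypothetical): equal feature weights and two small TFs.
w = {"d": 1.0, "p": 1.0, "f": 1.0, "c": 1.0}
A = {"d": [1, 0], "p": [1, 1], "f": [0, 1], "c": [1]}
B = {"d": [0, 1], "p": [0, 1], "f": [1, 1], "c": [1]}

print(k_default((A, B), (A, B), w))  # pair compared with itself
print(k_default((A, B), (B, A), w))  # same pair, listed in the other order
```

On these toy values the two calls return different numbers, so the result depends on how the TFs inside a pair happen to be ordered.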


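What is wrong with the default kernel?

Vertical Pairwise Kernel Kv

$K_v((A,B),(C,D))$ is built from two groups of single-TF similarities: the direct alignment $(\Phi_{A,C}, \Phi_{B,D})$ and the swapped alignment $(\Phi_{A,D}, \Phi_{B,C})$.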
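A minimal sketch of a vertical pairwise kernel follows. How the four Φ terms are combined is an assumption here: the direct and swapped alignments are each summed and then added together. The helper names and toy TFs are hypothetical as well.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x, y, w):
    """Weighted similarity between two single TFs over the four feature types."""
    return sum(w[k] * dot(x[k], y[k]) for k in ("d", "p", "f", "c"))

def k_vertical(pair1, pair2, w):
    """Assumed form: (direct alignment) + (swapped alignment)."""
    (a, b), (c, d) = pair1, pair2
    direct = phi(a, c, w) + phi(b, d, w)   # A with C, B with D
    swapped = phi(a, d, w) + phi(b, c, w)  # A with D, B with C
    return direct + swapped

# Toy data (hypothetical).
w = {"d": 1.0, "p": 1.0, "f": 1.0, "c": 1.0}
A = {"d": [1, 0], "p": [1, 1], "f": [0, 1], "c": [1]}
B = {"d": [0, 1], "p": [0, 1], "f": [1, 1], "c": [1]}

print(k_vertical((A, B), (A, B), w))
print(k_vertical((A, B), (B, A), w))
```

Under this assumed form the value is the same whichever way round the TFs of a pair are listed, because both alignments are always included.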


Proposed Horizontal Pairwise Kernel Kh

• A pairwise kernel represents a TF pair explicitly as one object and measures the similarity directly between two pairs.

[Figure: TF pair AB is described by the domain pairs in AB and by the common processes, common functions and common components in AB; TF pair CD is described in the same way.]

$$K_h((A,B),(C,D)) = w_d\, d_{AB}^{T} d_{CD} + w_p\, p_{AB}^{T} p_{CD} + w_f\, f_{AB}^{T} f_{CD} + w_c\, c_{AB}^{T} c_{CD}$$
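A minimal sketch of the horizontal kernel, assuming each TF is a dictionary of term sets and each pair is described by its domain pairs (treated as unordered, which is an assumption) and its shared GO terms. For binary indicator vectors, the inner product equals the size of the set intersection, which is what the code counts. All names and toy annotations are hypothetical.

```python
from itertools import product

def pair_features(x, y):
    """Describe the TF pair (x, y) as a single object."""
    return {
        # every combination of one domain from x and one from y
        "d": {frozenset((dx, dy)) for dx, dy in product(x["d"], y["d"])},
        "p": x["p"] & y["p"],  # common biological processes
        "f": x["f"] & y["f"],  # common molecular functions
        "c": x["c"] & y["c"],  # common cellular components
    }

def k_horizontal(pair1, pair2, w):
    """Weighted count of pair-level features shared by the two TF pairs."""
    f1, f2 = pair_features(*pair1), pair_features(*pair2)
    return sum(w[k] * len(f1[k] & f2[k]) for k in ("d", "p", "f", "c"))

# Toy annotations (hypothetical labels).
A = {"d": {"HLH"}, "p": {"p1", "p2"}, "f": {"f1"}, "c": {"nucleus"}}
B = {"d": {"bZIP"}, "p": {"p1"}, "f": {"f1", "f2"}, "c": {"nucleus"}}
C = {"d": {"bZIP"}, "p": {"p1", "p3"}, "f": {"f1"}, "c": {"nucleus"}}
D = {"d": {"HLH"}, "p": {"p1"}, "f": {"f1"}, "c": {"nucleus", "cytoplasm"}}
w = {"d": 1.0, "p": 1.0, "f": 1.0, "c": 1.0}

print(k_horizontal((A, B), (C, D), w))
```

Because the pair is summarized before any comparison, swapping the two TFs inside a pair leaves the pair-level features, and hence the kernel value, unchanged in this sketch.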


Combining to build HV kernel Khv and learning feature weights

• Combining the vertical and horizontal kernels gives the following HV kernel:

$$K_{HV}((A,B),(C,D)) = w_v K_v((A,B),(C,D)) + w_h K_h((A,B),(C,D))$$

where Kh and Kv refer to the horizontal and vertical kernels while wh and wv are their respective kernel weights.

• How to decide the kernel weights wh and wv and the feature weights?
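A minimal sketch of the combination step, assuming the vertical and horizontal kernels have already been evaluated into Gram matrices over the same list of TF pairs. The matrices and weights below are made-up numbers, and `k_hv` is a hypothetical helper.

```python
def k_hv(gram_v, gram_h, w_v, w_h):
    """Entry-wise weighted sum of two precomputed Gram matrices."""
    return [
        [w_v * v + w_h * h for v, h in zip(row_v, row_h)]
        for row_v, row_h in zip(gram_v, gram_h)
    ]

# Made-up 2x2 Gram matrices for two TF pairs.
K_v = [[16.0, 12.0], [12.0, 20.0]]
K_h = [[4.0, 1.0], [1.0, 3.0]]

K_hv = k_hv(K_v, K_h, w_v=0.5, w_h=0.5)
print(K_hv)
```

A non-negative weighted sum of two valid kernels is again a valid kernel, so a matrix built this way can be handed to any SVM implementation that accepts a precomputed kernel.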


3.4 Genetic algorithm optimizes the kernel and feature weights

1) Randomly generate an initial population of k chromosomes;

2) Evaluate the fitness, f, of each individual;

3) Form k/2 random pairs from the population and select the fitter individual of each pair as parents to breed the next generation. Repeat to obtain a total of k parents;

4) Breed a new generation of offspring through crossover of the k parents;

5) Perform random mutation on the newly created offspring with a mutation rate r;

6) Repeat Steps 2-5 for n generations;

7) Select the fittest chromosome from the n generations as the solution to the search problem.
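The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: a chromosome is taken to be a list of six weights in [0, 1) (two kernel weights and four feature weights), crossover is one-point, mutation resamples a gene, and the fitness is a toy function standing in for the classifier's performance. All function names and default parameters are hypothetical.

```python
import random

def tournament_round(population, fitness, rng):
    """Step 3: form k/2 random pairs and keep the fitter one of each pair."""
    shuffled = rng.sample(population, len(population))
    return [max(shuffled[i], shuffled[i + 1], key=fitness)
            for i in range(0, len(shuffled) - 1, 2)]

def crossover(p1, p2, rng):
    """Step 4: one-point crossover producing two offspring."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, rate, rng):
    """Step 5: resample each gene with probability `rate`."""
    return [rng.random() if rng.random() < rate else gene for gene in chrom]

def genetic_search(fitness, n_genes=6, k=20, n_generations=50, rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of k chromosomes (k assumed even).
    population = [[rng.random() for _ in range(n_genes)] for _ in range(k)]
    best = max(population, key=fitness)                     # step 2
    for _ in range(n_generations):                          # step 6
        # Two tournament rounds give k/2 + k/2 = k parents.
        parents = (tournament_round(population, fitness, rng)
                   + tournament_round(population, fitness, rng))
        offspring = []
        for i in range(0, len(parents) - 1, 2):
            offspring.extend(crossover(parents[i], parents[i + 1], rng))
        population = [mutate(child, rate, rng) for child in offspring]
        best = max(population + [best], key=fitness)        # step 7, tracked as we go
    return best

def toy_fitness(weights):
    """Stand-in fitness: highest when every weight equals 0.5."""
    return -sum((w - 0.5) ** 2 for w in weights)

best_weights = genetic_search(toy_fitness)
print(best_weights)
```

In practice the fitness call would train and evaluate the SVM with the candidate weights, which makes each generation far more expensive than in this toy run.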


4. Experimental Results

• We collected TF-TF interaction data for Homo sapiens (Human) and Mus musculus (Mouse) from various databases, including
– IntAct (http://www.ebi.ac.uk/intact/index.html),

– GRID (http://biodata.mshri.on.ca/grid/servlet/Index/),

– MINT (http://mint.bio.uniroma2.it/mint/),

– BIND (http://bind.ca/),

– DIP (http://dip.doe-mbi.ucla.edu/).

• In all, a total of 3224 TF-TF interaction pairs among 619 TFs were extracted from these public databases.


Positive and Negative Dataset

• The positive dataset consists of the 3224 extracted TF-TF interaction pairs.

• The negative dataset, of similar size, was constructed by randomly pairing TFs such that each pair does not already exist in the positive dataset.

• Five-fold cross validation was performed to evaluate the results.
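A minimal sketch of how such a negative set and a five-fold split could be produced. The TF names, the helper functions and the seed are hypothetical, and the sampler assumes enough non-interacting pairs exist to reach the requested size.

```python
import random

def sample_negatives(tfs, positives, n, seed=0):
    """Randomly pair TFs, skipping pairs already in the positive set."""
    rng = random.Random(seed)
    known = {frozenset(p) for p in positives}
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(tfs, 2)   # two distinct TFs
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)     # a set, so duplicates are ignored
    return sorted(tuple(sorted(p)) for p in negatives)

def five_fold_indices(n_items, seed=0):
    """Shuffle item indices and deal them into five folds."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    return [order[i::5] for i in range(5)]

# Toy example with made-up TF names.
tfs = ["TF1", "TF2", "TF3", "TF4", "TF5", "TF6"]
positives = [("TF1", "TF2"), ("TF2", "TF3"), ("TF3", "TF4")]
negatives = sample_negatives(tfs, positives, n=len(positives))
folds = five_fold_indices(len(positives) + len(negatives))
print(negatives)
print(folds)
```

Each fold serves once as the test set while the other four are used for training.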


Comparison between Feature Combinations

• We conducted an experiment to determine the effectiveness of the different features in predicting TF-TF interactions.

• All 15 combinations of the 4 features (D: domains, P: biological processes, F: molecular functions, C: cellular components) were tested over different kernels.
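The count of 15 is every non-empty subset of the four feature types (2^4 - 1). A short sketch that enumerates them, using the single-letter codes from the bullet above:

```python
from itertools import combinations

FEATURES = "DPFC"  # domains, processes, functions, components

# All non-empty subsets, from single features up to all four together.
feature_sets = ["".join(c)
                for r in range(1, len(FEATURES) + 1)
                for c in combinations(FEATURES, r)]

print(len(feature_sets))
print(feature_sets)
```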


Performance of default kernel Kd for different feature combinations

[Figure: bar chart of F-measure, Accuracy and AUC (y-axis: Performance, 60 to 90) for each feature combination under the default SVM kernel: D, P, F, C, DP, DF, DC, PF, PC, FC, DPF, DPC, DFC, PFC, DPFC.]


Performance of different kernels using all features with equal weights.

[Figure: bar chart of F-measure, Accuracy, AUC, Precision and Recall (y-axis: Performance, 65 to 95) for kernels Kd, Kv, Kh and Kh+v.]


Optimization of Kernel and Feature Weights

[Figure: Weights Optimization Using GA. Max, median and min F-measure (y-axis, 83.00 to 86.00) against generation (x-axis, 0 to 50).]

Applying our optimized HV-SVM classifier with kernel Khvo on the test set achieved an 85.7% F-measure, which is 1.0% higher than with kernel Khv and 2.5% higher than with the randomly assigned weights of the initial population.


Performance Comparison of Various Classifiers

Classifier          AUC     Accuracy   F-Measure
SVM with Kd         88.98   81.61      81.75
SVM with Kv         89.98   82.57      82.71
SVM with Kh         84.99   76.77      74.13
HV-SVM with Khv     91.80   84.77      84.68
HV-SVM with Khvo    92.76   85.24      85.70
Naïve Bayes         85.23   78.88      79.49


Conclusions

• To understand the complex mechanisms behind transcription regulation, it is imperative to unravel the coordinated interactions of the TFs.

• We characterized the TFs using Pfam domains and GO annotations. Our results show that integrating multiple sources of biological evidence improves the prediction of TF-TF interactions.


Conclusions (Cont.)

• We specifically designed two novel pairwise kernels for predicting TF-TF interactions
– The vertical pairwise kernel measures similarity across individual TFs between two TF pairs
– The horizontal pairwise kernel considers similarity between two TF pairs by measuring the similarity between the feature pairs of the two sets of TFs.
– A genetic algorithm was then employed to learn the kernel and feature weights of the kernel combination to give the best results.


Future Work

• A future area of work would be to incorporate the current three classes of TF-TF interaction prediction techniques, namely gene expression correlation based techniques, interacting motif based techniques and PPI network based techniques, into our HV-SVM machine learning approach for even better predictions.


Thank you for your attention!