
In silico virtual screening for drug discovery

Jean-Philippe Vert

Mines ParisTech - INSERM - Institut Curie

"Mathematics and Industry" workshop, IHES, Bures-sur-Yvette, France, April 23-25, 2009

Jean-Philippe Vert (ParisTech) In silico virtual screening IHES 1 / 23

Drug discovery

A long, expensive and risky process
On average 15 years and $800 million
High attrition rate: for 10,000 molecules tested, 10 make it to clinical trials, 1 to the market.
>70% of the costs are wasted on failures


Computational approaches

The use of computers and computational methods permeates all aspects of drug discovery today, in particular for:

Target identification
Structure prediction, virtual screening (docking)
Prediction of drug-likeness of compounds


Collaboration with the industry

Practically
Direct contracting (leveraged through Institut Carnot M.I.N.E.S.)
European projects
Pôles de compétitivité

Features
Large pharmaceutical or small biotech companies
Shared PI or subcontracting
Culture shock: math/computer science vs. biology/chemistry


Virtual screening and predictive models

Objective
Build models to predict biochemical properties Y of small molecules from their structures X, using a training set of (X, Y) pairs.

Structures X

C15H14ClN3O3

Properties Y
binding to a therapeutic target,
pharmacokinetics (ADME),
toxicity...


Example: ligand-based virtual screening

[Figure: six example molecules, three labeled active and three labeled inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

Formalization

The problem
Given a set of training instances $(x_1, y_1), \ldots, (x_n, y_n)$, where the $x_i$'s are graphs and the $y_i$'s are continuous or discrete variables of interest,
estimate a function

$$y = f(x)$$

where $x$ is any graph to be labeled.
This is a classical regression or pattern recognition problem over the set of graphs.


Pattern recognition


Classical approaches

Two steps
1. Map each molecule to a vector of fixed dimension using molecular descriptors
   Global properties of the molecules (mass, logP...)
   2D and 3D descriptors (substructures, fragments, ...)
2. Apply an algorithm for regression or pattern recognition.
   PLS, ANN, ...
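The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptors are limited to element counts and molecular mass computed from a molecular formula (logP and the 2D/3D descriptors are not computed), the helper names `atom_counts`, `descriptor_vector` and `nearest_neighbor_predict` are hypothetical, a 1-nearest-neighbor rule stands in for the PLS or ANN learners named on the slide, and the training labels are invented.

```python
import re

# Standard atomic masses (g/mol) for the few elements used in this sketch.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}
ELEMENTS = sorted(ATOMIC_MASS)  # fixed order -> fixed vector dimension


def atom_counts(formula):
    """Parse a simple molecular formula such as 'C15H14ClN3O3' (no parentheses)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def descriptor_vector(formula):
    """Step 1: map a molecule to a fixed-length vector (element counts + mass)."""
    counts = atom_counts(formula)
    mass = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return [counts.get(s, 0) for s in ELEMENTS] + [mass]


def nearest_neighbor_predict(train, query):
    """Step 2: label a query vector with the label of its closest training vector."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda pair: sq_dist(pair[0], query))[1]


# Toy training set; the labels are invented for illustration only.
training = [
    (descriptor_vector("C15H14ClN3O3"), "active"),
    (descriptor_vector("C2H6O"), "inactive"),
]
x = descriptor_vector("C15H14ClN3O3")
prediction = nearest_neighbor_predict(training, descriptor_vector("C3H8O"))
```

Every molecule ends up in the same six-dimensional space, whatever its size, which is what lets a generic vector-based learner be applied in step 2.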

Example: 2D structural keys

[Figure: example molecule annotated with 2D structural keys.]


Which descriptors?

[Figure: example molecule with atom labels (O, N).]

Difficulties
Many descriptors are needed to characterize various features (in particular for 2D and 3D descriptors)
But too many descriptors are harmful for memory storage, computation speed, statistical estimation


Kernels

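Definition
Let $\Phi(x) = (\Phi_1(x), \ldots, \Phi_p(x))$ be a vector representation of the molecule $x$.
The kernel between two molecules is defined by:

$$K(x, x') = \Phi(x)^\top \Phi(x') = \sum_{i=1}^{p} \Phi_i(x)\,\Phi_i(x') .$$

[Figure: the map φ from the set of molecules X to the Hilbert space H.]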


The kernel trick

[Figure: the map φ from X to H.]

The trick
1. Any positive definite function is a valid kernel (i.e., an inner product after mapping the molecules to some Hilbert space), e.g.,

$$K(x, x') = \exp\left(-\gamma \| x - x' \|^2\right).$$

2. Many linear algorithms for regression or pattern recognition can be expressed only in terms of kernels.
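Both points of the trick can be illustrated with a small sketch. The Gaussian kernel is the one written above; the learning algorithm is a perceptron in dual form, chosen here only because it is short (the slides do not mention it). The function names and the XOR toy data are assumptions of this example.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), a positive definite kernel on vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def train_kernel_perceptron(X, y, kernel, max_epochs=100):
    """Perceptron in dual form: the data are accessed only through kernel values."""
    n = len(X)
    gram = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n  # alpha[i] = number of mistakes made on example i
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * gram[j][i] for j in range(n))
            if y[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(alpha, X, y, kernel, x_new):
    """Sign of the kernel expansion sum_i alpha_i * y_i * K(x_i, x_new)."""
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    return 1 if score > 0 else -1


# XOR labels: not linearly separable in the plane, separable after the Gaussian embedding.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(X, y, gaussian_kernel)
predictions = [predict(alpha, X, y, gaussian_kernel, x) for x in X]
```

Nothing in the training or prediction code touches the coordinates of the points directly, so swapping `gaussian_kernel` for a kernel defined on molecules would leave the algorithm unchanged.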


Example: Support Vector Machine

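[Figure: illustration of the map φ.]

$$\text{minimize} \quad \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i$$

$$\text{subject to} \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 .$$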


Making kernels for molecules

Strategy 1: use well-known molecular descriptors to represent molecules $m$ as vectors $\Phi(m)$, and then use kernels for vectors, e.g.:

$$K(m_1, m_2) = \Phi(m_1)^\top \Phi(m_2).$$

Strategy 2: invent new kernels to do things you cannot do with strategy 1, such as using an infinite number of descriptors. We will now see two examples of this strategy, extending 2D and 3D molecular descriptors.

[Figure: the map φ from X to H.]


Example: 2D fragment kernel

[Figure: a molecule and the labeled fragments (sequences of atoms such as C, N, O) extracted from it.]

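$\phi_d(x)$ is the vector of counts of all fragments of length $d$:

$$\phi_1(x) = (\#(\text{C}), \#(\text{O}), \#(\text{N}), \ldots)^\top$$

$$\phi_2(x) = (\#(\text{C-C}), \#(\text{C=O}), \#(\text{C-N}), \ldots)^\top \quad \text{etc.}$$

The 2D fragment kernel is defined, for $\lambda < 1$, by

$$K_{\text{fragment}}(x, x') = \sum_{d=1}^{\infty} r(\lambda)\, \phi_d(x)^\top \phi_d(x') .$$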
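As a rough illustration of the definition above, the fragment counts can be computed explicitly for small molecules and short fragments. The sketch makes several assumptions that are not in the slides: fragments are taken to be label sequences of walks on the heavy-atom graph, bond orders are ignored, the weight is set to λ^d in place of the unspecified r(λ), and the infinite sum is truncated at `max_length`. The names `walk_label_counts` and `fragment_kernel` are hypothetical.

```python
from collections import Counter


def walk_label_counts(labels, neighbors, d):
    """phi_d: count the label sequences of all walks visiting d atoms.

    Walks are directed, so C-O and O-C are counted as two distinct sequences.
    """
    # Each entry maps (label sequence, end atom) -> number of walks.
    walks = Counter({((labels[v],), v): 1 for v in range(len(labels))})
    for _ in range(d - 1):
        longer = Counter()
        for (seq, end), count in walks.items():
            for nxt in neighbors[end]:
                longer[(seq + (labels[nxt],), nxt)] += count
        walks = longer
    phi = Counter()
    for (seq, _end), count in walks.items():
        phi[seq] += count
    return phi


def fragment_kernel(mol1, mol2, lam=0.1, max_length=6):
    """Truncated sum over d of lam**d * <phi_d(mol1), phi_d(mol2)>."""
    total = 0.0
    for d in range(1, max_length + 1):
        phi1 = walk_label_counts(*mol1, d)
        phi2 = walk_label_counts(*mol2, d)
        total += lam ** d * sum(c * phi2[seq] for seq, c in phi1.items())
    return total


# Heavy-atom graphs (hydrogens and bond orders omitted): (atom labels, adjacency lists).
ethanol = (["C", "C", "O"], [[1], [0, 2], [1]])
methylamine = (["C", "N"], [[1], [0]])

phi2_ethanol = walk_label_counts(*ethanol, 2)
k_cross = fragment_kernel(ethanol, methylamine)
k_self = fragment_kernel(ethanol, ethanol, max_length=3)
```

The explicit enumeration grows quickly with d, which is why a direct computation like this one is only practical for short fragments.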


2D kernel computation trick

$K_{\text{fragment}}$ can be computed efficiently for various weights $r(\lambda)$ although the feature space has infinite dimension.
Rephrase the kernel computation as that of counting the number of walks on a graph (the product graph)

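[Figure: two labeled graphs G1 (vertices 1 to 4) and G2 (vertices a to e) and their product graph G1 x G2, whose vertices are pairs such as 1a, 1b, 2a, 2b, 3c.]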

The infinite counting can be factorized

$$\lambda A + \lambda^2 A^2 + \lambda^3 A^3 + \ldots = (I - \lambda A)^{-1} - I .$$
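The factorization can be checked numerically on a tiny example. In the sketch below, A is the adjacency matrix of the product graph, the number of common walks with d edges is 1ᵀAᵈ1, and the closed form is compared with an explicitly truncated series. The product-graph construction (pairs of atoms with identical labels, adjacent when both components are adjacent), the geometric weights λ^d, and all function names are assumptions of this example; bond labels are ignored.

```python
def product_graph(mol1, mol2):
    """Adjacency matrix of the product graph of two labeled graphs.

    Vertices are pairs (i, j) of atoms with identical labels; two pairs are
    adjacent when both components are adjacent in their own molecule.
    """
    (labels1, nb1), (labels2, nb2) = mol1, mol2
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    index = {p: k for k, p in enumerate(pairs)}
    A = [[0.0] * len(pairs) for _ in pairs]
    for (i, j), k in index.items():
        for i2 in nb1[i]:
            for j2 in nb2[j]:
                if (i2, j2) in index:
                    A[k][index[(i2, j2)]] = 1.0
    return A


def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[r]] for r, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))) / aug[r][r]
    return z


def walk_kernel_closed_form(A, lam):
    """1^T ((I - lam*A)^{-1} - I) 1, i.e. the sum over d >= 1 of lam^d * 1^T A^d 1."""
    n = len(A)
    M = [[(1.0 if r == c else 0.0) - lam * A[r][c] for c in range(n)] for r in range(n)]
    z = solve(M, [1.0] * n)  # z = (I - lam*A)^{-1} 1
    return sum(z) - n


def walk_kernel_series(A, lam, terms=60):
    """Same quantity by explicit summation of the truncated power series."""
    n = len(A)
    v, total = [1.0] * n, 0.0
    for d in range(1, terms + 1):
        v = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]  # v = A^d 1
        total += lam ** d * sum(v)
    return total


# Heavy-atom graph of ethanol: (atom labels, adjacency lists).
ethanol = (["C", "C", "O"], [[1], [0, 2], [1]])
A = product_graph(ethanol, ethanol)
closed = walk_kernel_closed_form(A, 0.1)
series = walk_kernel_series(A, 0.1)
```

The series only converges when λ is smaller than the inverse of the largest eigenvalue of A; the value 0.1 used here is safely inside that range for this graph. The closed form needs one linear solve, whatever the length of the walks being counted.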


Experiments

MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity / no mutagenic activity, assayed in Salmonella typhimurium.
188 compounds: 125 + / 63 -

Results
10-fold cross-validation accuracy

Method      Accuracy
Progol1     81.4%
2D kernel   91.2%


Example: 2D subtree kernel

[Figure: a molecular graph and examples of subtree patterns (trees of atoms labeled C, N, O) extracted from it.]


2D subtree vs fragment kernels (Mahé and V., 2007)

[Figure: AUC (axis from 70 to 80) of the Walks and Subtrees kernels on each cell line (CCRF-CEM, HL-60(TB), K-562, MOLT-4, RPMI-8226, SR, A549/ATCC, ...).]

Screening of inhibitors for 60 cancer cell lines (from Mahé and V., 2008)


Example: 3D pharmacophore kernel (Mahé et al., 2005)

[Figure: two molecules, each with a three-point pharmacophore whose sides have lengths d1, d2, d3.]

$$K(x, y) = \sum_{p_x \in P(x)} \sum_{p_y \in P(y)} \exp\left(-\gamma\, d(p_x, p_y)\right).$$

Results (accuracy)

Kernel               BZR    COX    DHFR   ER
2D (Tanimoto)        71.2   63.0   76.9   77.1
3D fingerprint       75.4   67.0   76.9   78.6
3D not discretized   76.4   69.8   81.9   79.8
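The double sum in the pharmacophore kernel can be sketched as follows. The slides do not define P(x) or d precisely, so the choices here are assumptions made for illustration: P(x) is the set of all triplets of atoms, each triplet is summarised by its sorted atom labels and sorted side lengths, and d is the squared difference of the side lengths when the labels agree (pairs with different labels contribute zero). The coordinates and function names are made up.

```python
import math
from itertools import combinations


def pharmacophores(atoms):
    """P(x): all three-point pharmacophores of a molecule.

    `atoms` is a list of (label, (x, y, z)). Each pharmacophore is summarised
    (an assumption of this sketch) by its sorted labels and sorted side lengths.
    """
    result = []
    for triple in combinations(atoms, 3):
        labels = tuple(sorted(label for label, _ in triple))
        sides = tuple(sorted(math.dist(p, q)
                             for (_, p), (_, q) in combinations(triple, 2)))
        result.append((labels, sides))
    return result


def pharmacophore_kernel(mol_x, mol_y, gamma=1.0):
    """K(x, y) = sum over px in P(x), py in P(y) of exp(-gamma * d(px, py)).

    Assumed d: squared difference of the sorted side lengths when the label
    multisets agree, and infinity (zero contribution) otherwise.
    """
    total = 0.0
    for labels_x, sides_x in pharmacophores(mol_x):
        for labels_y, sides_y in pharmacophores(mol_y):
            if labels_x == labels_y:
                d = sum((a - b) ** 2 for a, b in zip(sides_x, sides_y))
                total += math.exp(-gamma * d)
    return total


# Made-up coordinates (in angstroms), for illustration only.
mol_a = [("O", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0)), ("N", (0.0, 4.0, 0.0))]
mol_b = [("O", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0)), ("N", (0.0, 5.0, 0.0))]
mol_c = [("C", (0.0, 0.0, 0.0)), ("C", (1.5, 0.0, 0.0)), ("C", (0.0, 1.5, 0.0))]

k_aa = pharmacophore_kernel(mol_a, mol_a)
k_ab = pharmacophore_kernel(mol_a, mol_b)
k_ac = pharmacophore_kernel(mol_a, mol_c)
```

Because the side lengths are compared as real numbers and never binned, this corresponds in spirit to the "3D not discretized" row of the table, as opposed to a fingerprint built from binned distances.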


Challenges in ligand-based virtual screening

C15H14ClN3O3

How to embed molecules in a Hilbert space through a positive definite function $K(x, x')$ in a computationally efficient way (e.g., complete graph kernels are at least as hard as the graph isomorphism problem)?
How to embed flexible 3D structures?
How to extend machine learning algorithms to non-positive definite functions (e.g., edit distance between graphs)?


Towards chemogenomics

The problem
Similar targets bind similar ligands
Instead of focusing on each target individually, can we screen the biological space (target families) vs the chemical space (ligands)?
Mathematically, learn f(target, ligand) ∈ {bind, not bind}: this is an instance of operator inference.


Conclusion

Computational approaches have the potential to increase global R&D productivity in the pharmaceutical industry, in particular to speed up the identification of targets and hits, and to decrease the attrition rate.
Collaboration with the industry is needed to access large databases of molecules and to have an impact on drug discovery.
Statistics and machine learning are natural approaches for ligand-based virtual screening.
Molecules are complex objects, which can be represented as formulas, structures, or shapes. Applying state-of-the-art machine learning methods to these representations requires mathematical and computational developments.

