searching chemical structures in patents using bayesian statistics - biovia european user forum 2017

Patent Searching using Bayesian Statistics

Willem van Hoorn, Exscientia Ltd

Biovia European Forum, London, June 2017

Contents

Who are we?

Searching molecules in patents

What can Pipeline Pilot do for you?

And what it cannot

Proof of concept implementation

Artificial Intelligence

Specialist drug design algorithms evolve, design, and propose what to make

Intelligence Augmentation

Empowered humans make final decisions and oversee strategy

Multiple applications for small molecule discovery

Pioneers of “Centaur” drug discovery

The best of Artificial Intelligence and Human drug discovery expertise combined to deliver radical improvements in productivity

Single target compounds

Bispecific small molecules

Phenotypic drug design

http://drug.design

Growth Through Collaborations

[Timeline figure: 2012, 2013, 2014, 2015, 2016, 2017]

Bispecific small molecules – immuno-oncology

Bispecific small molecule discovery agreement in metabolic disease

Bispecific small molecules targeting 2 distinct GPCRs – CNS disease

Phenotypic Drug Discovery platform

First identification of bispecific small molecules for dual targets

Company founded

Multi-target discovery agreement

First milestone reached – first candidate delivered

Automated design for single targets

Global pharma (to be announced Q2 2017)

£4M 50:50 £240M+ Value potential

Searching structures in patents

“I have a brilliant new compound idea, is it novel?”

Patent text searching is easy (Google)

Molecular structure searching is not

For a quick answer: manual substructure search in SciFinder, SureChEMBL, etc.

Issue: need a substructure query

This is an issue if you want to do this a lot, automatically

Searching patents - assumptions

Structures from patents are available

Claimed compounds in a patent have a common substructure

Claimed structures are novel

This looks like a set amenable to modelling

A trivial example

Example patent (random): ‘EP-2513087-A1’

Bayesian model:

‘Good’: structures from EP-2513087-A1

‘Bad’: structures from 200k random patents

High scoring molecule shown

Red atoms: high contribution to score (‘novel’)

Phenyl is common and has low contribution

In a similarity search both would be equally important

Learn Molecular Categories

Training set

Set of molecules from multiple categories

Typical use case: activity classes

Here: category = patent

Model prediction

Probability that a molecule belongs to each category

What distinguishes a molecule of category A from a molecule of category B

Prediction made on full structure

Example with four categories

Each row is a fingerprint feature

Each column is a category

Cell = NormalizedProbability ~ weight of fingerprint

Pos: feature associated with being in this class

Neg: feature associated with not being in this class

Insignificant: -0.05 < value < 0.05
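As a rough illustration of how one cell of such a matrix could arise, the sketch below computes a per-feature weight with the Laplacian-corrected estimator that is commonly described for this style of Bayesian model. Whether this matches the component's internals exactly is an assumption, and the function name `feature_weight` and the toy counts are invented for the example.

```python
import math

def feature_weight(n_in_category: int, n_total: int, base_rate: float) -> float:
    """Laplacian-corrected log weight of one fingerprint feature for one category.

    n_in_category: training molecules of this category that contain the feature
    n_total:       all training molecules that contain the feature
    base_rate:     fraction of the training set that belongs to this category
    (Assumed formula, not confirmed by the slides.)
    """
    return math.log((n_in_category + 1.0) / (n_total * base_rate + 1.0))

# Toy numbers (not from the talk): a category holding 1% of the training set.
base_rate = 0.01
rare_specific = feature_weight(n_in_category=20, n_total=25, base_rate=base_rate)
common_ring = feature_weight(n_in_category=20, n_total=2000, base_rate=base_rate)
absent = feature_weight(n_in_category=0, n_total=500, base_rate=base_rate)
```

Under this assumption a feature seen almost only in the category gets a positive weight, a feature that occurs at the background rate (such as a common ring) gets a weight near zero, and a feature that occurs elsewhere but never in the category gets a negative weight.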

A patent model

Downloaded SureChEMBL mapping files (Oct 2015)

~1.7M patents with ~20M Claimed structures (including duplicates)

Selected 1000 random patents

Random split: 80% training (152k), 20% test set (38k)

Multi-category Bayesian model based on ECFP_6

A model with 997 categories was derived

Some patents contained so few compounds that none were left in the 80% training split

Predicted patent for test set
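The experiment above can be mimicked on toy data. The sketch below is not the Pipeline Pilot protocol: it assumes the same Laplacian-corrected weight formula as a stand-in for the multi-category component, uses invented string features instead of ECFP_6 fingerprints, and all names (`train`, `rank_of_true`, the `P1`..`P3` patents) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (category, set_of_features). Returns weights[category][feature]."""
    n = len(samples)
    cat_size = Counter(cat for cat, _ in samples)
    feat_total = Counter()
    feat_in_cat = defaultdict(Counter)
    for cat, feats in samples:
        for f in feats:
            feat_total[f] += 1
            feat_in_cat[cat][f] += 1
    weights = {}
    for cat, size in cat_size.items():
        base = size / n  # fraction of the training set in this category
        weights[cat] = {
            f: math.log((feat_in_cat[cat][f] + 1.0) / (total * base + 1.0))
            for f, total in feat_total.items()
        }
    return weights

def rank_of_true(weights, feats, true_cat):
    """1-based rank of the true category when categories are sorted by summed weight."""
    scores = {cat: sum(w.get(f, 0.0) for f in feats) for cat, w in weights.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_cat) + 1

# Toy "patents": each has a characteristic core plus decorations shared with others.
samples = []
for patent in ("P1", "P2", "P3"):
    for i in range(20):
        feats = {f"{patent}_core", f"{patent}_sub{i % 4}", f"ring{i % 5}"}
        samples.append((patent, feats))

random.Random(0).shuffle(samples)
split = int(0.8 * len(samples))          # random 80/20 split
train_set, test_set = samples[:split], samples[split:]

weights = train(train_set)
ranks = [rank_of_true(weights, feats, cat) for cat, feats in test_set]
top1_fraction = sum(r == 1 for r in ranks) / len(ranks)
```

Collecting `ranks` over the test set is the toy analogue of the rank distribution reported on the next slide.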

Result: ~75% ranks in top 3

[Figure: frequency (log scale) against rank of the true patent among the 997 categories]

This looks promising…

Failures: reagents, wrong structures

Corpus frequency: Bayesian score

Note these are Claimed structures (really?)

If these are the failures I am not worried

The model matrix does not scale

Exponential growth: adding a category adds a column, adding a compound adds ≥ 0 rows (most likely > 0)

Building a full patent model runs out of memory

Redundancy in multi-cat models

8878 rows x 4 categories = 35,512 values

But only 1074 unique & significant values (|NP| > 0.05)
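The significance cut-off suggests a sparse representation. The sketch below shows only that filtering step on an invented 4 x 3 matrix; it does not deduplicate repeated values, and the helper name `significant_cells` is hypothetical.

```python
def significant_cells(matrix, threshold=0.05):
    """Return {(row, column): value} for cells with |value| above the threshold."""
    return {
        (r, c): v
        for r, row in enumerate(matrix)
        for c, v in enumerate(row)
        if abs(v) > threshold
    }

# Toy matrix: rows are fingerprint features, columns are categories
# (values invented for illustration).
dense = [
    [0.00, 0.01, 1.20],
    [-0.03, 0.00, 0.00],
    [0.75, -0.40, 0.02],
    [0.00, 0.00, 0.00],
]
sparse = significant_cells(dense)
```

Here 12 dense cells shrink to 3 stored entries, the same kind of reduction as 35,512 values to 1074 on the slide.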

Store the model in a database

Databases are designed to deal with large data volumes

Normalisation removes redundancy

Indexing for fast lookup

SQL query to evaluate model

Need the normalised probabilities

And building 1.7M models is too slow

Bayes in PilotScript

This is much faster since it skips publishing the model to the XMLDB

Does scripted model work?

For a random patent (EP-2513087-A1)

Create model by script and “Learn Good Molecules”

Evaluate on SampleDrugs.sd

Scores are close (not identical!) but correlate well

[Figure: scripted model score (Script) against component model score (Component)]

Model stored in database


Runtime: ~10.5 hours (Core i7 laptop, 8 CPUs, 32 GB)

Model scores by SQL query*

Runtime: 2-3 minutes per compound

    select
        pt.sc_patent_id,
        sum(np.normalised_probability) as score
    from
        eps_normalised_probabilities np,
        eps_fingerprints fp,
        eps_patents pt
    where
        np.fingerprint_id = fp.fingerprint_id
        and np.patent_id = pt.patent_id
        and fp.feature in ('fingerprints of query structure')
    group by
        pt.sc_patent_id
    order by
        score desc

* Need to correct for features not in this patent, see slide in backups
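To try the query shape without the real database, the sketch below builds a tiny in-memory SQLite database. The table and column names follow the query on the slide; the column types, keys, toy rows and the helper `score_patents` are assumptions, and the correction mentioned in the footnote is not applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table eps_patents (patent_id integer primary key, sc_patent_id text);
    create table eps_fingerprints (fingerprint_id integer primary key, feature integer unique);
    create table eps_normalised_probabilities (
        fingerprint_id integer references eps_fingerprints,
        patent_id integer references eps_patents,
        normalised_probability real,
        primary key (fingerprint_id, patent_id)
    );
""")
# Toy rows, invented for illustration.
conn.executemany("insert into eps_patents values (?, ?)",
                 [(1, "PATENT-A"), (2, "PATENT-B")])
conn.executemany("insert into eps_fingerprints values (?, ?)",
                 [(10, 111), (11, 222), (12, 333)])
conn.executemany("insert into eps_normalised_probabilities values (?, ?, ?)",
                 [(10, 1, 1.5), (11, 1, 0.5), (11, 2, 0.25), (12, 2, 2.0)])

def score_patents(conn, query_features):
    """Sum the stored normalised probabilities per patent for the query's features."""
    placeholders = ", ".join("?" for _ in query_features)
    sql = f"""
        select pt.sc_patent_id, sum(np.normalised_probability) as score
        from eps_normalised_probabilities np
        join eps_fingerprints fp on np.fingerprint_id = fp.fingerprint_id
        join eps_patents pt on np.patent_id = pt.patent_id
        where fp.feature in ({placeholders})
        group by pt.sc_patent_id
        order by score desc
    """
    return conn.execute(sql, list(query_features)).fetchall()

hits = score_patents(conn, [111, 222])
```

Binding the feature list through `?` placeholders avoids pasting fingerprint values into the SQL text; only the number of placeholders is formatted into the string.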

A test

[Figure: query structure (Sildenafil) and list of top-ranked patents]

• Top-ranked patent lists (all?) known PDE5 inhibitors in Claim

• All other patents also contain Sildenafil

• However: original Sildenafil patent not found?

US-5250534-A

Original Sildenafil patent

It is not in the 1.7M patents model

Yet it is in SureChEMBL, with structures

In the downloaded copy, all structures are annotated as ‘Description’, not ‘Claims’

Therefore ignored…

This is frustrating

Conclusions

Have a prototype that shows the technique works

However it is not yet useful

SureChEMBL data not yet accurate enough (Oct 2015)

Claims vs Description, etc.

Multiple structures are wrong

EXSCIENTIA LTD, LAB 12 - DUNDEE INCUBATOR

JAMES LINDSAY PLACE, DD1 5JJ, UNITED KINGDOM


SureChEMBL


Searching virtual libraries


Training set:

~100k compounds, ~5k categories

Evaluate model scores

For each compound, with fingerprints in @fingerprint_id_in_clause:

1. Get sum(NP) for known fingerprints; keep top 1000

2. Get array of known fingerprints

3. Get Laplacian, Pactive for patent

4. Correct for unknown fingerprints

Get patent name

Runtime: 2-3 minutes
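The slide does not spell out the correction for unknown fingerprints. One plausible reading, sketched below, is that the database stores values only for features that occur in a patent, so a feature seen elsewhere in the corpus but not in that patent must be scored on the fly. The sketch assumes the Laplacian-corrected weight log((A + 1) / (N * P + 1)) with A = 0 for such features; the function name `score_patent` and all toy values are hypothetical.

```python
import math

def score_patent(query_features, stored_np, corpus_count, p_active):
    """Score one patent for a query molecule.

    stored_np:    {feature: normalised probability} for features seen in this patent
    corpus_count: {feature: number of training molecules containing it, all patents}
    p_active:     fraction of training molecules that belong to this patent
    """
    score = 0.0
    for f in query_features:
        if f in stored_np:
            # Feature occurs in this patent: use the stored value.
            score += stored_np[f]
        elif f in corpus_count:
            # Feature occurs only in other patents: A = 0 in the assumed formula,
            # which gives a negative contribution.
            score += math.log(1.0 / (corpus_count[f] * p_active + 1.0))
        # Features never seen in training contribute nothing.
    return score

# Toy values, invented for illustration.
example_score = score_patent(
    query_features=[1, 2, 3],
    stored_np={1: 2.0},
    corpus_count={1: 10, 2: 100},
    p_active=0.01,
)
```

In this toy case feature 1 adds its stored value, feature 2 is penalised because it is common elsewhere but absent from the patent, and feature 3 is ignored.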
