Searching chemical structures in patents using Bayesian statistics - Biovia European User Forum 2017


Upload: willem-van-hoorn

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Searching chemical structures in patents using Bayesian statistics - Biovia European User Forum 2017
Page 2

Patent Searching using Bayesian Statistics

Willem van Hoorn, Exscientia Ltd

Biovia European Forum, London, June 2017

Page 3

Contents

Who are we?

Searching molecules in patents

What can Pipeline Pilot do for you?

And what it cannot

Proof of concept implementation

Page 4

Pioneers of “Centaur” drug discovery

The best of Artificial Intelligence and human drug discovery expertise combined to deliver radical improvements in productivity

Artificial Intelligence (AI): specialist drug design algorithms evolve, design, and propose what to make

Intelligence Augmentation (IA): empowered humans make final decisions and oversee strategy

Multiple applications for small molecule discovery: single target compounds, bispecific small molecules, phenotypic drug design

http://drug.design

Page 5

Page 6

Growth Through Collaborations

Timeline: 2012, 2013, 2014, 2015, 2016, 2017

Bispecific small molecules – immuno-oncology

Bispecific small molecule discovery agreement in metabolic disease

Bispecific small molecules targeting 2 distinct GPCRs – CNS disease

Phenotypic Drug Discovery platform

First identification of bispecific small molecules for dual targets

Company founded

Multi-target discovery agreement

First milestone reached – first candidate delivered

Automated design for single targets

Global pharma (to be announced Q2 2017)

£4M, 50:50, £240M+ value potential

Page 7

Searching structures in patents

“I have a brilliant new compound idea, is it novel?”

Patent text searching is easy (Google)

Molecular structure searching is not

For a quick answer: manual substructure search in SciFinder, SureChEMBL, etc.

Issue: need a substructure query

This is an issue if you want to do this a lot, automatically

Page 8

Searching patents - assumptions

Structures from patents are available

Claimed compounds in a patent have a common substructure

Claimed structures are novel

This looks like a set amenable to modelling

Page 9

A trivial example

Example patent (random): ‘EP-2513087-A1’

Bayesian model: ‘Good’: structures from EP-2513087-A1

‘Bad’: structures from random 200k patents

High scoring molecule shown

Red atoms: high contribution to score (‘novel’)

Phenyl is common and has low contribution

In a similarity search both would be equally important

Page 10

Learn Molecular Categories

Training set

Set of molecules from multiple categories

Typical use case: activity classes

Here: category = patent

Model prediction

Probability that a molecule belongs to each category

What distinguishes a molecule of category A from a molecule of category B

Prediction made on full structure

Page 11

Example with four categories

Each row is a fingerprint feature

Each column is a category

Cell = NormalizedProbability ~ weight of fingerprint

Pos: feature associated with being in this class

Neg: feature associated with not being in this class

Insignificant: -0.05 < value < 0.05
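
How a single cell could be read can be sketched in Python. This is a minimal illustration that only assumes the ±0.05 cut-off quoted on the slide; the function name and the example values are hypothetical.

```python
# Cut-off from the slide: values strictly between -0.05 and 0.05 are insignificant.
THRESHOLD = 0.05


def classify_weight(value, threshold=THRESHOLD):
    """Label a NormalizedProbability cell as positive, negative or insignificant."""
    if value >= threshold:
        return "positive"       # feature associated with being in this class
    if value <= -threshold:
        return "negative"       # feature associated with not being in this class
    return "insignificant"


if __name__ == "__main__":
    # Hypothetical row: one fingerprint feature scored against four categories.
    row = [0.31, -0.12, 0.01, -0.04]
    print([classify_weight(v) for v in row])
```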

Page 12

A patent model

Downloaded SureChEMBL mapping files (Oct 2015)

~1.7M patents with ~20M Claimed structures (including duplicates)

Selected random 1000 patents

Random split: 80% training (152k), 20% test set (38k)

Multi-category Bayesian model based on ECFP_6

A model with 997 categories was derived

Some patents contained so few compounds that none were left in the 80% training set

Predicted patent for test set
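
Why a random split can drop categories (1000 patents selected, 997 in the model) can be illustrated with a small Python sketch. It is not the Pipeline Pilot protocol used in the talk; the function names, the seed and the toy data are assumptions made for illustration.

```python
import random


def random_split(compounds, train_fraction=0.8, seed=0):
    """Randomly assign (patent_id, structure) pairs to a training or test list."""
    rng = random.Random(seed)
    train, test = [], []
    for compound in compounds:
        (train if rng.random() < train_fraction else test).append(compound)
    return train, test


def lost_categories(compounds, train):
    """Patents that occur in the data but have no compound in the training set."""
    all_patents = {patent_id for patent_id, _ in compounds}
    train_patents = {patent_id for patent_id, _ in train}
    return all_patents - train_patents


if __name__ == "__main__":
    # Hypothetical data: one large patent and several single-compound patents.
    data = [("BIG", f"mol{i}") for i in range(100)]
    data += [(f"SMALL{i}", "mol") for i in range(10)]
    train, test = random_split(data)
    print(len(train), len(test), sorted(lost_categories(data, train)))
```

A patent with a single compound has a 20% chance of landing entirely in the test set, so it cannot become a category in the trained model.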

Page 13

Result: ~75% ranks in top 3

[Histogram: rank of true patent among the 997 categories vs. frequency (log scale)]

This looks promising…
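
The "top 3" figure is a top-k fraction over the rank of the true patent for each test compound. A minimal Python sketch follows; the ranks are made-up illustration values, not the talk's data.

```python
def top_k_fraction(ranks, k):
    """Fraction of test compounds whose true patent is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks (1 = true patent scored highest).
    ranks = [1, 1, 2, 3, 5, 40, 1, 2]
    print(top_k_fraction(ranks, 3))  # prints 0.75
```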

Page 14

Fails: reagents, wrong structures

Corpus frequency: Bayesian score

Note these are Claimed structures (really?)

If these are the failures I am not worried

Page 15

The model matrix does not scale

Exponential growth: adding a category adds a column, adding a compound adds ≥ 0 rows (most likely > 0)

Building a full patent model runs out of memory

Page 16

Redundancy in multi-cat models

8878 rows x 4 categories = 35,512 values

But only 1074 unique & significant values (|NP| > 0.05)
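
The redundancy can be illustrated by converting a dense feature-by-category matrix into a sparse list that keeps only the significant cells. This Python sketch assumes the |NP| > 0.05 cut-off from the slide and filters on significance only; the data layout and the example values are hypothetical.

```python
def to_sparse(matrix, threshold=0.05):
    """Keep only significant cells as (feature, category_index, value) triples.

    `matrix` maps a fingerprint feature to a list of NP values, one per category.
    """
    return [
        (feature, category_index, value)
        for feature, row in matrix.items()
        for category_index, value in enumerate(row)
        if abs(value) > threshold
    ]


if __name__ == "__main__":
    # Hypothetical 3 features x 4 categories = 12 dense values.
    dense = {
        101: [0.40, 0.00, 0.01, -0.02],
        102: [0.00, 0.00, 0.00, 0.00],
        103: [-0.30, 0.25, 0.03, 0.00],
    }
    sparse = to_sparse(dense)
    print(len(sparse), "of", sum(len(row) for row in dense.values()), "values kept")
```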

Page 17

Store the model in a database

Databases are designed to deal with large data volumes

Normalisation removes redundancy

Indexing for fast lookup

SQL query to evaluate model

Need the normalised probabilities

And building 1.7M models is too slow

Page 18

Bayes in PilotScript

This is much faster since it skips publishing the model in XMLDB
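
The PilotScript shown on the slide is not part of this transcript. As a rough Python sketch of the kind of calculation involved, the block below uses the commonly described Laplacian-corrected Bayesian estimator, where a feature seen in N samples, A of them "good", gets the weight log((A + 1) / (N * P_base + 1)) and P_base is the overall fraction of "good" samples. That the talk's script uses exactly this formula is an assumption; all names and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """Return {feature: weight} from an iterable of (features, is_good) pairs."""
    samples = list(samples)
    n_total = len(samples)
    n_good = sum(1 for _, is_good in samples if is_good)
    if n_total == 0 or n_good == 0:
        raise ValueError("need at least one sample and one 'good' sample")
    p_base = n_good / n_total          # baseline probability of being 'good'
    seen, good = Counter(), Counter()
    for features, is_good in samples:
        for feature in set(features):
            seen[feature] += 1
            if is_good:
                good[feature] += 1
    # Laplacian-corrected relative estimate, stored as a log weight.
    return {f: math.log((good[f] + 1) / (seen[f] * p_base + 1)) for f in seen}


def score(weights, features):
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(features))


if __name__ == "__main__":
    # Toy data: 'core' only in the patent, 'phenyl' everywhere, 'other' only outside.
    samples = [({"core", "phenyl"}, True)] * 2 + [({"other", "phenyl"}, False)] * 8
    weights = train_laplacian_bayes(samples)
    print(weights, score(weights, {"core", "phenyl"}))
```

In this toy set the feature shared by every sample gets a weight of zero, which matches the earlier observation that a common group such as phenyl contributes little to the score.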

Page 19

Does scripted model work?

For a random patent (EP-2513087-A1)

Create model by script and “Learn Good Molecules”

Evaluate on SampleDrugs.sd

Scores are close (not identical!) but correlate well

[Scatter plots: Script score vs. Component score]

Page 20

Model stored in database

Runtime: ~10.5 hours

(Core i7 laptop, 8 CPU, 32 GB)

Page 21

Model scores by SQL query*

Runtime: 2-3 minutes per compound

select
    pt.sc_patent_id,
    sum(np.normalised_probability) as score
from
    eps_normalised_probabilities np,
    eps_fingerprints fp,
    eps_patents pt
where
    np.fingerprint_id = fp.fingerprint_id
    and np.patent_id = pt.patent_id
    and fp.feature in ('fingerprints of query structure')
group by
    pt.sc_patent_id
order by
    score desc

* Need to correct for features not in this patent, see slide in backups
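
To make the query concrete, here is a small self-contained Python sketch that runs it against an in-memory SQLite database. The table and column names follow the slide; the column types, the SQLite back end, the explicit joins and all rows are assumptions for illustration, and the correction mentioned in the footnote is not applied.

```python
import sqlite3


def build_demo_db():
    """Create the three tables named on the slide with a few made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table eps_patents (patent_id integer primary key, sc_patent_id text);
        create table eps_fingerprints (fingerprint_id integer primary key, feature integer);
        create table eps_normalised_probabilities (
            fingerprint_id integer, patent_id integer, normalised_probability real);
    """)
    con.executemany("insert into eps_patents values (?, ?)",
                    [(1, "PATENT-A"), (2, "PATENT-B")])
    con.executemany("insert into eps_fingerprints values (?, ?)",
                    [(10, 111), (11, 222), (12, 333)])
    con.executemany("insert into eps_normalised_probabilities values (?, ?, ?)",
                    [(10, 1, 0.9), (11, 1, 0.4), (11, 2, 0.3), (12, 2, -0.2)])
    return con


def score_patents(con, query_features):
    """Sum the stored normalised probabilities per patent for the query features."""
    query_features = list(query_features)
    if not query_features:
        return []
    placeholders = ",".join("?" for _ in query_features)  # one bind slot per feature
    sql = f"""
        select pt.sc_patent_id, sum(np.normalised_probability) as score
        from eps_normalised_probabilities np
        join eps_fingerprints fp on np.fingerprint_id = fp.fingerprint_id
        join eps_patents pt on np.patent_id = pt.patent_id
        where fp.feature in ({placeholders})
        group by pt.sc_patent_id
        order by score desc
    """
    return con.execute(sql, query_features).fetchall()


if __name__ == "__main__":
    print(score_patents(build_demo_db(), [111, 222]))
```

Bind parameters are used for the `in (...)` list so that the fingerprint values of the query structure never have to be pasted into the SQL text.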

Page 22

A test

Query: Sildenafil

Top-ranked patents:

• Top-ranked patent lists (all?) known PDE5 inhibitors in its claims
• All other patents also contain Sildenafil
• However: original Sildenafil patent not found?

Page 23

US-5250534-A

Original Sildenafil patent

It is not in the 1.7M patents model

Yet it is in SureChEMBL, with structures

In the downloaded copy, all structures are annotated as ‘Description’, not ‘Claims’.

Therefore ignored…

This is frustrating

Page 24

Conclusions

Have a prototype that shows the technique works

However it is not yet useful

SureChEMBL data not yet accurate enough (Oct 2015)

Claims vs Description, etc.

Multiple structures are wrong

Page 25

EXSCIENTIA LTD, LAB 12 - DUNDEE INCUBATOR

JAMES LINDSAY PLACE, DD1 5JJ, UNITED KINGDOM

[email protected]

Page 26

SureChEMBL

Page 27

Searching virtual libraries

Training set:

~100k compounds, ~5k categories

Page 28

Evaluate model scores

For each compound, fingerprints in @fingerprint_id_in_clause:

1. Get sum(NP) for known fingerprints; keep top 1000

2. Get array of known fingerprints

3. Get Laplacian, Pactive for patent

4. Correct for unknown fingerprints

Get patent name

Runtime: 2-3 minutes
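
The slide does not spell out the correction in step 4. One possible reading, sketched below in Python, is that a query fingerprint known in the corpus but with no stored row for a given patent gets the Laplacian-corrected weight for zero occurrences in that patent, log(1 / (N * Pactive + 1)), where N is the corpus count of the fingerprint and Pactive is the fraction of corpus compounds belonging to the patent. This interpretation, the function names and the numbers are assumptions, not the author's confirmed method.

```python
import math


def absent_feature_weight(corpus_count, p_active):
    """Weight of a feature seen corpus_count times overall but never in this patent."""
    return math.log(1.0 / (corpus_count * p_active + 1.0))


def corrected_score(stored_sum, stored_features, query_features, corpus_counts, p_active):
    """Add the absent-feature weights to the sum of stored NP values for one patent.

    stored_sum      : sum(NP) over query features that have a row for this patent
    stored_features : the query features that contributed to stored_sum
    corpus_counts   : {feature: number of corpus compounds containing it}
    """
    total = stored_sum
    for feature in query_features:
        if feature in corpus_counts and feature not in stored_features:
            total += absent_feature_weight(corpus_counts[feature], p_active)
    return total


if __name__ == "__main__":
    # Hypothetical numbers for a single patent and a three-feature query.
    print(corrected_score(1.0, {"a"}, {"a", "b", "z"}, {"a": 5, "b": 10}, 0.1))
```

Features that never occur in the corpus ("z" in the example) are left out of the sum in this sketch, since no estimate exists for them.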

Page 29