Searching Chemical Structures in Patents Using Bayesian Statistics - BIOVIA European User Forum 2017
TRANSCRIPT
Patent Searching using Bayesian Statistics
Willem van Hoorn, Exscientia Ltd
Biovia European Forum, London, June 2017
Contents
Who are we?
Searching molecules in patents
What can Pipeline Pilot do for you?
And what it cannot
Proof of concept implementation
Artificial Intelligence
Specialist drug design algorithms evolve, design, and propose what to make
Intelligence Augmentation
Empowered humans make final decisions and oversee strategy
Multiple applications for small molecule discovery
Pioneers of “Centaur” drug discovery
The best of Artificial Intelligence and human drug discovery expertise combined to deliver radical improvements in productivity
Single target compounds
Bispecific small molecules
Phenotypic drug design
http://drug.design
Growth Through Collaborations
Timeline: 2012 2013 2014 2015 2016 2017
Bispecific small molecules – immuno-oncology
Bispecific small molecule discovery agreement in metabolic disease
Bispecific small molecules targeting 2 distinct GPCRs – CNS disease
Phenotypic Drug Discovery platform
First identification of bispecific small molecules for dual targets
Company founded
Multi-target discovery agreement
First milestone reached – first candidate delivered
Automated design for single targets
Global pharma (to be announced Q2 2017)
£4M; 50:50; £240M+ value potential
Searching structures in patents
“I have a brilliant new compound idea, is it novel?”
Patent text searching is easy (Google)
Molecular structure searching is not
For a quick answer: manual substructure search in SciFinder, SureChEMBL, etc.
Issue: need a substructure query
This is an issue if you want to do this a lot, automatically
Searching patents - assumptions
Structures from patents are available
Claimed compounds in patent have common substructure
Claimed structures are novel
This looks like a set amenable to modelling
A trivial example
Example patent (random): ‘EP-2513087-A1’
Bayesian model: ‘Good’: structures from EP-2513087-A1
‘Bad’: structures from random 200k patents
High scoring molecule shown
Red atoms: high contribution to score (‘novel’)
Phenyl is common and has low contribution
In a similarity search both would be equally important
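The weighting behind such a model can be sketched in a few lines. This is a minimal illustration only, assuming the Laplacian-corrected estimator usually described for this kind of Bayesian classifier; the slides do not show the formula, and the feature labels (`phenyl`, `scaffold_X`, `amide`, `ester`) are made-up stand-ins for real fingerprint features. A feature seen in every 'Good' molecule but rarely elsewhere gets a high weight, while a common feature such as phenyl stays close to zero.

```python
import math
from collections import Counter

def laplacian_weights(good, bad):
    """Return {feature: log weight} for a two-class ('Good' vs 'Bad') model.

    good, bad: lists of feature sets, one set per molecule.
    Assumed estimator: log((A_f + 1) / (N_f * P_good + 1)), where N_f is the
    number of molecules containing feature f and A_f how many of those are 'Good'.
    """
    n_good = len(good)
    base_rate = n_good / (n_good + len(bad))      # P_good over the training set
    in_good = Counter(f for feats in good for f in feats)
    in_all = Counter(f for feats in good + bad for f in feats)
    return {
        f: math.log((in_good[f] + 1) / (n_feat * base_rate + 1))
        for f, n_feat in in_all.items()
    }

def score(features, weights):
    """Sum the weights of the features present; unseen features add nothing."""
    return sum(weights.get(f, 0.0) for f in features)

# Toy training data with hypothetical feature labels
good = [{"phenyl", "scaffold_X"}, {"phenyl", "scaffold_X", "amide"}]
bad = [{"phenyl"}, {"phenyl", "amide"}, {"phenyl", "ester"}, {"ester"}]
weights = laplacian_weights(good, bad)

for feature in sorted(weights, key=weights.get, reverse=True):
    print(f"{feature:12s} {weights[feature]:+.3f}")
```

In this toy set `scaffold_X` dominates the score, which mirrors the point of the slide: a plain similarity search would treat phenyl and the patent-specific atoms as equally important.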
Learn Molecular Categories
Training set
Set of molecules from multiple categories
Typical use case: activity classes
Here: category = patent
Model prediction
Probability that a molecule belongs to each category
What distinguishes a molecule of category A from a molecule of category B
Prediction made on full structure
Example with four categories
Each row is a fingerprint feature
Each column is a category
Cell = NormalizedProbability ~ weight of fingerprint
Pos: feature associated with being in this class
Neg: feature associated with not being in this class
Insignificant: -0.05 < value < 0.05
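A feature-by-category table of this kind can be sketched as follows. This is a hedged illustration, not the Pipeline Pilot implementation: it assumes each cell is a one-vs-rest Laplacian-corrected log ratio, which may differ from how NormalizedProbability is actually computed. The categories `A`/`B` and features `f1`..`f4` are invented toy data.

```python
import math
from collections import Counter, defaultdict

def multi_category_weights(training):
    """training: list of (category, feature_set) pairs.

    Returns {feature: {category: weight}}, one row per feature and one
    column per category (assumed one-vs-rest Laplacian-corrected estimate).
    """
    n_total = len(training)
    per_category = Counter(cat for cat, _ in training)
    feature_total = Counter()
    feature_in_cat = defaultdict(Counter)
    for cat, feats in training:
        for f in feats:
            feature_total[f] += 1
            feature_in_cat[f][cat] += 1
    table = {}
    for f, n_feat in feature_total.items():
        table[f] = {}
        for cat, n_cat in per_category.items():
            base_rate = n_cat / n_total
            ratio = (feature_in_cat[f][cat] + 1) / (n_feat * base_rate + 1)
            table[f][cat] = math.log(ratio)
    return table

def is_significant(value, threshold=0.05):
    """Cells with -threshold < value < threshold count as insignificant."""
    return not (-threshold < value < threshold)

training = [
    ("A", {"f1", "f2"}),
    ("A", {"f1", "f3"}),
    ("B", {"f2", "f4"}),
    ("B", {"f4"}),
]
table = multi_category_weights(training)
for f in sorted(table):
    cells = "  ".join(
        f"{cat}={v:+.2f}{'' if is_significant(v) else ' (insig.)'}"
        for cat, v in sorted(table[f].items())
    )
    print(f, cells)
```

Here `f1` is positive for category A and negative for B, while `f2`, which occurs equally often in both, lands in the insignificant band.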
A patent model
Downloaded SureChEMBL mapping files (Oct 2015)
~1.7M patents with ~20M Claimed structures (including duplicates)
Selected random 1000 patents
Random split: 80% training (152k), 20% test set (38k)
Multi-category Bayesian model based on ECFP_6
A model with 997 categories was derived.
Some patents contained so few compounds that none were left in the 80% training split
Predicted patent for test set
Result: ~75% ranks in top 3
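A top-3 figure like this can be computed from per-compound category scores as sketched below. The scores and patent labels (`P1`..`P4`) are invented toy values, not results from the talk.

```python
def rank_of_true(scores, true_category):
    """scores: {category: score}. Returns the 1-based rank of the true category."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_category) + 1

def top_k_fraction(ranks, k=3):
    """Fraction of test compounds whose true category ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# (predicted scores per patent, true patent) for four hypothetical test compounds
predictions = [
    ({"P1": 2.1, "P2": 0.3, "P3": -1.0, "P4": -0.2}, "P1"),
    ({"P1": 0.5, "P2": 0.4, "P3": 0.9, "P4": 0.1}, "P2"),
    ({"P1": 1.5, "P2": 0.2, "P3": 0.9, "P4": 0.3}, "P2"),
    ({"P1": -0.3, "P2": 1.2, "P3": 0.0, "P4": 0.8}, "P4"),
]
ranks = [rank_of_true(scores, true) for scores, true in predictions]
print(ranks, top_k_fraction(ranks, k=3))
```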
[Figure: histogram of the rank of the true patent in the top 997; frequency on a log scale]
This looks promising…
Fails: reagents, wrong structures
Corpus frequency: Bayesian score
Note these are Claimed structures (really?)
If these are the failures I am not worried
The model matrix does not scale
Exponential growth: adding a category adds a column, adding a compound adds ≥ 0 rows (most likely > 0)
Building a full patent model runs out of memory
Redundancy in multi-cat models
8878 rows x 4 categories = 35,512 values
But only 1074 unique & significant values (|NP| > 0.05)
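The saving from dropping insignificant cells can be illustrated with a small sketch. The matrix below is invented toy data (three features, four categories); only the 0.05 threshold comes from the slides, and the sketch covers the significance filter only, not the additional de-duplication of identical values.

```python
def to_sparse(matrix, threshold=0.05):
    """Keep only significant cells as (feature, category, value) triples.

    matrix: {feature: {category: value}} in dense form.
    """
    return [
        (feature, category, value)
        for feature, row in matrix.items()
        for category, value in row.items()
        if abs(value) > threshold
    ]

dense = {
    "fp_001": {"A": 0.41, "B": -0.02, "C": 0.00, "D": -0.30},
    "fp_002": {"A": 0.01, "B": 0.03, "C": -0.04, "D": 0.02},
    "fp_003": {"A": -0.69, "B": 0.29, "C": 0.04, "D": 0.00},
}
sparse = to_sparse(dense)
dense_cells = sum(len(row) for row in dense.values())
print(f"{len(sparse)} of {dense_cells} cells kept")
```

Stored as triples, the model grows with the number of significant cells rather than with rows times columns, which is what makes a database table a natural fit.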
Store the model in a database
Databases are designed to deal with large data volumes
Normalisation removes redundancy
Indexing for fast lookup
SQL query to evaluate model
Need the normalised probabilities
And building 1.7M models is too slow
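A normalised layout with an index for lookup by fingerprint might look like the sketch below. The table and column names follow the query shown later in the talk; everything else is an assumption made for illustration: SQLite as the engine (the slides do not name the database), the column types, the keys, and the index name `idx_np_fingerprint`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE eps_patents (
    patent_id    INTEGER PRIMARY KEY,
    sc_patent_id TEXT NOT NULL UNIQUE          -- e.g. 'EP-2513087-A1'
);
CREATE TABLE eps_fingerprints (
    fingerprint_id INTEGER PRIMARY KEY,
    feature        INTEGER NOT NULL UNIQUE     -- fingerprint feature value
);
-- One row per significant (patent, fingerprint) cell of the model matrix
CREATE TABLE eps_normalised_probabilities (
    patent_id              INTEGER NOT NULL REFERENCES eps_patents(patent_id),
    fingerprint_id         INTEGER NOT NULL REFERENCES eps_fingerprints(fingerprint_id),
    normalised_probability REAL NOT NULL,
    PRIMARY KEY (patent_id, fingerprint_id)
);
-- Lookup at query time goes from fingerprint to patents, so index that side
CREATE INDEX idx_np_fingerprint
    ON eps_normalised_probabilities (fingerprint_id);
"""

def create_model_db(path=":memory:"):
    """Create an empty model database and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_model_db()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```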
Bayes in PilotScript
This is much faster since it skips the publication of the model in XMLDB
Does scripted model work?
For a random patent (EP-2513087-A1)
Create model by script and “Learn Good Molecules”
Evaluate on SampleDrugs.sd
Scores are close (not identical!) but correlate well
Script Component
Script Component
Model stored in database
Runtime: ~10.5 hours
(Core i7 laptop, 8 CPU, 32 GB)
Model scores by SQL query*
Runtime: 2-3 minutes per compound
select
  pt.sc_patent_id,
  sum(np.normalised_probability) as score
from
  eps_normalised_probabilities np,
  eps_fingerprints fp,
  eps_patents pt
where
  np.fingerprint_id = fp.fingerprint_id
  and np.patent_id = pt.patent_id
  and fp.feature in (‘fingerprints of query structure’)
group by
  pt.sc_patent_id
order by
  score desc
* Need to correct for features not in this patent, see slide in backups
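The query can be tried end to end on a toy database, as sketched below. SQLite, the column types, the patent labels (`PATENT-A`, `PATENT-B`), and the feature values are all assumptions for illustration; only the table and column names come from the slide. The query features are passed as bound parameters, and the sketch leaves out the correction for features not in a patent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eps_patents (patent_id INTEGER PRIMARY KEY, sc_patent_id TEXT);
CREATE TABLE eps_fingerprints (fingerprint_id INTEGER PRIMARY KEY, feature INTEGER);
CREATE TABLE eps_normalised_probabilities (
    patent_id INTEGER, fingerprint_id INTEGER, normalised_probability REAL);
""")
conn.executemany("INSERT INTO eps_patents VALUES (?, ?)",
                 [(1, "PATENT-A"), (2, "PATENT-B")])
conn.executemany("INSERT INTO eps_fingerprints VALUES (?, ?)",
                 [(10, 111), (11, 222), (12, 333)])
conn.executemany("INSERT INTO eps_normalised_probabilities VALUES (?, ?, ?)",
                 [(1, 10, 0.5), (1, 11, 0.25), (2, 11, 0.125), (2, 12, 1.0)])

def rank_patents(conn, query_features):
    """Sum the stored weights of the query's features per patent, best first."""
    placeholders = ", ".join("?" for _ in query_features)
    sql = f"""
        SELECT pt.sc_patent_id, SUM(np.normalised_probability) AS score
        FROM eps_normalised_probabilities np
        JOIN eps_fingerprints fp ON np.fingerprint_id = fp.fingerprint_id
        JOIN eps_patents pt ON np.patent_id = pt.patent_id
        WHERE fp.feature IN ({placeholders})
        GROUP BY pt.sc_patent_id
        ORDER BY score DESC
    """
    return conn.execute(sql, list(query_features)).fetchall()

# Query molecule with two (hypothetical) fingerprint features
results = rank_patents(conn, [111, 222])
print(results)
```

Grouping by the same column that is selected (`sc_patent_id`) keeps the statement valid on engines that reject non-grouped columns in the select list.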
A test
• Top ranked patent lists (all?) known PDE5 inhibitors in Claim
• All other patents also contain Sildenafil
• However: original Sildenafil patent not found?
Query: Sildenafil
Top-ranked patents:
US-5250534-A
Original Sildenafil patent
It is not in the 1.7M patents model
Yet it is in SureChEMBL, with structures
In downloaded copy, all structures are annotated as ‘Description’, not ‘Claims’.
Therefore ignored…
This is frustrating
Conclusions
Have prototype that shows technique works
However it is not yet useful
SureChEMBL data not yet accurate enough (Oct 2015)
Claims vs Description, etc
Multiple structures are wrong
EXSCIENTIA LTD, LAB 12 - DUNDEE INCUBATOR
JAMES LINDSAY PLACE, DD1 5JJ, UNITED KINGDOM
SureChEMBL
Searching virtual libraries
Training set:
~100k compounds, ~5k categories
Evaluate model scores
For each compound, fingerprints in @fingerprint_id_in_clause:
1. get sum(NP) for known fp; keep top 1000
2. get array of known FP
3. get Laplacian, Pactive for patent
4. correct for unknown fingerprints
get patent name
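One plausible reading of the correction step is sketched below; the slides do not give the formula, so this is an assumption throughout. It supposes that a query feature seen in training but never in a given patent should contribute the Laplacian-corrected value for a zero count, log(1 / (N_f * P_patent + 1)), and that every feature observed in the patent is stored. The function name, argument names, and toy numbers are all hypothetical.

```python
import math

def corrected_score(query_features, patent_weights, feature_counts, p_patent):
    """Score one query molecule against one patent.

    query_features: features of the query molecule
    patent_weights: {feature: stored weight} for this patent (sparse)
    feature_counts: {feature: number of training molecules containing it}
    p_patent:       fraction of training molecules belonging to this patent
    """
    total = 0.0
    for f in query_features:
        if f in patent_weights:
            # Steps 1-2: feature is known for this patent, use the stored weight
            total += patent_weights[f]
        elif f in feature_counts:
            # Steps 3-4: feature exists in training but not in this patent,
            # so apply the assumed zero-count Laplacian-corrected penalty
            total += math.log(1.0 / (feature_counts[f] * p_patent + 1.0))
        # Features never seen in training carry no information and are skipped
    return total

# Toy example: 'a' is stored for the patent, 'b' is known only elsewhere
patent_weights = {"a": 0.5}
feature_counts = {"a": 4, "b": 6}
result = corrected_score(["a", "b", "never_seen"], patent_weights, feature_counts, p_patent=0.5)
print(round(result, 3))
```

Without the penalty term, a patent that matches only one of many query features would score as well as one that matches that same feature and has never seen the rest.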
Runtime: 2-3 minutes