SureChem and ChEMBL
ACS CINF webinar
John P. Overington & Nicko Goncharoff
8th April 2014
Bioactivity data
Compound
Assay/Target
>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
ChEMBL – Data for Drug Discovery
3. Insight, tools and resources for translational drug discovery
2. Organization, integration, curation and standardization of pharmacology data
1. Scientific facts
Ki = 4.5 nM
APTT = 11 min.
Overview of EMBL-EBI Chemistry Resources
UniChem – InChI-based resolver (full + relaxed ‘lenses’)
ChEMBL – Bioactivity data from literature and depositions
ChEBI – Structures, metadata for metabolites. Chemical Ontology
Atlas – Ligand induced transcript response
PDBe – Ligand structures from structurally defined protein complexes
SureChEMBL – Ligand structures from patent literature
~70M
ChEMBL
• The world’s largest primary public database of medicinal chemistry data
– ~1.4 million compounds, ~9,000 targets, ~12 million bioactivities
• Truly Open Data - CC-BY-SA license
• Many download/access formats
– Semantic Web
• RDF download, SPARQL endpoint at http://rdf.ebi.ac.uk/chembl
– ChEMBL Appliances
• myChEMBL – Linux VM
• ChEMpi – Raspberry Pi
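As a rough illustration of how a SPARQL endpoint like the one listed on this slide is typically queried, the sketch below builds a GET request URL in the style of the SPARQL 1.1 Protocol, where the query text travels in a `query` parameter. The base URL is the one quoted on the slide; whether the live SPARQL service sits at exactly that path is an assumption, and the query is deliberately vocabulary-free so that no ChEMBL RDF terms have to be guessed. Nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# URL quoted on the slide; the actual SPARQL service path may differ (assumption).
ENDPOINT = "http://rdf.ebi.ac.uk/chembl"

# Vocabulary-free query: list a handful of arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]

print(url)
print(decoded == QUERY)
```

The resulting URL could be passed to any HTTP client; the response format depends on the server and the `Accept` header sent.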
SureChEMBL
• EMBL-EBI acquired the SureChem product from Digital Science
– State-of-the-art chemistry patent product
– 15 million chemical structures
– Automatically extracted chemical structures from full-text patents
• Research community wants open access to patent data
– Patent literature 2-3 years ahead of published literature
– Better competitive position
• Plan to provide ongoing free, Open resource to entire community
SureChEMBL Overview
Patent Offices
WO
EP – Applications & granted
US – Applications & granted
JP – Abstracts
Processed patents
Name to Structure (five methods)
Image to Structure (one method)
Database
Chemistry Database
Patent PDFs
Application Server
Entity Recognition
Users
API
SureChem System – Amazon Web Services
Molfiles in patent
Immediate Priorities
• Migrate working pipeline across to EMBL-EBI servers
• Establish new account system
• Migrate current user accounts
• Offer GUI access at SureChem Pro equivalent level
• Turn off API access and refactor new API in OpenPHACTS framework
– Partners in OpenPHACTS will get early test access and input into development pipeline
– Build RDF version of SureChEMBL
Future Plans
• Dependent on funding and interest!
– Add sequence searching
– Add disease term, animal disease model, etc. indexing
– KNIME/Pipeline Pilot nodes
– Add links to/from Europe PMC
– Extend image extraction retrospectively from 2006
• spot pricing compute from AWS
– Provide weekly/monthly feed of patent structures to PubChem and ChemSpider
– Add chemical structure tagging & search to full text content of Europe PMC
– Develop UniChem VM for in-house private patent alerting using feed of SureChEMBL data
The search interface
Keyword search
Filter by authority
Structure sketch
Filter by document section
Paste SMILES, MOL, name
Types of chemistry search
Filter by date
Patent number search
Help
http://www.surechembl.org/
Keyword-based search
Example Searches
roche OR novartis
C07D048704
sterili?e
kinase*
Pfizer C07D “kinase inhibitor”
pn: WO2011058149A1
pa:(bayer OR astra OR Genentech OR merck) AND desc:(chemotherap* AND(Phosphoinositide kinases~3 OR Pi3K))
http://support.surechem.com/knowledgebase/articles/92016-lucene-query-field-names-and-examples
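The fielded examples above follow Lucene's `field:(term OR term)` pattern. A minimal sketch of composing such query strings programmatically is given below; the field names `pa`, `desc` and `pn` are taken from the slide, while the helper names `field` and `all_of` are hypothetical and not part of any SureChem or SureChEMBL API. The code only builds strings and does not escape Lucene special characters.

```python
def field(name: str, *terms: str, op: str = "OR") -> str:
    """Build a fielded clause such as pa:(bayer OR astra)."""
    if not terms:
        raise ValueError("at least one term is required")
    if len(terms) == 1:
        return f"{name}:{terms[0]}"
    return f"{name}:(" + f" {op} ".join(terms) + ")"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


# Assignee is any of four companies AND the description mentions both terms.
query = all_of(
    field("pa", "bayer", "astra", "Genentech", "merck"),
    field("desc", "chemotherap*", "Pi3K", op="AND"),
)
print(query)
# pa:(bayer OR astra OR Genentech OR merck) AND desc:(chemotherap* AND Pi3K)

# A single-term clause needs no parentheses.
print(field("pn", "WO2011058149A1"))
# pn:WO2011058149A1
```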
Fielded keyword search
Keyword search
Filter by document section
Logical operators
Patent number search
Chemistry-based search
Structure sketch
Paste SMILES, MOL, name
Types of search
Filter by MW range
Filter by document section
Example searches
• Retrieve all antimalarial small molecule US patents
– ic:C07D AND ic:A61P003306 AND pnctry:US
• Retrieve a specific patent
– pn:WO2011058149A1
• Similarity search (sildenafil nearest neighbours)
– Paste CCCc1nn(C)c2C(=O)NC(=Nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
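The classification codes in these examples appear to be IPC symbols written without separators: `A61P003306` reads as A61P 33/06 (antimalarials) and `C07D048704` as C07D 487/04. The sketch below converts the conventional notation into that compact form and rebuilds the antimalarial query. The padding rule (main group zero-padded to four digits, subgroup kept as written) is inferred from the two codes shown in the slides and is an assumption, not a documented specification; the function name `ipc_to_padded` is hypothetical.

```python
import re


def ipc_to_padded(code: str) -> str:
    """Convert e.g. 'A61P 33/06' to 'A61P003306' (rule inferred from the slides)."""
    m = re.fullmatch(r"\s*([A-H]\d{2}[A-Z])\s*(\d{1,4})/(\d{2,6})\s*", code)
    if m is None:
        raise ValueError(f"not an IPC group symbol: {code!r}")
    subclass, main_group, subgroup = m.groups()
    # Zero-pad the main group to four digits; keep the subgroup digits as given.
    return f"{subclass}{int(main_group):04d}{subgroup}"


# Rebuild the antimalarial example: subclass filter, padded group, US patents only.
query = " AND ".join(
    ["ic:C07D", "ic:" + ipc_to_padded("A61P 33/06"), "pnctry:US"]
)
print(query)
# ic:C07D AND ic:A61P003306 AND pnctry:US
```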
Example search
Review the hits
Select a subset of hits
Export hits (Pro user)
Property range filters
Count filters
Select a subset of hits
Review patent documents
Retrieve patent families
Review patent documents
Retrieve chemistry (Pro user)
Property range filters
Count filters
Summary
• Searching capabilities
– Free text keywords and Lucene fields
– Patent IDs & bibliographic information
– Patent authority & date
– Structure
• Retrieving capabilities
– Retrieve chemistry (with additional filters)
– Retrieve patent family information
– Retrieve annotated full patent text
Any questions?
• http://chembl.blogspot.co.uk/
• http://chembl.blogspot.co.uk/search/label/Webinar