SureChem and ChEMBL: ACS CINF webinar, John P. Overington & Nicko Goncharoff, 8th April 2014


TRANSCRIPT

Page 1: SureChem and ChEMBL ACS CINF webinar John P ......SureChEMBL Ligand structures from patent literature ~70M ChEMBL •The world’s largest primary public database of medicinal chemistry

SureChem and ChEMBL

ACS CINF webinar

John P. Overington & Nicko Goncharoff

8th April 2014

Page 2

Bioactivity data

Compound

Assay/Target

>Thrombin

MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE

RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT

NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT

TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT

THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY

CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF

EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR

WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR

ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA

NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG

PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

ChEMBL – Data for Drug Discovery

3. Insight, tools and resources for translational drug discovery

2. Organization, integration, curation and standardization of pharmacology data

1. Scientific facts

Ki = 4.5nM

APTT = 11 min.

Page 3

Overview of EMBL-EBI Chemistry Resources

UniChem – InChI-based resolver (full + relaxed ‘lenses’)

• ChEMBL – Bioactivity data from literature and depositions

• ChEBI – Structures, metadata for metabolites. Chemical Ontology

• Atlas – Ligand induced transcript response

• PDBe – Ligand structures from structurally defined protein complexes

• SureChEMBL – Ligand structures from patent literature

~70M

Page 4

ChEMBL

• The world’s largest primary public database of medicinal chemistry data

– ~1.4 million compounds, ~9,000 targets, ~12 million bioactivities

• Truly Open Data - CC-BY-SA license

• Many download/access formats

– Semantic Web

• RDF download, SPARQL endpoint at http://rdf.ebi.ac.uk/chembl

– ChEMBL Appliances

• myChEMBL – Linux VM

• ChEMpi – Raspberry Pi
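
As a rough illustration of what access through a SPARQL endpoint looks like, the sketch below builds a GET request URL following the SPARQL 1.1 Protocol, where the query text travels in a `query` parameter. The endpoint address in the code is a placeholder and an assumption, not the real service URL; the actual address should be taken from http://rdf.ebi.ac.uk/chembl. The query is deliberately generic and uses no ChEMBL-specific vocabulary.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder (assumption): replace with the SPARQL service address
# published at http://rdf.ebi.ac.uk/chembl.
SPARQL_ENDPOINT = "https://example.org/sparql"

# Generic query: return any five triples. No ChEMBL vocabulary is assumed.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Return a SPARQL 1.1 Protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(SPARQL_ENDPOINT, QUERY)
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

The URL can then be fetched with any HTTP client; the response format depends on the `Accept` header the service supports, which is not covered in these slides.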

Page 5

SureChEMBL

• EMBL-EBI acquired the SureChem product from Digital Science

– State-of-the-art chemistry patent product

– 15 million chemical structures

– Automatically extracted chemical structures from full-text patents

• Research community wants open access to patent data

– Patent literature 2-3 years ahead of published literature

– Better competitive position

• Plan to provide ongoing free, Open resource to entire community

Page 6

SureChEMBL Overview

[Architecture diagram: SureChem System – Amazon Web Services]

Patent Offices: WO; EP (applications & granted); US (applications & granted); JP (abstracts)

Diagram components: Processed patents; Entity Recognition; Name to Structure (five methods); Image to Structure (one method); Molfiles in patent; Chemistry Database; Database; Patent PDFs; Application Server; API; Users

Page 7

Immediate Priorities

• Migrate working pipeline across to EMBL-EBI servers

• Establish new account system

• Migrate current user accounts

• Offer GUI access at SureChem Pro equivalent level

• Turn off API access and refactor new API in OpenPHACTS framework

– Partners in OpenPHACTS will get early test access and input into development pipeline

– Build RDF version of SureChEMBL

Page 8

Future Plans

• Dependent on funding and interest!

– Add sequence searching

– Add disease term, animal disease model, etc. indexing

– KNIME/Pipeline Pilot nodes

– Add links to/from Europe PMC

– Extend image extraction retrospectively from 2006

• spot pricing compute from AWS

– Provide weekly/monthly feed of patent structures to PubChem and ChemSpider

– Add chemical structure tagging & search to full text content of Europe PMC

– Develop UniChem VM for in-house private patent alerting using feed of SureChEMBL data

Page 9

The search interface

http://www.surechembl.org/

[Screenshot labels]

Keyword search

Patent number search

Structure sketch

Paste SMILES, MOL, name

Types of chemistry search

Filter by authority

Filter by document section

Filter by date

Help

Page 10

Keyword-based search

Example Searches

• roche OR novartis

• C07D048704

• sterili?e

• kinase*

• Pfizer C07D “kinase inhibitor”

• pn: WO2011058149A1

• pa:(bayer OR astra OR Genentech OR merck) AND desc:(chemotherap* AND(Phosphoinositide kinases~3 OR Pi3K))

http://support.surechem.com/knowledgebase/articles/92016-lucene-query-field-names-and-examples
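
The fielded examples above follow a regular pattern: a field name, a colon, then a term or a parenthesised boolean group. A minimal sketch of composing such query strings programmatically is shown below. The helper names (`field`, `any_of`, `all_of`) are hypothetical and not part of any SureChem API; the field names `pn`, `pa` and `desc` are taken from the slide. The proximity clause from the last example is left out to keep the sketch simple.

```python
def field(name: str, expr: str) -> str:
    """Prefix a term or parenthesised group with a field name."""
    return f"{name}:{expr}"


def any_of(*terms: str) -> str:
    """Join terms with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Join terms with AND inside parentheses."""
    return "(" + " AND ".join(terms) + ")"


# Patent number lookup, as in the pn: example.
patent_query = field("pn", "WO2011058149A1")

# Applicant and description constraints, a simplified form of the last example.
applicants = field("pa", any_of("bayer", "astra", "Genentech", "merck"))
description = field("desc", all_of("chemotherap*", "Pi3K"))
query = f"{applicants} AND {description}"

print(patent_query)
print(query)
```

The resulting strings are meant to be pasted into the keyword search box; the linked knowledge base article remains the reference for which field names are actually supported.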

Page 11

Fielded keyword search

Keyword search Filter by document section

Logical operators

Page 12

Patent number search

Page 13

Patent number search

Page 14

Chemistry-based search

Structure sketch

Paste SMILES, MOL, name

Types of search

Filter by MW range

Filter by document section

Page 15

Example searches

• Retrieve all antimalarial small molecule US patents

– ic:C07D AND ic:A61P003306 AND pnctry:US

• Retrieve a specific patent

– pn:WO2011058149A1

• Similarity search (sildenafil nearest neighbours)

– Paste CCCc1nn(C)c2C(=O)NC(=Nc12)c3cc(ccc3OCC)S(=O)(=O)N4CCN(C)CC4
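
The `ic:` values in these searches look like compact International Patent Classification symbols. The sketch below decodes them into the conventional printed form, under the assumption (inferred only from the two codes shown in these slides, `A61P003306` and `C07D048704`) that the layout is a four-character subclass, a four-digit zero-padded main group, and then the subgroup digits. The function name `decode_ipc` is hypothetical.

```python
import re

# Assumed layout: subclass (e.g. A61P), 4-digit zero-padded main group, subgroup digits.
_IPC_PATTERN = re.compile(r"([A-H]\d{2}[A-Z])(\d{4})(\d{2,})")


def decode_ipc(code: str) -> str:
    """Convert a compact code such as 'A61P003306' to the form 'A61P 33/06'."""
    match = _IPC_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a compact IPC code in the assumed layout: {code!r}")
    subclass, main_group, subgroup = match.groups()
    # int() strips the zero padding from the main group; the subgroup keeps its digits.
    return f"{subclass} {int(main_group)}/{subgroup}"


print(decode_ipc("A61P003306"))
print(decode_ipc("C07D048704"))
```

Under this reading, the antimalarial search above combines subclass C07D with group A61P 33/06 and restricts the results to US publications via `pnctry:US`.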

Page 16

Example search

Page 17

Review the hits

Page 18

Review the hits

Page 19

Select a subset of hits

Page 20

Export hits (Pro user)

Property range filters

Count filters

Page 21

Select a subset of hits

Page 22

Review patent documents

Page 23

Retrieve patent families

Page 24

Review patent documents

Page 25

Retrieve chemistry (Pro user)

Property range filters

Count filters

Page 26

Summary

• Searching capabilities

– Free text keywords and Lucene fields

– Patent IDs & bibliographic information

– Patent authority & date

– Structure

• Retrieving capabilities

– Retrieve chemistry (with additional filters)

– Retrieve patent family information

– Retrieve annotated full patent text

Page 27

Any questions?

• http://chembl.blogspot.co.uk/

• http://chembl.blogspot.co.uk/search/label/Webinar
