Transformative Utility of InChIKey Searching in the Mother of all Databases
DESCRIPTION
From BioIT Workshop "A Bar Code for Chemical Structures: Using the InChI to Transform Connectivity between Chemistry, Biology, Biomedicine and Drug Discovery" http://figshare.com/articles/BioIT_Workshop_2014_Chem_Bio_via_InChI/1063314
Update June. Workshop attendees had access to all the slide sets via CHI. Some are on slideshare (e.g. from Antony Williams) but I have merged the sets into a PDF in the figshare link above.
Abstract: Google indexing of the InChIKey (IK) has turned the web into a de facto chemical database with well over 50 million unique entries (PMID:23399051). The first block of the IK encodes the molecular skeleton, which can be used to give maximum recall of related structures. For example, Google searching XUKUURHRXDUEBC from atorvastatin displays ~200 low-redundancy links in ~0.3 sec with a low false-positive rate. These include most major databases and less familiar but valuable sources. The simplicity of the IK makes it useful for those less familiar with chemical searching. Advanced Google Search can be used to filter results, image searching gives complementary coverage and there are also hits in Google Scholar. IK searching thus becomes powerfully enabling for reciprocal document-to-database joins from legacy text tombs including over 50 years of biology < > chemistry. Open tools such as chemicalize.org can generate IKs from patents, papers, abstracts or web pages. Open Drug Discovery data on tested, synthesized or even proposed compounds can be globally connected in real-time by surfacing IKs in open laboratory notebooks, Wikis, blogs, Twitter, figshare etc. Following the ChemSpider precedent the IUPHAR/GTP database offers users IK Google searches from all ligand entries including peptides.
TRANSCRIPT
![Page 1: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/1.jpg)
1
www.guidetopharmacology.org
The transformative utility of InChIKey searching in the Mother of all Databases
(a.k.a. Google)
Chris Southan
IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh,
Hugh Robson Building, Edinburgh, EH8 9XD, UK.
![Page 2: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/2.jpg)
2
Outline
• Introduction: the atorvastatin example
• Chem-to-bio context
• IK stats and estimates
• Extracting IKs from documents
• IK database-to-database
• Open Source malaria drug discovery as a testbed
• Caveats and future prospects
![Page 3: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/3.jpg)
3
The precedent
InChI as a web index for molecules
“We have now discovered, serendipitously, that these InChIs have been comprehensively and accurately indexed by the Google search engine. From preliminary exploration it appears that every known document in which an InChI appears has been indexed and that all are retrievable by standard queries with virtually 100% precision. This means that standard Web-based indexers, without any alteration, are capable of acting as completely precise chemical search engines. Although we have many years of developing chemistry on the web, this was an unexpected and very welcome finding”
Murray-Rust et al. 2004 http://lists.w3.org/Archives/Public/public-swls-ws/2004Oct/att-0019/
![Page 4: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/4.jpg)
4
IK example: atorvastatin and metabolites
![Page 5: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/5.jpg)
5
Fast and clean results
parent
para-hydroxy
ortho-hydroxy
![Page 6: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/6.jpg)
6
Inner layer XUKUURHRXDUEBC image search
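As a minimal sketch of the searches shown on the slides above, the snippet below splits an InChIKey into its three blocks and builds Google web and image search URLs for the first (skeleton) block. The function names are hypothetical. Only XUKUURHRXDUEBC appears in the slides; the full atorvastatin key used here is assumed from the public record. The `tbm=isch` image-search parameter and the `site:` filter are assumptions about Google's URL conventions, which may change.

```python
import re
from urllib.parse import urlencode

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Return (skeleton_block, second_block, protonation_flag) or raise ValueError."""
    match = INCHIKEY_RE.match(key.strip().upper())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def google_url(query, images=False, site=None):
    """Build a Google search URL (hypothetical helper; the URL format is an assumption)."""
    if site:
        query = f"{query} site:{site}"  # optional filter to a single domain
    params = {"q": query}
    if images:
        params["tbm"] = "isch"  # assumed image-search switch
    return "https://www.google.com/search?" + urlencode(params)


# Full key assumed for atorvastatin; only the first block is given in the slides.
skeleton, _, _ = split_inchikey("XUKUURHRXDUEBC-KAQMDTKVSA-N")
print(google_url(skeleton))               # web search on the skeleton block
print(google_url(skeleton, images=True))  # image search on the same block
```

Searching only the first block trades precision for recall: it should also retrieve entries that share the skeleton but differ in the later blocks.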
![Page 7: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/7.jpg)
7
Making the chem < > bio join
Biochemistry
Medicinal chemistry
Toxicology
Chemical biology
Systems pharmacology
Metabolomics
Drug discovery
Pharmacology
Chemogenomics
InChIKey
![Page 8: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/8.jpg)
8
Getting biology out of text-tombs is not easy; getting chemistry out is even more difficult
![Page 9: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/9.jpg)
9
Why chem < > bio joining is difficult
• The majority of chemistry embedded in biological reports is specified as semantic names or images
• The MeSH to PubChem connectivity is patchy
• Biologists use sequence database accession numbers, ontologies and gene names widely but chemists rarely use open chemical database IDs
• Most bioactive chemistry in text does not have direct connectivity to databases (unlike GenBank/RefSeq/UniProt < > PubMed)
• Nat. Chem. Biol. is the only bio-journal that mandates PubChem reciprocal linking
• Most authors don’t engage with surfacing and connectivity (e.g. becoming PubChem submitters and/or figshare data depositors)
• Chemists and biologists tend not to communicate easily
• GenBank started in 1982, PubChem in 2004
• Inventors/authors under-cite their own medicinal chemistry patents
![Page 10: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/10.jpg)
10
So how many IKs has Google indexed?
• PubChem ~ 50 million
• ChemSpider ~ 30 million
• PubChem from patents (all sources) ~ 15 million
• PubChem journal sources (PubMed + ChEMBL) ~ 1 million
• Web sources outside the above (no idea)
• Open ELNs (no idea)
Guesstimate 60 million-ish
![Page 11: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/11.jpg)
11
Databases < > documents: IK Googling facilitates reciprocal linking
Abstracts
Patents
Papers
15 mill
0.2 mill (mainly MeSH)
0.9 mill (ChEMBL)
12K
![Page 12: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/12.jpg)
12
IKs with data-supported bioactivity (>biology)
• GVKBIO Online Structure Activity Relationship Database (GOSTAR) = 6.3 million with SAR data from patents and literature (not tagged in PubChem)
• Thomson Pharma = 4.2 million selected examples from patents and literature
• PubChem BioAssay “active” = 0.93 million
• ChEMBL (in PubChem) = 0.88 million
• Thomson Pharma (2013 only) = 0.27 million
• PubMed = 0.23 million
• MeSH “pharmacology” = 12,719
• INN or USAN = 10,707
• Union of last two above = 19,334; intersect = 4,092
• Prous (Thomson) Drugs of the Future = 7,218
• DrugBank approved (via SIDs) = 1,504
Guesstimate for chemistry with a useful level of solubility, stability, specificity and potency (e.g. < 250 nM) in biological systems ~ 0.5 million IKs (but of course we also need low potency and inactives for controls and SAR)
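The union and intersect figures quoted above for the MeSH “pharmacology” and INN/USAN sets are consistent with simple inclusion-exclusion. The short check below uses only the counts from the slide; the variable names are illustrative.

```python
# Counts quoted on the slide.
mesh_pharmacology = 12_719
inn_or_usan = 10_707
intersect = 4_092

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
union = mesh_pharmacology + inn_or_usan - intersect
print(union)  # 19334, matching the union reported on the slide
```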
![Page 13: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/13.jpg)
13
IKs and the representational hextet used in documents and databases
![Page 14: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/14.jpg)
14
Extracting IKs from documents: OPSIN
![Page 15: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/15.jpg)
15
Extracting IKs from documents: chemicalize.org
![Page 16: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/16.jpg)
16
Extracting IKs from documents: OSRA
![Page 17: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/17.jpg)
17
Extracting IKs from documents: sketchers
![Page 18: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/18.jpg)
18
IK call-outs in dbs: extending the link reach
![Page 19: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/19.jpg)
19
Modified peptides/big stuff: connection where similarity struggles
http://www.guidetopharmacology.org/GRAC/LigandDisplayForward?ligandId=2532
![Page 20: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/20.jpg)
20
OSM drug discovery: test bed for open data surfacing and connecting chem > bio
• Team are exploring chemistry surfacing/sharing in real time (e.g. ELNs, Wiki, Github, ChEMBLMalaria for project updates)
• Converted to IK utility (after the necessary evangelizing)
• Global antimalarial drug R&D (open and closed) exemplifies the full range of connectivity issues that IK surfacing can potentially ameliorate
![Page 21: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/21.jpg)
21
Actively unlocking IK connections
![Page 22: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/22.jpg)
22
Name > structure > biology: missing links
![Page 23: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/23.jpg)
23
Where the IK connects……
![Page 24: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/24.jpg)
24
Chemicalize.org: 413 strucs/IKs from WO2011086531
CID 53311393 ->
![Page 25: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/25.jpg)
25
WO2011086531 > chemicalize.org > SAR IC50s > figshare
surfaces and connects (e.g. PubChem)
![Page 26: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/26.jpg)
26
Share structures via open MyNCBI
http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/.
![Page 27: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/27.jpg)
27
DIY surfacing of name < > IK connections
![Page 28: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/28.jpg)
28
Caveats and risks for IK Googling
• Ranking heuristics are opaque and change
• Results shift on short time scales (i.e. irreproducible)
• No API (or good search result set parsers)
• Don’t ignore corroborative searches in well-structured databases
• Searching common IKs is not generally useful (but can filter)
• No good for similarity searching on their own (but you can intersperse with similarity approaches)
• In the relentless war between good and evil (Google versus the SEO Dark Side) dodgy chemical suppliers are always pushing
• There may be future risks of common chemistry swamping
• Names, SMILES or even IUPAC strings may sometimes give Google hits where the IK misses (because it’s not there)
![Page 29: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/29.jpg)
29
What does the future hold/need?
• For manual searching Googling the IK is the “first stop shop”
• InChI world-domination is proceeding
• Inexorable increase in full-text, open access journals and crawled open repositories (e.g. figshare)
• Journals must encourage author chemistry mark-up to include the IK
• More biologists getting into chemistry connections and databases
• Boutique bioactive chemistry databases becoming more discoverable
• SureChEMBL will improve image handling and get crawled
• RSC Journal Archive > ChemSpider
• ContentMine (Murray-Rust et al.) 100 million facts, including journal-extracted chemical structure streaming
• More Open (Source) Drug Discovery > Google crawled ELNs with IKs
• Wider community use of Chemicalize.org for targeted extractions
• New IK via source expansion in ChemSpider and PubChem
![Page 30: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/30.jpg)
30
Thanks and Questions
![Page 31: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/31.jpg)
31
Extras
![Page 32: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/32.jpg)
32
Abstract
Abstract: Google indexing of the InChIKey (IK) has turned the web into a de facto chemical database with well over 50 million unique entries (PMID:23399051). The first block of the IK encodes the molecular skeleton, which can be used to give maximum recall of related structures. For example, Google searching XUKUURHRXDUEBC from atorvastatin displays ~200 low-redundancy links in ~0.3 sec with a low false-positive rate. These include most major databases and less familiar but valuable sources. The simplicity of the IK makes it useful for those less familiar with chemical searching. Advanced Google Search can be used to filter results, image searching gives complementary coverage and there are also hits in Google Scholar. IK searching thus becomes powerfully enabling for reciprocal document-to-database joins from legacy text tombs including over 50 years of biology < > chemistry. Open tools such as chemicalize.org can generate IKs from patents, papers, abstracts or web pages. Open Drug Discovery data on tested, synthesized or even proposed compounds can be globally connected in real-time by surfacing IKs in open laboratory notebooks, Wikis, blogs, Twitter, figshare etc. Following the ChemSpider precedent the IUPHAR/GTP database offers users IK Google searches from all ligand entries including peptides.
![Page 33: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/33.jpg)
33
Patent SAR from WO2011086531: Collating activities via SureChemOpen
CID 53311393 >
![Page 34: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/34.jpg)
34
Triaging document or webpage chemistry
• Identify the structure specification types
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books, & github)
• Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
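The "compare extracted sets for intersects and diffs" step above can be sketched as follows. This is a minimal illustration that assumes the keys are already present as text. Converting names or images to structures needs tools such as OPSIN, OSRA or chemicalize.org and is not shown. The function names are hypothetical and the example keys are made-up placeholders, not real compounds.

```python
import re

# InChIKey-shaped strings in free text: 14 + 10 + 1 uppercase letters, hyphen-separated.
INCHIKEY_IN_TEXT = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def extract_inchikeys(text):
    """Return the set of InChIKey-shaped strings found in text."""
    return set(INCHIKEY_IN_TEXT.findall(text))


def compare_sets(document_keys, database_keys):
    """Split keys into shared, document-only and database-only groups."""
    return {
        "intersect": document_keys & database_keys,
        "document_only": document_keys - database_keys,
        "database_only": database_keys - document_keys,
    }


def skeleton_matches(key, candidates):
    """Return candidates sharing the first (skeleton) block with key."""
    first_block = key.split("-")[0]
    return {c for c in candidates if c.startswith(first_block + "-")}


# Placeholder keys built programmatically (not real compounds).
key_a = "A" * 14 + "-" + "B" * 8 + "SA-N"
key_b = "C" * 14 + "-" + "D" * 8 + "SA-N"
key_c = "A" * 14 + "-" + "E" * 8 + "SA-N"  # same skeleton block as key_a

page_text = f"Compound 1 ({key_a}) and compound 2 ({key_b}) were tested."
database = {key_a, key_c}

result = compare_sets(extract_inchikeys(page_text), database)
print(sorted(result["intersect"]))             # exact matches in both
print(sorted(result["document_only"]))         # candidates for new surfacing
print(sorted(skeleton_matches(key_a, database)))  # first-block relatives
```

Exact-match set operations only cover identical keys; the first-block comparison is a crude extension and does not replace similarity searching.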
![Page 35: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/35.jpg)
35
Orthogonal joining
![Page 36: Transformative Utility of InChIKey Searching in the Mother of all Databases](https://reader033.vdocuments.mx/reader033/viewer/2022052822/554e7c9bb4c90545698b507b/html5/thumbnails/36.jpg)
36
Triage example: a new antimalarial
The MMV390048 code name is linked to an image in press reports but is PubChem- and PubMed-negative