ACS San Diego, March 2012, InChI Symposium
Accessing NCI/CADD Web Resources by InChI
Markus Sitzmann
Computer-Aided Drug Design Group, Chemical Biology Laboratory, Frederick National Laboratory for Cancer Research, NIH, DHHS
http://cactus.nci.nih.gov
Chemical Identifier Resolver (CIR)
http://cactus.nci.nih.gov/chemical/structure
CIR works as a resolver for different chemical structure identifiers or representations. It allows one to convert a given structure identifier into another representation or structure identifier.
Chemical Structure Representations
chemical structure
NCI/CADD Identifiers
InChI/InChIKey
ChemSpider ID
PubChem SID/CID
chemical names
CAS Registry Number
NSC number
FDA UNII
ChemNavigator SID
SMILES
SD File
Chemical Formula
ChEBI ID
PDB Ligand ID
MRV
CML
SYBYL Line Notation
GIF image
many more …
Chemical Structure Databases
InChI
Chemical Identifier Resolver (CIR)
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier.
SD file:
C7H6O2
APtclcactv03051222202D 0 0.00000 0.00000

15 15 0 0 0 0 0 0 0 0999 V2000
2.8660 -2.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 -0.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 0.9400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.0000 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.8660 -2.6800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
4.2690 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.4631 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.4631 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.7321 2.0600 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
1 6 1 0 0 0 0
4 7 1 0 0 0 0
7 8 1 0 0 0 0
7 9 2 0 0 0 0
1 10 1 0 0 0 0
2 11 1 0 0 0 0
3 12 1 0 0 0 0
5 13 1 0 0 0 0
6 14 1 0 0 0 0
8 15 1 0 0 0 0
M END
$$$$
ChemWriter Editor
WPYMKLBDIGXBTP-FZOZFQFYNA-N
names:
benzoic acid
65-85-0
WLN: QVR
Unisept BZA
AIDS018010
Salvo liquid
Benzoic acid-ring-UL-14C
ST5213864
Benzoesaeure
CHEBI:30746
NSC 149
benzenecarboxylic acid
phenylformic acid
Benzoic acid (JP15/USP)
Benzoic acid (TN)
18102_RIEDEL
Aromatic hydroxy acid
Benzoic acid (7CI,8CI,9CI)
Benzoic acid [USAN:JAN]
W213128_ALDRICH
47849_SUPELCO
Acide benzoique [French]
Acido benzoico [Italian]
Benzoate (VAN)
Benzoesaeure [German]
Benzoic acid (natural)
Acide benzoique
Benzeneformic acid
Benzenemethanoic acid
Benzoesaeure GK
Benzoesaeure GV
Benzoic acid, tech.
Carboxybenzene
Kyselina benzoova
Phenylcarboxylic acid
InChIKey: InChIKey=WPYMKLBDIGXBTP-UHFFFAOYSA-N
InChI: InChI=1S/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)
SMILES: C1=CC=C(C=C1)C(O)=O
Chemical Identifier Resolver (CIR)
programmatic URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
if a request is not successful: HTTP 404 status message
examples:
http://cactus.nci.nih.gov/chemical/structure/PGZUMBJQJWIWGJ-ONAKXNSWSA-N/cas
204255-11-8
MIME type: text/plain
http://cactus.nci.nih.gov/chemical/structure/PGZUMBJQJWIWGJ-ONAKXNSWSA-N/image
MIME type: image/gif
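The URL scheme and the 404 behaviour described above can be sketched in Python 3 as follows. This is a minimal illustration, not part of the talk: the helper names `cir_url` and `resolve` and the `opener` parameter are hypothetical, and the sketch assumes the server decodes percent-encoded identifiers.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Build a CIR request URL of the form base/identifier/representation.

    The identifier is percent-encoded (slashes are kept); that the server
    decodes it again is an assumption of this sketch.
    """
    return "{}/{}/{}".format(CIR_BASE, quote(identifier), representation)


def resolve(identifier, representation, opener=urlopen):
    """Return the response text, or None if CIR answers with HTTP 404.

    `opener` is a hypothetical injection point (defaults to urlopen) so the
    function can be exercised without network access.
    """
    try:
        with opener(cir_url(identifier, representation)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            # CIR signals an unresolvable request with a 404 status.
            return None
        raise
```

A call such as `resolve("tamiflu", "cas")` would then issue the request shown in the examples; any other HTTP error is re-raised for the caller to handle.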
Chemical Identifier Resolver (CIR)
• access from Unix shell level (e.g., via wget):
shell > wget -qO - http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas
204255-11-8
• access by programming libraries/languages (e.g. Python):
from urllib.request import urlopen
from urllib.error import HTTPError

url = "http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas"
try:
    # urlopen raises HTTPError (status 404) if the request is not successful
    response = urlopen(url).read().decode()
except HTTPError:
    raise  # replace with your own error handling
print(response)
204255-11-8
InChI/InChIKey
(trivial) names
CAS Registry numbers
IUPAC names (OPSIN)
structure images (GIF, PNG)
chemical properties (MW, formula, …)
Database RegIDs (PubChem, ZINC, eMolecules, ChemSpider ID)
structure files (sdf, pdb, cdx, …)
SMILES
Chemical Identifier Resolver: InChI/InChIKey
CIR
http://cactus.nci.nih.gov/chemical/structure
"identifier":
chemical names
IUPAC names (OPSIN)
CAS numbers
SMILES strings
IUPAC InChI/InChIKeys
NCI/CADD Identifiers
CACTVS HASHISY
NSC number
PubChem SID
ZINC Code
ChemSpider ID
ChemNavigator SID
eMolecule VID
"representation":
/smiles
/names, /iupac_name
/cas
/inchi, /stdinchi
/inchikey, /stdinchikey
/ficts, /ficus, /uuuuu
/image
/file, /sdf
/mw, /monoisotopic_mass
/formula
/twirl
/urls
/chemspider_id
/pubchem_sid
/chemnavigator_sid
Chemical Identifier Resolver (CIR)
[Flow diagram]
http request: identifier, representation
• detection of the identifier type
• identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation (e.g. InChI, GIF image)
• identifier is a hashed structure representation (e.g. InChIKey), chemical name etc.: database lookup in CSDB (e.g. CAS number, chemical name), or retrieval of the structure followed by calculation of the requested structure representation
http response: MIME type
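The identifier-type detection step could look roughly like the sketch below. This is only an illustration under stated assumptions: the talk does not describe how CIR actually detects identifier types, the regular expressions and the names `detect_identifier_type` and `route` are hypothetical, and SMILES or chemical-name detection is deliberately left out.

```python
import re

# Illustrative patterns; assumptions of this sketch, not CIR's rules.
INCHI = re.compile(r"^InChI=1S?/")
INCHIKEY = re.compile(r"^(?:InChIKey=)?[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

# Full structure representations are converted by calculation; hashed
# representations and registry numbers need a database lookup.
ROUTES = {
    "inchi": "calculation",
    "inchikey": "database lookup",
    "cas": "database lookup",
}


def cas_checksum_ok(identifier):
    """Validate the check digit of a CAS Registry Number."""
    match = CAS.match(identifier)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; sum modulo 10.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def detect_identifier_type(identifier):
    """Classify an identifier string, or return 'unknown'."""
    if INCHI.match(identifier):
        return "inchi"
    if INCHIKEY.match(identifier):
        return "inchikey"
    if cas_checksum_ok(identifier):
        return "cas"
    return "unknown"


def route(identifier):
    """Return the processing route for an identifier, or None if unknown."""
    return ROUTES.get(detect_identifier_type(identifier))
```

Anything classified as `unknown` (names, SMILES, database IDs) would need further checks that are outside the scope of this sketch.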
Chemical Structure Database (CSDB)
• ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV database, NIST Webbook, NLM ChemIDplus, ChemSpider, …
• Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, …
current status (as of March 2010):
140 chemical structure databases
120 million structure records
84.6 million unique structures by FICuS
110 million Standard InChIKeys for lookup
Chemical Structure Database (Update 2012)
• PubChem Substance & Compound as separate databases (both updated to 2012)
• ChemNavigator iResearch Library: updated to 2012
• new databases, e.g.
  • Therapeutic Target Database (TTD)
  • Human Metabolome Database (HMDB)
  • DrugBank
• “pull” download of databases also available in PubChem, e.g.
  • DSSTox, ZINC 2012/01, ChEBI 2012/01, ChEMBL13, ChemIDplus 2012/01
• to a limited extent, “historic versions” of databases are archived, e.g. comparison of PubChem Substance 2007 vs 2012 will be possible
Chemical Structure Database (CSDB)
Chemical Structure Normalization
• calculation of a set of parent structures with different sensitivity to chemical features:
[Diagram: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier; output: SDF, SMILES, database]
both the original structure record & the normalized parent structures are archived in the database
Chemical Structure Database (CSDB)
NCI/CADD Identifiers (FICTS, FICuS, uuuuu)
structure normalization: based on CACTVS hashcodes (HASHISY), 16-digit hexadecimal number (64-bit unsigned)
[Figure: histidine (HASHISY 9850FD9F9E2B4E25) and variant structures under the column labels tautomer, salt, S, R, annotated with the following identifiers]
6C16DE2351F9FF50-FICTS
9850FD9F9E2B4E25-FICTS
E92E4BA2869F3611-FICTS, E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICTS, 8A7AD1EB498CC76A-FICuS
E5F83F10C5DB080A-FICTS, E5F83F10C5DB080A-FICuS
9850FD9F9E2B4E25-FICuS
9850FD9F9E2B4E25-uuuuu
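The textual form of these identifiers (a 64-bit unsigned hashcode written as 16 hexadecimal digits, followed by the identifier type) can be handled with a few lines of Python. This sketch covers only formatting and parsing; the hashcode itself is computed by CACTVS and is not reproduced here, and the function names are hypothetical.

```python
import re

# 16 uppercase hex digits, a hyphen, then one of the three identifier types.
NCICADD_ID = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)$")


def format_identifier(hashcode, kind):
    """Render a 64-bit unsigned hashcode as an NCI/CADD identifier string."""
    if not 0 <= hashcode < 2 ** 64:
        raise ValueError("hashcode must fit into 64 unsigned bits")
    if kind not in ("FICTS", "FICuS", "uuuuu"):
        raise ValueError("unknown identifier type: {!r}".format(kind))
    return "{:016X}-{}".format(hashcode, kind)


def parse_identifier(text):
    """Split an identifier string into (hashcode as int, identifier type)."""
    match = NCICADD_ID.match(text)
    if match is None:
        raise ValueError("not an NCI/CADD identifier: {!r}".format(text))
    return int(match.group(1), 16), match.group(2)
```

Two identifiers with the same hexadecimal part but different suffixes, such as `9850FD9F9E2B4E25-FICuS` and `9850FD9F9E2B4E25-uuuuu`, parse to the same integer hashcode.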
Chemical Structure Database (Update 2012)
231 small-molecule databases
367 database releases (full, incremental, “historic versions”)
324 million original database records
Unique structure count:
FICTS: ~118 million
FICuS: ~115 million
uuuuu: ~100 million
Chemical Structure Database (Update 2012)
InChI/InChIKey
InChI/InChIKey (Version 1.04) calculated with four InChI flag sets: Standard (Standard InChIKey), Set 1, Set 2, Set 3
flags listed for Set 1, Set 2 and Set 3: DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
Add H:
Standard Set, Set 1 & Set 2: addition of hydrogen atoms by CACTVS
Set 3: addition of hydrogen atoms by the InChI library
Chemical Structure Database (Update 2012)
InChI/InChIKey
• calculation of InChI/InChIKey Standard set, Set 1, Set 2 & Set 3 for all original structure records and normalized parent structures
Using CIR with InChI/InChIKey
(Partial) InChIKey Lookup
• resolve Standard InChIKey into full structure representation (Ethanol):
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
CCO
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles
CCO
CC[OH2+]
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles
C(C(O)([2H])[2H])[2H]
CC(O)([2H])[2H]
C(CO)([2H])([2H])[2H]
CC[17OH]
C(CO)[2H]
[14CH3]CO
CCO
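The three lookups above differ only in how many hyphen-separated blocks of the InChIKey are kept: the full key, the key without the final protonation character, and the first (connectivity) block alone. A small Python sketch of deriving these partial keys and the corresponding URLs follows; the function names are hypothetical and nothing here contacts the server.

```python
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def inchikey_prefixes(inchikey):
    """Return the full key, the first two blocks, and the first block."""
    prefix = "InChIKey="
    key = inchikey[len(prefix):] if inchikey.startswith(prefix) else inchikey
    blocks = key.split("-")
    if len(blocks) != 3:
        raise ValueError("expected three hyphen-separated blocks")
    # Keep 3, then 2, then 1 block(s), i.e. progressively less specific keys.
    return ["-".join(blocks[:count]) for count in (3, 2, 1)]


def lookup_urls(inchikey, representation="smiles"):
    """Build one CIR URL per (partial) InChIKey."""
    return [
        "{}/{}/{}".format(CIR_BASE, partial, representation)
        for partial in inchikey_prefixes(inchikey)
    ]
```

For the ethanol key, `lookup_urls("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")` yields the three request URLs listed on the slide, from most to least specific.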
Using CIR with InChI/InChIKey
Chemical File Representation
• available file format representations:
alc          Alchemy format
cdxml        CambridgeSoft ChemDraw XML format
cerius       MSI Cerius II format
charmm       Chemistry at HARvard Macromolecular Mechanics file format
cif          Crystallographic Information File
cml          Chemical Markup Language
gjf          Gaussian input data file
gromacs      GROMACS file format
hyperchem    HyperChem file format
jme          Java Molecule Editor format
maestro      Schroedinger MacroModel structure file format
mol          Symyx molecule file
sybyl2/mol2  Tripos Sybyl MOL2 format
mrv          ChemAxon MRV format
pdb          Protein Data Bank
sdf          Symyx Structure Data Format
sdf3000      Symyx Structure Data Format 3000
sln          SYBYL Line Notation
smiles       SMILES
xyz          xyz file format
example (Aspirin):
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/file?format=sdf
Using CIR with InChI/InChIKey
Chemical Structure Images (GIF, PNG)
Buckyball:
http://cactus.nci.nih.gov/chemical/structure/XMWRBQBLMFGWIX-UHFFFAOYSA-N/image?height=300&width=300&bgcolor=black&bondcolor=white
Aspirin:
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/image?height=200&width=200&symbolfontsize=7&footer="Aspirin"
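Both the file and the image representation take their options as URL query parameters. A minimal Python sketch for assembling such URLs is shown below; the helper name `cir_url_with_params` is hypothetical, and only parameters that appear on the slides are used in the examples.

```python
from urllib.parse import urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url_with_params(identifier, representation, **params):
    """Build a CIR URL and append keyword arguments as a query string."""
    url = "{}/{}/{}".format(CIR_BASE, identifier, representation)
    if params:
        # urlencode keeps the keyword order and escapes special characters.
        url += "?" + urlencode(params)
    return url


# Aspirin as SD file, and a buckyball image with the options from the slide.
aspirin_sdf = cir_url_with_params(
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "file", format="sdf")
buckyball_gif = cir_url_with_params(
    "XMWRBQBLMFGWIX-UHFFFAOYSA-N", "image",
    height=300, width=300, bgcolor="black", bondcolor="white")
```

Using `urlencode` rather than string concatenation avoids problems with values that contain spaces or quotes, such as a footer text.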
Using CIR with InChI/InChIKey
3D Chemical Structure Visualization (TwirlyMol)
implemented by Noel O'Boyle (University College Cork, Ireland)
simple JavaScript that allows you to render a rotatable/zoomable 3D representation of a molecule in your web browser
no plugin is needed, only a modern browser:
Chrome Safari FF3.6+ IE9 IE8 IE7 IE6
simple viewer:
http://cactus.nci.nih.gov/chemical/structure/DDPJWUQJQMKQIF-XPNZOOHZSA-N/twirl
embedded into a web page:
<div id="canvas" height="400" width="400"></div>
<script src="http://cactus.nci.nih.gov/chemical/structure/DDPJWUQJQMKQIF-XPNZOOHZSA-N/twirl_cached/canvas"></script>
Using CIR with InChI/InChIKey
3D Chemical Structure Visualization (TwirlyMol)
Restasis
http://www.coronene.com/blog/
http://chemical-quantum-images.blogspot.com
http://baoilleach.blogspot.com/
Using CIR with InChI/InChIKey
Chemical Database URLs
• request database URLs (Restasis):
http://cactus.nci.nih.gov/chemical/structure/DDPJWUQJQMKQIF-XPNZOOHZSA-N/urls/xml
<?xml version="1.0" encoding="UTF-8" ?>
<request string="DDPJWUQJQMKQIF-XPNZOOHZSA-N" representation="urls">
  <data id="1" resolver="stdinchikey" string_class="Standard InChIKey">
    <item id="1" classification="exact" database="ChemSpider" publisher="ChemSpider">http://chemspider.com/structure.4939506</item>
    <item id="2" classification="exact" database="ChemSpider" publisher="PubChem">http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=43028058</item>
    <item id="3" classification="exact" database="NLM ChemIDplus" publisher="NLM">http://chem.sis.nlm.nih.gov/chemidplus/direct.jsp?result=advanced&amp;regno=059865133</item>
    […]
  </data>
</request>
Using CIR with InChI/InChIKey
Chemical Name Lookup
• request (alternative) names (Aspirin):
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/names/xml
<?xml version="1.0" encoding="UTF-8" ?>
<request string="BSYNRYMUTXBXSQ-UHFFFAOYSA-N" representation="names">
  <data id="1" resolver="stdinchikey" string_class="Standard InChIKey">
    <item id="1" classification="pubchem_iupac_name">2-acetyloxybenzoic acid</item>
    <item id="2" classification="pubchem_iupac_openeye_name">2-Acetoxybenzoic acid</item>
    <item id="3" classification="pubchem_generic_registry_name">50-78-2</item>
    <item id="4" classification="pubchem_generic_registry_name">11126-35-5</item>
    <item id="5" classification="pubchem_generic_registry_name">11126-37-7</item>
    <item id="6" classification="pubchem_generic_registry_name">2349-94-2</item>
    <item id="7" classification="pubchem_generic_registry_name">26914-13-6</item>
    <item id="8" classification="pubchem_substance_synonym">NCGC00090977-04</item>
    <item id="9" classification="pubchem_substance_synonym">KBioSS_002272</item>
    <item id="10" classification="pubchem_substance_synonym">SBB015069</item>
    <item id="11" classification="pubchem_substance_synonym">Aspirin</item>
    <item id="12" classification="pubchem_substance_synonym">D00109</item>
    […]
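An XML response of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below works on an abbreviated sample built from three of the items shown above (closing tags added so that it parses); the function name `items_by_classification` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample of a /names/xml response, taken from the slide.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<request string="BSYNRYMUTXBXSQ-UHFFFAOYSA-N" representation="names">
  <data id="1" resolver="stdinchikey" string_class="Standard InChIKey">
    <item id="1" classification="pubchem_iupac_name">2-acetyloxybenzoic acid</item>
    <item id="3" classification="pubchem_generic_registry_name">50-78-2</item>
    <item id="11" classification="pubchem_substance_synonym">Aspirin</item>
  </data>
</request>"""


def items_by_classification(xml_bytes):
    """Group the text of all <item> elements by their classification."""
    root = ET.fromstring(xml_bytes)
    grouped = {}
    for item in root.iter("item"):
        key = item.get("classification")
        grouped.setdefault(key, []).append((item.text or "").strip())
    return grouped


names = items_by_classification(SAMPLE)
```

The same function should work for the `/urls/xml` response, where every item carries the classification `exact` and the item text is a database URL.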
Using CIR with InChI/InChIKey
Chemical Properties
• request molecular weight (Aspirin):
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/weight
180.1598
MIME type: text/plain
/mw                          molecular weight
/formula                     formula
/monoisotopic_mass           monoisotopic mass
/h_bond_donor_count          H bond donor count
/h_bond_acceptor_count       H bond acceptor count
/h_bond_center_count         H bond center count
/rotor_count                 number of rotatable bonds
/effective_rotor_count       number of effectively rotatable bonds
/rule_of_5_violation_count   number of Rule-of-5 violations
/xlogp2                      octanol−water partition coefficient XLOGP2
/aromatic                    compound is aromatic
/macrocyclic                 compound is macrocyclic
/heteroatom_count            heteroatom count
/hydrogen_atom_count         H atom count
/heavy_atom_count            heavy atom count
/deprotonable_group_count    number of deprotonable groups
/protonable_group_count      number of protonable groups
/ring_count                  number of rings
/ringsys_count               number of ring systems
Using CIR with InChI/InChIKey
Chemical Name Pattern Search
• Google-like searches on CIR’s name index (approx. 70 million names)
based on the open source full text search server Sphinx (http://sphinxsearch.com)
example: all chemical names that contain the words “morphine” and “methyl” (name pattern: ‘+morphine +methyl’):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/stdinchikey/xml?resolver=name_pattern
<request string="+morphine +methyl" representation="stdinchikey">
  <data id="1" resolver="name_pattern" notation="Morphine 3-methyl ether">
    <item id="1">InChIKey=OROGSEYTTFOCAN-DNJOTXNNSA-N</item>
  </data>
  <data id="2" resolver="name_pattern" notation="6-Methyl-delta(sup 6)-deoxy-morphine">
    <item id="1">InChIKey=CUFWYVOFDYVCPM-GGNLRSJOSA-N</item>
  </data>
  <data id="3" resolver="name_pattern" notation="Morphine, dihydro-6-methyl-">
    <item id="1">InChIKey=NBKVWIJQJMEQLE-NGTWOADLSA-N</item>
  </data>
  <data id="4" resolver="name_pattern" notation="6-METHYL-MORPHINE ETHER">
    <item id="1">InChIKey=FNAHUZTWOVOCTL-UHFFFAOYSA-N</item>
  </data>
  <data id="5" resolver="name_pattern" notation="Morphine alcoholic methyl ether">
    <item id="1">InChIKey=FNAHUZTWOVOCTL-XSSYPUMDSA-N</item>
  </data>
  <data id="6" resolver="name_pattern" notation="N-Methyl morphine chloride">
    <item id="1">InChIKey=MJNCZWBHCFTYFU-SCLAZZCHSA-N</item>
  </data>
  <data id="7" resolver="name_pattern" notation="Morphine, 7-hydroxy-6,6-dimethoxy-3-O-methyl-">
    <item id="1">InChIKey=URFKRBIESURBKC-UHFFFAOYSA-N</item>
  </data>
</request>
Search name pattern ‘+morphine +methyl’: 7 matching names
Using CIR with InChI/InChIKey
Chemical Name Pattern Search
example: chemical names that contain the words “morphine” and “methyl” but not “hydroxyl” (name pattern: ‘+morphine +methyl -hydroxyl’):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -hydroxyl/stdinchikey/xml?resolver=name_pattern
6 matching names
example: chemical names that contain the substring “morphine” somewhere in the name (name pattern: ‘*morphine*’):
http://cactus.nci.nih.gov/chemical/structure/*morphine*/stdinchikey/xml?resolver=name_pattern
45 matching names
example: chemical names that contain a single character “m” and the word “benzene” in a maximum distance of 3 words (finds meta-substituted aromatic compounds, name pattern: ‘“m benzene”~3’):
http://cactus.nci.nih.gov/chemical/structure/(m benzene)~3/stdinchikey/xml?resolver=name_pattern
22 matching names
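Name patterns contain spaces, plus signs and other characters that are not safe to paste into a URL path. A hedged Python sketch for building such request URLs is given below; it assumes that the server decodes percent-encoded paths, and the function name `name_pattern_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def name_pattern_url(pattern, representation="stdinchikey"):
    """Build the XML request URL for a name pattern search."""
    # Encode every reserved character, including '+', ' ', '*' and '"'.
    path = quote(pattern, safe="")
    query = urlencode({"resolver": "name_pattern"})
    return "{}/{}/{}/xml?{}".format(CIR_BASE, path, representation, query)


morphine_methyl = name_pattern_url("+morphine +methyl")
```

Encoding the plus sign as `%2B` matters because an unescaped `+` may be interpreted as a space by some HTTP stacks, which would change the meaning of the pattern.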
Structure Normalization (Tautomerism)
Structure Normalization
Tautomerism
21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
Structure Normalization
Tautomerism
rule 1: 1.3 (thio)keto/(thio)enol
[O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]
rule 6: 1.3 heteroatom H shift
[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
[Figure: atom-mapped example structures for the 1.3 keto/enol transform and the 1.3 heteroatom H shift]
Structure Normalization
Warfarin - Tautomers
http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/representation
[Figure: structure drawings of the warfarin tautomers (prototropic tautomerism)]
Structure Normalization
Warfarin – FICuS Identifier
http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/ficus
prototropic tautomerism (the same identifier for all 9 structures):
D76B88C0354759F1-FICuS
ring-chain tautomerism:
09BB2FAADA1508A7-FICuS
09BB2FAADA1508A7-FICuS
2F505A3FCA434B3C-FICuS
Structure Normalization
Warfarin – Standard InChIKey
http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/stdinchikey
prototropic tautomerism:
QTXVAVXCBMYBJW-UHFFFAOYSA-N
VWSXIGYSLWNCBN-VAWYXSNFSA-N
GRAAPKVUSREWIL-UHFFFAOYSA-N
FQEPJUOLUDFINX-UHFFFAOYSA-N
UCKRWKACBKRIKB-VAWYXSNFSA-N
NNLYDNMZCAHUOV-UHFFFAOYSA-N
PJVWKTKQMONHTI-UHFFFAOYSA-N
FVSFCRPKSVCTBA-VAWYXSNFSA-N
BBOSKMPTDUUMKL-UHFFFAOYSA-N
ring-chain tautomerism:
LSCYDZJASSKSMJ-UHFFFAOYSA-N
XGIOTBZTMHLTRL-UHFFFAOYSA-N
QUJJIKXCACZKKD-UHFFFAOYSA-N
Structure Normalization
Warfarin – InChIKey (W0 RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T)
prototropic tautomerism:
SAYISSDYYDIVTP-UHFFFAOYNA-N
SAYISSDYYDIVTP-UHFFFAOYNA-N
PMOPDASZKFXBOL-UHFFFAOYNA-N
SAYISSDYYDIVTP-UHFFFAOYNA-N
SAYISSDYYDIVTP-UHFFFAOYNA-N
PMOPDASZKFXBOL-UHFFFAOYNA-N
SAYISSDYYDIVTP-UHFFFAOYNA-N
SAYISSDYYDIVTP-UHFFFAOYNA-N
PMOPDASZKFXBOL-UHFFFAOYNA-N
ring-chain tautomerism:
LSCYDZJASSKSMJ-UHFFFAOYNA-N
FQOKLKCGRHFANU-UHFFFAOYNA-N
FQOKLKCGRHFANU-UHFFFAOYNA-N
Structure Normalization
Warfarin
• “normalize” Standard InChIKey by NCI/CADD’s business rules:
http://cactus.nci.nih.gov/chemical/structure/normalize:QTXVAVXCBMYBJW-UHFFFAOYSA-N/stdinchikey
InChIKey=FQEPJUOLUDFINX-UHFFFAOYSA-N
MIME type: text/plain
[Figure: structure QTXVAVXCBMYBJW-UHFFFAOYSA-N and its normalized form FQEPJUOLUDFINX-UHFFFAOYSA-N]
Structure Normalization
Chemical Operators
• available operators:
add_hydrogens, remove_hydrogens, normalize, ficts, ficus, uuuuu, scaffold_sequence, nostereo, stereoisomers, tautomers
example:
http://cactus.nci.nih.gov/chemical/structure/scaffold_sequence:FQEPJUOLUDFINX-UHFFFAOYSA-N/stdinchikey
XVYBSGQBRUYLNK-UHFFFAOYSA-N
BQLSCAPEANVCOG-UHFFFAOYSA-N
MERGMNQXULKBCH-UHFFFAOYSA-N
[Figure: structure drawings of the three scaffolds]
Schuffenhauer et al., J. Chem. Inf. Model. 2007, 47, 47-58
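Operators are written as a prefix in front of the identifier, separated by a colon. The following Python sketch builds such URLs and rejects operator names that are not in the list above; the function name `operator_url` is hypothetical, and the operator spelling `add_hydrogens` is assumed from the list.

```python
CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"

# Operator names as listed on the slide (spelling of add_hydrogens assumed).
OPERATORS = frozenset([
    "add_hydrogens", "remove_hydrogens", "normalize", "ficts", "ficus",
    "uuuuu", "scaffold_sequence", "nostereo", "stereoisomers", "tautomers",
])


def operator_url(operator, identifier, representation):
    """Build a CIR URL of the form base/operator:identifier/representation."""
    if operator not in OPERATORS:
        raise ValueError("unknown operator: {!r}".format(operator))
    return "{}/{}:{}/{}".format(CIR_BASE, operator, identifier, representation)


warfarin_ficus = operator_url("tautomers", "warfarin", "ficus")
```

Checking the operator name locally gives an immediate error instead of a 404 response from the server.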
Soon: Chemical File Resolver (CFR)
Chemical File Resolver (CFR)
[Diagram: chemical file → HTTP Post → CFR → HTTP Get → chemical file]
• allows conversion of many chemical file formats into another format or other representations
• will have a programmatic URL API & an HTML Web interface
• url’izes all elements of the original file, i.e. provides access to each specific record, field, and any metadata (size, record count, etc.) of the posted file by URLs
• release: Q2/2012 (hopefully)
Chemical File Resolver (CFR)
[Diagram: chemical file → HTTP Post → CFR → HTTP Get → chemical file]
• HTTP: post a file (e.g. with curl), CFR replies with an MD5 hash key:
curl -F upload=@/your/local/file.sdf http://cactus.nci.nih.gov/TEST/chemical/file
> d85b396ed6ced6348a5b402eb8fcfe8b
• accepted formats:
  • chemical file formats: alc, cdxml, cerius, charmm, cif, cml, jme, maestro, mol, mol2, mrv, pdb, sdf, sdf3000, sln, smiles, xyz, …
  • text files with a list of identifiers
Post a plain text file, e.g.:
ethanol
aspirin
InChI=1S/C4H10O/c1-3-5-4-2/h3-4H2,1-2H3
CCOCC
InChIKey=RCINICONZNJXQF-MZXODVADSA-N
InChIKey=QTXVAVXCBMYBJW-UHFFFAOYSA-N
204255-11-8
tautomers:guanine
ChemSpider_ID=1234
Pubchem_SID=456
Chemical File Resolver (CFR)
[Diagram: chemical file → HTTP Post → CFR → HTTP Get → chemical file]
{key} = d85b396ed6ced6348a5b402eb8fcfe8b
• request new file format using the obtained MD5 hash key:
curl http://cactus.nci.nih.gov/TEST/chemical/file/{key}?format={sdf, smi, pdb, cml, …}
• request record 2 and 5 as SMILES string (URL quoted so that the shell does not interpret the “&”):
curl "http://cactus.nci.nih.gov/TEST/chemical/file/{key}?records=2,5&format=smiles"
• get field names:
curl http://cactus.nci.nih.gov/TEST/chemical/file/{key}/fields
• get a specific field value from record n:
curl http://cactus.nci.nih.gov/TEST/chemical/file/{key}/n/{field_name}
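The GET requests above follow a regular pattern, so they can be generated from the hash key. The Python sketch below only builds the URLs; it assumes the TEST endpoint and URL layout exactly as shown on the slides (the service was announced as not yet released), and all function names are hypothetical.

```python
import re
from urllib.parse import quote, urlencode

# Endpoint as shown on the slides; it may not exist in this form.
CFR_BASE = "http://cactus.nci.nih.gov/TEST/chemical/file"


def _check_key(key):
    """Ensure the key looks like an MD5 hash (32 lowercase hex digits)."""
    if not re.fullmatch(r"[0-9a-f]{32}", key):
        raise ValueError("not an MD5 hash key: {!r}".format(key))
    return key


def cfr_convert_url(key, fmt, records=None):
    """URL for converting the posted file (or selected records) to fmt."""
    params = {}
    if records:
        params["records"] = ",".join(str(number) for number in records)
    params["format"] = fmt
    # safe="," keeps the record list readable (2,5 instead of 2%2C5).
    return "{}/{}?{}".format(CFR_BASE, _check_key(key), urlencode(params, safe=","))


def cfr_fields_url(key):
    """URL listing the field names of the posted file."""
    return "{}/{}/fields".format(CFR_BASE, _check_key(key))


def cfr_field_value_url(key, record, field_name):
    """URL for one field value of record number `record`."""
    return "{}/{}/{}/{}".format(
        CFR_BASE, _check_key(key), int(record), quote(field_name, safe=""))
```

For example, `cfr_convert_url("d85b396ed6ced6348a5b402eb8fcfe8b", "smiles", records=[2, 5])` reproduces the second curl request.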
Chemical Structure Web API
[Architecture diagram with the components: Chemical Identifier Resolver, Chemical File Resolver, NCI/CADD web service (http), NCI/CADD Chemical Structure Database (CSDB), CACTVS, OPSIN, external web services, other software packages]
IUPAC InChI/InChIKey Resolver
• (hopefully) there will be many resolvers from different providers with different background:
• publishers
• commercial databases
• free sources and databases: ChemSpider, PubChem, ChEBI, …
• InChI/InChIKey is the perfect tool to interlink the resolvers
• ChemSpider, PubChem and NCI/CADD are working on a test protocol for a federated InChI/InChIKey resolver
IUPAC InChI/InChIKey Resolver
[Diagram: resolver hierarchy]
Clients
IUPAC Root Resolver
  Resolver 1
  Resolver 2
  Resolver 3 (CIR)
    Resolver 3.1
    Resolver 3.2
    Resolver 3.3
http://cactus.nci.nih.gov
http://cactus.nci.nih.gov/blog
Acknowledgments
The InChI Team
Xemistry GmbH, Germany: Wolf-Dietrich Ihlenfeldt
All database providers
ChemNavigator: Scott Hutton, Tad Hurst
University of Cambridge, UK: Daniel Lowe
NCI/CADD Team: Igor Filippov, Marc Nicklaus
University College Cork, Ireland: Noel O’Boyle
Acknowledgments - Software
CACTVS
Python Web Framework
ChemWriter
Python SQL Library
Javascript library
Peter Ertl (Novartis)
Fulltext Search Engine