InChI keys as standard global identifiers in chemistry web services Russ Hillard ACS, Salt Lake City March 2009


InChI keys as standard global identifiers

in chemistry web services

Russ Hillard ACS, Salt Lake City

March 2009

Context of this talk

•  We have created a web service

•  That aggregates sources built independently
   - Dozens of individual databases
   - Containing molecules and reactions
   - Created using non-standardized business rules (wrt chemical representation)

•  Covers large record sets
   - 30+ million unique molecules from combined sources
   - 5+ million unique reactions from combined sources

•  Requires integration across all sources
   - Based on shared chemical entities
   - Where “entity” means chemical compound(s)
   - And each “chemical compound” has unique identifiers

   - Chemical structure elucidated by scientists
   - Systematic chemical name derived from structure
   - Graphic representation of structure assigned at registration
   - Trivial chemical name assigned to structure
   - Registry number assigned to structure
   - Key or string computed from structure

The basic problem . . .

ChemInform (FIZ Chemie)

Beilstein (Elsevier) BRN3936786

Curr. Chem Reactions (Thomson)

BRN3936786

5693-99-2 stereochem unspecified
71403-94-6 relative stereochem
121651-02-3 absolute stereochem (2R,3S)
126720-47-6 absolute stereochem (2S,3R)

trans-3-phenyloxirane-carboxaldehyde
(2R*,3R*)-2,3-epoxycinnamaldehyde
trans-cinnamaldehyde epoxide
Epoxyzimtaldehyd

(2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)

•  Don’t always have or know the BRN, CASRN, ChemSpiderID, MFCD# . . .

•  Relationship of Structure:RegNumbers is often 1:many

One solution

•  Define our own set of registration rules

•  Register all structures to one big database
   - Normalize structures according to our rules

•  Assign a unique record identifier (URI) to the normalized structures

•  Correlate our URIs to the native sources

•  Use our URIs to correlate records across different databases

•  We have done this but have not exposed the URIs
   - Even with modern computers this is resource intensive
   - Problem is compounded when data is from different providers
   - Does the world really need another “Global Registry Number”?
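As a minimal sketch of the registration scheme outlined above (not the actual implementation, which this talk does not describe), the following Python registers structures from several sources under one internal identifier. The class name, the `normalize` rule, and the source and record names are all hypothetical; real normalization would apply chemical business rules rather than string clean-up.

```python
from collections import defaultdict
from itertools import count


class StructureRegistry:
    """Toy registry: one internal identifier (URI) per normalized structure."""

    def __init__(self):
        self._next_uri = count(1)
        self._uri_by_structure = {}            # normalized structure -> URI
        self._native_ids = defaultdict(set)    # URI -> {(source, native id)}

    @staticmethod
    def normalize(structure):
        # Hypothetical stand-in for real registration rules
        # (salts, charges, stereo conventions, and so on).
        return structure.strip().lower()

    def register(self, structure, source, native_id):
        """Register a structure from a source and return its internal URI."""
        key = self.normalize(structure)
        if key not in self._uri_by_structure:
            self._uri_by_structure[key] = next(self._next_uri)
        uri = self._uri_by_structure[key]
        self._native_ids[uri].add((source, native_id))
        return uri

    def native_ids(self, uri):
        """Return every (source, native id) pair correlated with this URI."""
        return set(self._native_ids.get(uri, set()))


registry = StructureRegistry()
uri_a = registry.register("Structure-A", "source_a", "A-001")
uri_b = registry.register(" structure-a ", "source_b", "B-XYZ")
print(uri_a == uri_b)                       # both records share one URI
print(sorted(registry.native_ids(uri_a)))   # native ids from both sources
```

Every incoming structure has to pass through `normalize` and a lookup, which hints at why this approach becomes resource intensive at tens of millions of records.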

As currently implemented this gives:

ChemInform (FIZ Chemie)

Great for internal correlations:
- Reactions
- Commercial Availability
- Toxicity
- Bioactivity . . . etc

Molecules:
- Synthetic preparations of
- Organic reactions of
- Toxicity . . . etc

But what about external correlations?
- Anything we don’t/can’t index
- Commercial data
- Proprietary data

Alternative solution

•  Assume structures as registered are correct
   - Accept that we cannot always normalize according to our rules

•  Use a derived (calculated) compound identifier

•  Is this possible?
   - IUPAC Name
   - Wiswesser Line Notation (WLN)
   - Molfile and its derivatives
   - SEMA Key
   - MDL Line Notation
   - SMILES
   - Chemical Markup Language (CML)
   - InChI Name
   - InChI Key
   - NEMA key

Will focus on these two options: InChI Key and NEMA key

IUPAC - International Chemical Identifier

The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations.

The initial work focused on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward, step in creating an identifier.

From: http://www.iupac.org/web/ins/2000-025-1-800

For this presentation all InChI Keys are generated using:

final standard InChI/InChIKey v. 1.02 software

The Morgan Algorithm

Invented by H. L. Morgan, J. Chem. Doc., 5, 107 (1965)
- Underpins many of the systems in use today
- The basis of CAS Online

Each atom is assigned an extended connectivity value. The atom with the highest value becomes the first atom in the name, and its neighbors are then listed in descending order of value. Ties are resolved using additional parameters, for example bond order and atomic number.

Does not handle stereochemistry

SEMA was developed to handle stereoisomers
- W. T. Wipke and T. M. Dyott, J. Amer. Chem. Soc., 96, 4825 (1974).
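The extended connectivity idea described on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the CAS or SEMA implementation: it assumes a molecule given as a plain adjacency mapping of heavy atoms, uses the common textbook stopping rule (stop when the number of distinct values no longer increases), breaks ties by input order rather than by bond order or atomic number, and ignores stereochemistry. The function names are hypothetical.

```python
def extended_connectivity(adjacency):
    """Return Morgan-style extended connectivity values for each atom.

    adjacency maps an atom label to the list of its bonded neighbors.
    """
    # Start from the plain connectivity (number of neighbors) of each atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbors' current values.
        summed = {atom: sum(values[n] for n in nbrs)
                  for atom, nbrs in adjacency.items()}
        summed_classes = len(set(summed.values()))
        # Stop once an iteration no longer separates more atoms.
        if summed_classes <= n_classes:
            return values
        values, n_classes = summed, summed_classes


def morgan_order(adjacency):
    """Number atoms breadth-first, highest extended connectivity first."""
    ec = extended_connectivity(adjacency)
    start = max(adjacency, key=lambda atom: ec[atom])
    order, seen = [start], {start}
    for atom in order:  # the list grows as neighbors are appended
        ranked = sorted(adjacency[atom], key=lambda a: ec[a], reverse=True)
        for nbr in ranked:
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
    return order


# Carbon skeleton of n-pentane: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(extended_connectivity(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
print(morgan_order(pentane))           # [2, 1, 3, 0, 4]
```

In the pentane example the central carbon gets the highest value and is numbered first; the two pairs of symmetry-equivalent carbons receive identical values, which is the "equivalent atoms being assigned identical numbers" behavior mentioned in the IUPAC quote earlier.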

NEMA

NEMA produces a unique name and key for a wider range of structures than SEMA. It extends perception to non-tetrahedral stereogenic centers, supports both 2D and 3D stereochemistry perception, and has no atom limit. It is proprietary to Symyx but is exposed in our products; for example, Symyx Draw and Symyx Direct generate NEMA keys.

The work of Wipke et al. identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.

W. T. Wipke, S. Krishnan, and G. I. Ouchi, J. Chem. Inf. Comput. Sci., 18, 32 (1978)

Tautomers (mobile H atoms)

Different structures

Different systematic names

Presumably exist in equilibrium

InChI Keys are identical

NEMA Keys are different

Both structures are registered to our collection

57531-38-1 assigned to both structures

4(5)-chloro-5(4)-nitroimidazole
5(4)-chloro-4(5)-nitroimidazole
4-chloro-5-nitroimidazole
5-chloro-4-nitroimidazole
4-chloro-5-nitro-1(3)H-imidazole

Tautomers (“mobile hydrogen atoms”)

Same InChI Key

Different NEMA Keys

Mesomers

Same InChI Key

Mesomers ideally would have the same identifier

Different NEMA Keys

Both structures are registered to our collection

Methylene blue 61-73-4

Mesomers?

Same InChI Key, different NEMA Keys

Same InChI Key, same NEMA Keys

Stereoisomers

Pure enantiomer

Enantiomeric pair

No stereo

InChI does not distinguish a pure enantiomer from a racemate

Relative versus absolute stereochemistry

Indistinguishable based on InChI Key

Absolute Stereochemistry InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N

3 unique NEMA Keys

1s

1S

2R

InChI Key = XARGIVYWQPXRTC-DTWKUNHWSA-N

Concern with stereochem goes back to…..

ChemInform (FIZ Chemie)

Beilstein (Elsevier) BRN3936786

Curr. Chem Reactions (Thomson)

BRN3936786

5693-99-2 stereochem unspecified
71403-94-6 relative stereochem
121651-02-3 absolute stereochem (2R,3S)
126720-47-6 absolute stereochem (2S,3R)

trans-3-phenyloxirane-carboxaldehyde
(2R*,3R*)-2,3-epoxycinnamaldehyde
trans-cinnamaldehyde epoxide
Epoxyzimtaldehyd

(2S,3R)-3-phenyl-oxirane-2-carbaldehyde (Autonom)

•  Don’t always have or know the BRN, CASRN, ChemSpiderID, MFCD# . . .

•  Relationship of Structure:RegNumbers is often 1:many

Typically problematic structures

Definitely the same compound

Same InChI Key

Different NEMA Keys

Typically problematic compounds

Just the tip of the iceberg

Organometallics

Inorganics

Layered structure of InChI Keys

AAAAAAAAAAAAAA-BBBBBBBBCD

AAAAAAAAAAAAAA = skeleton

BBBBBBBB = structural features: mobile hydrogens, isotopes, metal bonds ...

C = flag, InChI version . . .

D = check character

The ability to group InChI Keys into classes of related structures sets them apart
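To illustrate how the layered layout supports grouping related structures, here is a small Python sketch that splits a key into its blocks and compares skeletons. It accepts both the two-block layout drawn on this slide and keys with a third, single-letter block, such as the XARGIVYWQPXRTC-DTWKUNHWSA-N key quoted earlier. In the standard 27-character InChIKey those trailing positions carry a standard/non-standard flag, a version letter, and a final protonation indicator; treat that mapping as an assumption relative to this slide, which lists a check character instead. The type and function names are hypothetical.

```python
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str   # first block, 14 letters
    features: str   # first 8 letters of the second block
    flags: str      # last 2 letters of the second block
    suffix: str     # letters after a second hyphen, "" if absent


def split_inchikey(key):
    """Split an InChI Key into its blocks, validating lengths only."""
    blocks = key.strip().upper().split("-")
    if len(blocks) not in (2, 3):
        raise ValueError(f"unexpected number of blocks in {key!r}")
    first, second = blocks[0], blocks[1]
    if len(first) != 14 or len(second) != 10:
        raise ValueError(f"unexpected block lengths in {key!r}")
    if not (first + second).isalpha():
        raise ValueError(f"non-letter characters in {key!r}")
    suffix = blocks[2] if len(blocks) == 3 else ""
    return InChIKeyParts(first, second[:8], second[8:], suffix)


def same_skeleton(key_a, key_b):
    """True when two keys share the first (skeleton) block."""
    return split_inchikey(key_a).skeleton == split_inchikey(key_b).skeleton


parts = split_inchikey("XARGIVYWQPXRTC-DTWKUNHWSA-N")
print(parts.skeleton)  # XARGIVYWQPXRTC
print(parts.features)  # DTWKUNHW
```

Note that this only slices the string. The blocks are hashes, so nothing about the structure can be read back from them; they can only be compared.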

InChI key resolution using ChemSpider

Full InChI key search

Partial InChI key search
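The difference between a full and a partial key search can be sketched against a small in-memory index. This is only an illustration of the idea and makes no claim about how ChemSpider implements its search; the record names and two of the three keys are hypothetical placeholders, and "partial" is assumed here to mean matching on the first (skeleton) block.

```python
from collections import defaultdict


def build_indexes(records):
    """Index (record id, InChI Key) pairs by full key and by first block."""
    full, skeleton = defaultdict(list), defaultdict(list)
    for record_id, key in records:
        full[key].append(record_id)
        skeleton[key.split("-")[0]].append(record_id)
    return full, skeleton


def full_key_search(indexes, key):
    """Records whose key matches exactly."""
    return list(indexes[0].get(key, []))


def partial_key_search(indexes, key_or_first_block):
    """Records sharing the first block, i.e. the same skeleton."""
    first_block = key_or_first_block.split("-")[0]
    return list(indexes[1].get(first_block, []))


records = [
    ("rec-1", "XARGIVYWQPXRTC-DTWKUNHWSA-N"),  # key quoted in this talk
    ("rec-2", "XARGIVYWQPXRTC-UHFFFAOYSA-N"),  # hypothetical: same skeleton
    ("rec-3", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),  # hypothetical placeholder key
]
indexes = build_indexes(records)
print(full_key_search(indexes, "XARGIVYWQPXRTC-DTWKUNHWSA-N"))  # ['rec-1']
print(partial_key_search(indexes, "XARGIVYWQPXRTC"))  # ['rec-1', 'rec-2']
```

A partial search widens the hit list to every registered form of the same skeleton, which is useful when the stereochemistry of the query or of the registered record is uncertain.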

There is still plenty to do……

Biologics
- Average pipeline contains 22% biologics
- Some companies are near 50%
- Peptides & modified peptides
- Nucleic acid sequences

Generics
- Markush structures

Polymers
- Repeating monomers
- Block copolymers
- Cross-linked polymers

So what should go into our web service?

•  Unique chemical structures registered to Compound Index

•  Unique reaction structures registered to Reaction Index

•  Assigned global identifiers as available
   - Registry numbers (BRN, CASRN, MFCD#s, PubChemIDs . . .)

•  Computed global identifiers for all compounds
   - InChI strings
   - InChI Keys
   - NEMA Keys

•  Register InChI Keys to ACD and other Symyx databases

•  Let the consumer decide which to use
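One way to picture such a record is a structure that keeps assigned and computed identifiers side by side and lets the caller state a preference. The sketch below is purely illustrative and is not the service's actual data model: the class, its fields, and all identifier values are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """Hypothetical record holding assigned and computed identifiers."""
    assigned: dict = field(default_factory=dict)  # e.g. BRN, CASRN, MFCD#
    computed: dict = field(default_factory=dict)  # e.g. InChI, InChIKey, NEMA

    def identifier(self, preferences):
        """Return the first available (kind, value) in the caller's order."""
        for kind in preferences:
            for pool in (self.computed, self.assigned):
                if kind in pool:
                    return kind, pool[kind]
        return None


record = CompoundRecord(
    assigned={"CASRN": "0000-00-0"},                        # placeholder value
    computed={"InChIKey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"},   # placeholder value
)
# The consumer decides: prefer a NEMA key, fall back to the InChI Key.
print(record.identifier(["NEMA", "InChIKey", "CASRN"]))
```

Because computed identifiers exist for every compound while assigned ones are present only "as available", a preference list that ends with a computed key always resolves to something.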