1 Chemical Structure Representation and Search Systems
Lecture 2. Oct 30, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2 Lecture 2: Topics to be Covered
Problems for chemical structure representation
• aromaticity
• tautomerism
• multi-centre bonds
• stereochemistry
• organometallics and inorganics
• macromolecules and polymers
• incompletely-defined substances
o Markush structures
4 Structure diagrams and topological graphs
useful analogy, but not a perfect one
• identical graphs should mean identical molecules
• different graphs should mean different molecules
realities of chemical structures cause problems
5 Aromaticity
electronic property of certain ring systems, giving enhanced chemical stability
bonds in aromatic rings have properties that are distinct from those of single and double bonds
generally accepted definition is the Hückel rule
• 4n+2 pi-electrons (n is a small non-negative integer)
there are borderline cases
aromaticity causes problems for computer representation
• different systems deal with it in different ways
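The 4n+2 count in the Hückel rule can be checked with a few lines of Python. This is only a minimal sketch of the electron-count test; the function name is made up for illustration, and it says nothing about planarity, conjugation or the borderline cases mentioned above.

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True if the count can be written as 4n + 2 with n = 0, 1, 2, ..."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# 6 and 10 pi electrons fit the rule; 8 does not.
for count in (6, 8, 10):
    print(count, satisfies_huckel(count))
```

A real system would still have to decide which electrons to count, which is where the different conventions come in.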
6 Aromaticity problems
using single and double bonds can give different topological graphs for the same compound
one solution is to use an aromatic bond type
[Figure: alternative structure diagrams of a dibromobenzene]
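The problem can be illustrated with a small Python sketch, assuming 1,2-dibromobenzene as the compound and a hypothetical atom numbering (ring carbons 0 to 5, bromines 6 and 7). The two alternating-bond forms give different edge labels, while relabelling the ring bonds as "aromatic" makes them identical. The `aromatise` helper simply marks every ring bond; it does not perceive aromaticity.

```python
RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
SUBSTITUENTS = [(0, 6), (1, 7)]  # assumed numbering: C0-Br6 and C1-Br7


def kekule(double_bonds):
    """Map each bond (as a frozenset of atom numbers) to its bond order."""
    bonds = {frozenset(b): (2 if b in double_bonds else 1) for b in RING}
    bonds.update({frozenset(b): 1 for b in SUBSTITUENTS})
    return bonds


def aromatise(bonds):
    """Relabel every ring bond as 'aromatic' (no perception, just relabelling)."""
    ring = {frozenset(b) for b in RING}
    return {b: ("aromatic" if b in ring else order) for b, order in bonds.items()}


form_a = kekule({(0, 1), (2, 3), (4, 5)})  # double bond between the C-Br carbons
form_b = kekule({(1, 2), (3, 4), (5, 0)})  # single bond between the C-Br carbons

print(form_a == form_b)                        # False: different graphs
print(aromatise(form_a) == aromatise(form_b))  # True: one graph
```

The two forms differ in the bond joining the bromine-bearing carbons, so the difference is not just an artefact of atom numbering.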
7 Alternating bonds and aromaticity
Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds
• this includes some systems that are not aromatic (8 ≠ 4n+2)
• and omits some that are aromatic
[Figure: ring structures, one containing sulfur]
8 Representing aromaticity
some systems represent aromaticity as an atom property
• SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds)
o s1cccc1 (aromatic atoms) or S1C=CC=C1 (alternating bonds)
o Brc1c(Br)cccc1 (aromatic atoms) or BrC1=C(Br)C=CC=C1 (alternating bonds)
problem is that aromaticity is really a ring property
9 Aromaticity: problem areas
Aromaticity is sometimes a matter of degree or opinion
Aromatic envelope rings
• outer ring has 10 = 4n+2 pi electrons
• fusion bond is not aromatic
Exocyclic bonds
• right ring has 6 pi electrons (2 from the unshared pair (usp), 2 from the bond in the ring, 2 from bonds in the left ring and 0 from the exocyclic bond to O)
[Figure: ring systems with exocyclic O atoms and unshared electron pairs shown as dots]
10 Tautomerism
dynamic equilibrium between positional isomers (labile H)
are they different compounds?
• answer depends on what you want to do with them
can use normalised bonds to represent them by a single graph
• gets mixed up with ring alternating bonds
• some tautomers may be aromatic, when others are not
[Figure: tautomeric ring structures (atom labels NH, O, N, OH)]
11 Tautomerism
tautomerism is a matter of degree
tautomers can be defined in different ways
H–Q–X=R ⇌ Q=X–R–H
• only certain elements can be Q, X or R
o keto-enol tautomers are not recognised by Chemical Abstracts
o mono-unsaturated carbon chains are not distinguished by Daylight
[Figure: example structures in OH and C=O forms]
12 Structure conventions
sometimes called “business rules”
• some chemical groups can be shown in different but equally valid ways
• conventions are needed to determine which is preferred
• software may be needed to convert to preferred form
[Figure: a nitro group drawn in two ways, with neutral N and with N+]
13 Structure conventions
Getting the structure representation “right” can be very important
• automatic property prediction
o wrong tautomeric form can give poor prediction of solubility, acid dissociation constant etc.
• receptor site docking
o molecular modelling programs “dock” small molecules into protein receptor sites, and calculate a score based on hydrogen-bond interactions, charges etc.
o wrong ionisation state / tautomer can give misleading results
14 Multi-centre bonds
sometimes bonds involve more than 2 atoms
• graph edges always involve exactly 2
e.g. ferrocene
most systems fudge this sort of structure
• bond to arbitrary carbon
• bonds to all 5 carbons
• bond to dummy atom placed in ring
o which itself has dummy bonds to ring atoms
[Figure: ferrocene, with Fe between two rings]
15 Stereochemistry
different compounds with identical connectivity
same topology, different topography
[Figure: S-tyrosine and R-tyrosine]
16 Stereochemistry
configuration is often unknown
• or partially known (relative stereochemistry)
• or you may have a mixture of stereoisomers
o in which one isomer may occur in enantiomeric excess
many different descriptors used by chemists
• wedge (up) and hatched (down) bonds in structure diagrams
• Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z)
• text-based descriptors (stereoparent, or optical rotation)
17 Stereochemistry: up/down bonds
can be used as additional “colours” for graph edges
• many connection table formats have special codes for up and down bonds
• need to know which end of bond is which
useful for re-generating diagrams for display
can be used to calculate other stereo descriptors
[Figure: two structure diagrams with up/down bonds (atom labels OH, CH2NH2, O, OH)]
18 Up/down bond problems
different patterns of up/down bonds can show the same stereoisomer
• different graphs, same molecule
some patterns of up and down bonds actually convey no useful information about configuration
[Figure: structure diagrams with up/down bonds (atom labels OH, CH2NH2, O, OH; Cl, F, CH3, CH2, CH3)]
19 Stereochemistry: CIP designators
R. S. Cahn, C. Ingold and V. Prelog, Angewandte Chemie Intl. Ed. in English 1966, 5, 385-415
one-letter designator for stereocentres
• based on rules assigning priorities to groups around it
• tetrahedral carbons (R, S)
• double bonds (E, Z)
additional colours for graph nodes or edges
• useful for distinguishing stereoisomers when absolute configuration is known
• less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed
20 Stereochemistry: ordered “stereovertex” lists
define order of neighbours around stereocentre
• there are two sets of equivalent orders, corresponding to the two configurations of a tetrahedral carbon atom
neighbours are listed around a right-handed spiral
[Figure: the two configurations of a tetrahedral centre with neighbours A, B, C, D]
first configuration, equivalent orders:
A B C D, A D B C, A C D B, B C A D, B D C A, B A D C, C A B D, C D A B, C B D A, D A C B, D B A C, D C B A
second configuration, equivalent orders:
A D C B, A C B D, A B D C, B A C D, B D A C, B C D A, C B A D, C D B A, C A D B, D A B C, D C A B, D B C A
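The two sets of twelve orders are the even and odd permutations of the four neighbours. The Python sketch below checks this against the first set as listed on the slide; the helper name and the inversion-count test are illustrative choices, not something taken from the lecture.

```python
from itertools import permutations


def is_even(order, reference="ABCD"):
    """True if `order` is an even permutation of `reference` (inversion count)."""
    ranks = [reference.index(x) for x in order]
    inversions = sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )
    return inversions % 2 == 0


# First set of equivalent orders, copied from the slide.
SLIDE_SET_1 = "ABCD ADBC ACDB BCAD BDCA BADC CABD CDAB CBDA DACB DBAC DCBA".split()

all_orders = {"".join(p) for p in permutations("ABCD")}
even_orders = {o for o in all_orders if is_even(o)}
odd_orders = all_orders - even_orders

print(even_orders == set(SLIDE_SET_1))  # True
print(len(odd_orders))                  # 12
```

Swapping any two neighbours therefore moves an ordering from one set to the other, which is the property stereo-aware software relies on when comparing orderings.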
21 Stereochemistry: stereovertex lists
Two alternative approaches:
1. Geometric ordering
List neighbours of stereo centre in a predefined order for the geometry (e.g. right-handed spiral)
Advantages:
• ordering is locally-defined (rest of molecule is irrelevant)
• stereocentre need not be a single atom
Disadvantage:
• equivalent orderings need to be defined
22 Stereochemistry: stereovertex lists
2. Parity value
• most commonly used approach in practice
• list neighbours according to an ordering rule
o atom numbers in connection table
o CIP priority rules
• decide which geometry they conform to
o right-handed (clockwise) or left-handed (anti-clockwise) spiral
• record this as parity value on stereocentre
• CIP R and S designators are an example of this
• potential disadvantage:
o ordering rule may be globally defined (rest of molecule is relevant)
23 Stereochemistry: parity values
MDL formats:
• number atoms around stereo centre with 1, 2, 3, and 4 in order of increasing connection table atom number
o “implicit” hydrogen atom is considered to be atom 4
• view stereo centre so that the bond to atom 4 projects behind the plane formed by atoms 1, 2, and 3
• if numbers increase:
o clockwise: parity value is 1
o anti-clockwise: parity value is 2
• parity value stored at node for stereo centre atom
o parity 0 = not stereo
o parity 3 = unknown stereo
[Figure: two stereo centres with neighbours numbered 1 to 4, labelled Parity 2 and Parity 1]
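The viewing rule above can be turned into a calculation when 3D coordinates are available. The Python sketch below is one possible way to do it, not the procedure used by any MDL software: it takes the four neighbour positions in increasing atom-number order and uses the sign of a triple product to decide whether atoms 1, 2, 3 run clockwise or anti-clockwise when atom 4 points away from the viewer. The function name and the sample coordinates are assumptions for illustration.

```python
def mdl_style_parity(p1, p2, p3, p4):
    """Return 1 (clockwise) or 2 (anti-clockwise) for neighbours 1..4.

    Each argument is an (x, y, z) tuple. Atom 4 is taken to point away
    from the viewer, as in the rule on the slide.
    """
    a = [p1[i] - p4[i] for i in range(3)]
    b = [p2[i] - p4[i] for i in range(3)]
    c = [p3[i] - p4[i] for i in range(3)]
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    volume = sum(a[i] * cross[i] for i in range(3))
    if abs(volume) < 1e-9:
        raise ValueError("neighbours are coplanar; no configuration defined")
    # Positive volume: 1 -> 2 -> 3 runs anti-clockwise seen from the side
    # opposite atom 4.
    return 2 if volume > 0 else 1


# Viewer on the +z axis, atom 4 pointing down the -z axis.
anticlockwise = mdl_style_parity(
    (1.0, 0.0, 0.0), (-0.5, 0.866, 0.0), (-0.5, -0.866, 0.0), (0.0, 0.0, -1.0)
)
clockwise = mdl_style_parity(
    (1.0, 0.0, 0.0), (-0.5, -0.866, 0.0), (-0.5, 0.866, 0.0), (0.0, 0.0, -1.0)
)
print(anticlockwise, clockwise)  # 2 1
```

Exchanging any two neighbours flips the sign of the volume and hence the parity value.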
24 Stereochemistry: parity value
Stereochemistry in SMILES
• clockwise/anticlockwise approach, like MDL
• atoms are numbered according to sequence of atoms in SMILES
• view from first atom (instead of toward last atom as in MDL)
o if other three atoms are anticlockwise – use @
o if other three atoms are clockwise – use @@
OC(=O)[C@H](N)CC1=CC=C(O)C=C1
OC(=O)[C@@H](N)CC1=CC=C(O)C=C1
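Because the @ / @@ mark depends on the order in which the neighbours appear in the string, rewriting a SMILES with the neighbours in a different order may require the mark to change. A minimal Python sketch of that bookkeeping follows; the function names and neighbour labels are hypothetical, and the implicit hydrogen of [C@H] is assumed to count as the neighbour immediately after the preceding atom.

```python
def permutation_is_odd(old_order, new_order):
    """True if new_order is an odd permutation of old_order (unique labels)."""
    ranks = [old_order.index(label) for label in new_order]
    inversions = sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )
    return inversions % 2 == 1


def mark_for_new_order(mark, old_order, new_order):
    """Return the chirality mark that keeps the same configuration."""
    if sorted(old_order) != sorted(new_order):
        raise ValueError("both orders must list the same neighbours")
    if permutation_is_odd(old_order, new_order):
        return "@@" if mark == "@" else "@"
    return mark


# Neighbour order at the stereocentre of OC(=O)[C@H](N)CC1=CC=C(O)C=C1
written = ["COOH", "H", "N", "CH2"]
# The same centre if the string were rewritten to start from the nitrogen
from_nitrogen = ["N", "H", "CH2", "COOH"]
print(mark_for_new_order("@", written, from_nitrogen))
```

Here the reordering is an even permutation, so the mark is unchanged; exchanging just two neighbours would switch @ to @@.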
25 Double bond stereochemistry
depiction of double bonds in a structure diagram usually implies either cis or trans configuration
MDL files use bond stereo code to indicate
• 0: use 2D atom co-ordinates to determine cis/trans
• 3: double bond stereochemistry not specified
(other code values are used for up/down/either single bonds)
[Figure: two stereoisomers of a double bond carrying Cl, F, Br and I substituents]
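When the configuration has to be read from 2D co-ordinates, one simple approach is to test on which side of the double-bond axis each substituent lies. The Python sketch below does this with a 2D cross product; the function names and co-ordinates are invented for illustration and this is not presented as the algorithm any particular package uses.

```python
def cross_z(u, v):
    """z component of the cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]


def cis_or_trans(sub1, c1, c2, sub2):
    """Relation between sub1 (on c1) and sub2 (on c2) across the bond c1=c2.

    All arguments are (x, y) tuples taken from the structure diagram.
    """
    axis = (c2[0] - c1[0], c2[1] - c1[1])
    side1 = cross_z(axis, (sub1[0] - c1[0], sub1[1] - c1[1]))
    side2 = cross_z(axis, (sub2[0] - c2[0], sub2[1] - c2[1]))
    if side1 == 0 or side2 == 0:
        raise ValueError("a substituent lies on the double-bond axis")
    return "cis" if side1 * side2 > 0 else "trans"


# Double bond drawn horizontally from (0, 0) to (1, 0).
print(cis_or_trans((-0.5, 0.87), (0.0, 0.0), (1.0, 0.0), (1.5, 0.87)))   # cis
print(cis_or_trans((-0.5, 0.87), (0.0, 0.0), (1.0, 0.0), (1.5, -0.87)))  # trans
```

A diagram drawn with a substituent exactly in line with the double bond carries no cis/trans information, which is why the sketch raises an error in that case.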
26 Double bond stereo in SMILES
/ and \ used as “directional” single bonds
• only meaningful when used on both atoms of a double bond
• several ways of showing same configuration
[Figure: the two stereoisomers of the double bond carrying Cl, F, Br and I substituents]
• Cl/C(F)=C(Br)/I and Cl\C(F)=C(Br)\I show one isomer
• Cl\C(F)=C(Br)/I and Cl/C(F)=C(Br)\I show the other
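For strings of the simple form used on this slide, where the first directional bond is written before the first double-bond atom and the second is written after the second double-bond atom, matching marks put the two end groups on opposite sides and differing marks put them on the same side. The Python sketch below applies only that narrow rule; it is not a SMILES parser, and the function names are hypothetical.

```python
def direction_marks(smiles):
    """Collect the / and \\ characters in the order they are written."""
    return [ch for ch in smiles if ch in "/\\"]


def end_group_relation(smiles):
    """Relation between the first and last atoms of X/C(..)=C(..)/Y style strings."""
    marks = direction_marks(smiles)
    if len(marks) != 2:
        raise ValueError("expected exactly two directional bonds")
    return "trans" if marks[0] == marks[1] else "cis"


SLIDE_EXAMPLES = [
    "Cl/C(F)=C(Br)/I",
    "Cl\\C(F)=C(Br)\\I",
    "Cl\\C(F)=C(Br)/I",
    "Cl/C(F)=C(Br)\\I",
]
for smiles in SLIDE_EXAMPLES:
    print(smiles, end_group_relation(smiles))
```

The first two strings place Cl and I trans to each other and the last two place them cis, so the four strings describe only two isomers.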
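27 Stereovertex lists for double bonds
neighbours of stereocentre have rectangular geometry
neighbours are listed around a right-handed spiral (clockwise)
[Figure: the two rectangular arrangements of neighbours A, B, C, D]
first arrangement, equivalent orders:
A B C D, B C D A, C D A B, D A B C
second arrangement, equivalent orders:
A C B D, B D A C, C B D A, D A C B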
28 Other stereochemistry geometries
Many coordination complexes have other stereochemical geometries, e.g. square pyramid, trigonal bipyramid, octahedron
there are special SMILES rules for these
specification of equivalent geometric orderings defines symmetry properties of each geometry
[Figure: numbered vertex diagrams for the square pyramid (vertices 1 to 5), octahedron (vertices 1 to 6) and trigonal bipyramid (vertices 1 to 5)]
29 Stereochemistry of biphenyls
some stereoisomers occur because of sterically-hindered rotation of a single bond
o stereocentre is C–C bond here
geometric ordering of neighbours of stereocentre can specify configuration
[Figure: a biphenyl with Cl, Br, OH and CH3 substituents numbered 1 to 4, the ordering 3 1 4 2, and the anti-rectangle geometry]
30 Allene stereochemistry
anti-rectangle geometry also applies to allene configuration
• stereocentre is C=C=C group
[Figure: an allene with Br, I, F and Cl substituents, and the anti-rectangle geometry with vertices numbered 1 to 4]
31 Stereochemistry: conclusions
Many different systems in use
Interconversions between different representations not always easy
• e.g. wedge bonds → CIP descriptors
Several problems remain
• incomplete/partially-defined stereochemistry
• “knotted” structures, helices etc.
B. Rohde, “Representation and manipulation of stereochemistry”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 206-230. Wiley, 2003
32 Other representation complications
Organometallic and co-ordination compounds
• complex stereochemistry
• special bond types may be needed (dative bonds etc.)
• ambiguity over covalent/ionic character of bonds
o “business rules” usually needed
Inorganic compounds
• topological representation often not possible
• composition may not involve integral ratios between elements
33 Macromolecules
in principle can represent all atoms, as for small molecules
some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)
[Figure: peptide chains drawn with amino-acid shortcuts (Asp, His, Val, Cys, Gly, Ala, Arg, Trp, Tyr, Pro) and terminal OH groups]
34 Macromolecules
Each shortcut is defined with appropriate attachment points
ordinary atoms can be mixed with shortcuts
system can expand shortcuts when needed
[Figure: the Tyr shortcut and its full structure, with attachment points marked *]
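Shortcut expansion can be sketched in Python for a linear peptide. The table below is a hypothetical illustration covering three residues, written as SMILES fragments without stereochemistry; each fragment starts at its N attachment point and ends at its C=O attachment point, so fragments can simply be concatenated and the chain closed with an ordinary O atom.

```python
# Hypothetical shortcut table: SMILES fragments with stereochemistry omitted.
RESIDUE_SMILES = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Val": "NC(C(C)C)C(=O)",
}


def expand_peptide(shortcuts):
    """Expand a list of residue shortcuts into one SMILES string.

    The final "O" is an ordinary atom (the terminal OH) mixed with shortcuts.
    """
    try:
        fragments = [RESIDUE_SMILES[name] for name in shortcuts]
    except KeyError as exc:
        raise ValueError(f"no definition for shortcut {exc.args[0]!r}") from exc
    return "".join(fragments) + "O"


print(expand_peptide(["Gly", "Ala"]))  # NCC(=O)NC(C)C(=O)O
print(expand_peptide(["Val", "Gly", "Ala"]))
```

A production system would store the fragments as connection tables with explicit attachment points and stereochemistry rather than as strings.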
35 Polymers
special problems are presented because properties of polymer can be affected by polymerisation conditions
• average number of subunits
• extent of cross-linking
• ratio between different subunits
• random / block sequences of subunits
• etc.
Two main approaches
• monomer representation
• structural repeating unit (SRU) representation
36 Polymers
Monomer-based representation
• show original monomer(s) and describe polymerisation conditions in text notes
SRU-based representation
• show repeating units (as shortcuts), with details of length etc.
• generally more satisfactory for structure search
• complications when composition is incompletely defined
37 Incompletely-defined substances
unknown stereochemistry
unknown attachment position
unknown repetition
[Figure: example structures (labels OH, n, NH2, Cl)]
38 Markush (“Generic”) structures
• structures with R-groups
• shorthand for describing sets of structures with common features
[Figure: a core structure bearing OH, R1 and R2, with R1 = Br, I or Cl and R2 = one of several chains built from CH2 and CH3 groups; attachment points marked *]
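The set of specific structures covered by a Markush structure can be enumerated by substituting every combination of R-group values into the core. The Python sketch below uses SMILES strings; the core template and the R2 values (ethyl, propyl, butyl) are assumptions made for illustration, since the slide figure is not fully recoverable, while the R1 values are the three halogens shown on the slide.

```python
from itertools import product

# Hypothetical core: a phenol with two substitution sites marked {R1} and {R2}.
CORE = "Oc1ccc({R1})cc1{R2}"
R_GROUPS = {
    "R1": ["Br", "I", "Cl"],      # halogens listed on the slide
    "R2": ["CC", "CCC", "CCCC"],  # assumed alkyl chains
}


def enumerate_markush(core, r_groups):
    """Yield one SMILES string per combination of R-group values."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield core.format(**dict(zip(names, choice)))


members = list(enumerate_markush(CORE, R_GROUPS))
print(len(members))  # 9
print(members[0])    # Oc1ccc(Br)cc1CC
```

With three values for each of two R-groups the structure covers nine compounds; the count grows multiplicatively with each extra R-group, which is why patents and combinatorial libraries are rarely stored in enumerated form.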
39 Markush structures
• also called “generic” structures
• very important in chemical patents
o inventor claims whole class of related compounds
• can be used to describe combinatorial libraries
• can be used as queries in database searches
• will be discussed in more detail in lecture 5 (Nov 13)
40 Conclusions from Lecture 2
analogy between chemical structures and topological graphs is not perfect and many problems arise in situations where it breaks down
• aromaticity and tautomerism
• stereochemistry
additional complications arise in representing some classes of molecule
• inorganic and coordination compounds
• macromolecules and polymers
• incompletely-defined substances