

In-house chemical databases at Imperial Chemical Industries

G W Adamson, J M Bird, G Palmer and W A Warr*

ICI Pharmaceuticals Division, Alderley Park, Macclesfield, Cheshire SK10 4TG, UK

ICI has mounted several online databases for end-user access under its new graphics system, Sapphire. Sapphire uses the Maccs® software for chemical structure registration and searching. This paper concentrates on those databases which use Maccs software for both structure and data handling, rather than on the larger Company Compound Centre database, which will use Maccs in conjunction with a relational database management system.

Keywords: chemical database, relational database management system, chemical structure registration

received 3 February 1986, accepted 10 February 1986

ICI is presently developing a system called Sapphire (Structures and Properties Produced by Helpful Interactive Rapid Enquiry), a user-friendly, interactive system for the storage and rapid retrieval of chemical structures and related property data [1, 2]. The system will have interfaces to other ICI systems such as molecular modelling, reaction design and biological data handling and even, it is hoped eventually, to systems external to ICI for handling information from the scientific literature.

Within Sapphire, chemical structure registration and searching is handled by Maccs software licensed from Molecular Design Limited (MDL) [3]. The authors have also licensed the MDL product Maccslib. This is a library of routines that allows a user to write FORTRAN programs which can access a Maccs database. ICI has also licensed the MDL programs Layout and Dataccs-I. Layout converts Maccs connection tables to structure coordinates. Dataccs incorporates, amongst other things, the function previously provided in a separate program Molrst (‘Molrestore’) to update a Maccs database noninteractively.

Before the advent of Sapphire, chemical information at ICI was handled by the Crossbow system [4-10] with Wiswesser Line Notation (WLN) as the tool for structure representation. In Sapphire, structures are stored as Maccs connection tables and Sema names (Stereochemically Extended Morgan Algorithm). Existing WLN records had to be converted to connection tables and this was carried out using the Daring routines from Fraser-Williams Scientific Systems.
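As a rough illustration of the idea behind Morgan-type naming, the sketch below computes plain extended-connectivity values for the atoms of a connection table. It is not the Sema algorithm itself (no stereochemical extension, no tie-breaking or name generation), and the function name and the adjacency-dictionary representation are assumptions made for this example only.

```python
from typing import Dict, List


def morgan_values(adjacency: Dict[int, List[int]]) -> Dict[int, int]:
    """Plain Morgan extended-connectivity values (illustrative, no stereo)."""
    # Start from the number of neighbours of each atom.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    distinct = len(set(values.values()))
    while True:
        # Each atom's new value is the sum of its neighbours' current values.
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_distinct = len(set(new_values.values()))
        # Stop as soon as an iteration no longer separates more atoms.
        if new_distinct <= distinct:
            return values
        values, distinct = new_values, new_distinct


# Hypothetical carbon skeleton of 2-methylbutane: chain 1-2-3-4, atom 5 on atom 2.
skeleton = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
ec = morgan_values(skeleton)
```

Atoms that are topologically equivalent (here the two methyl groups on atom 2) receive the same value, which is the property a canonical naming scheme builds on.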

*This paper was presented by Dr W A Warr before the Division of Chemical Information, 190th National Meeting of the American Chemical Society, Chicago, September 8-13, 1985

HARDWARE

Sapphire is run on a Vax 11/785. The commonest user terminal is the DEC VT100-plus-Retrographics, known as a DQ650M or VT640 (or an IBM-PC emulating this terminal), but for registration and conversion procedures ICI Pharmaceuticals Division uses two IMLAC terminals. To input large numbers of structures rapidly, a high-resolution terminal with a large screen and almost instantaneous responses to bond deletions, reorientations and so on is advantageous. A good ‘drawer’ can input in excess of 30 average structures per hour.

USAGE

Over 40 terminals have been installed specifically for Sapphire in the Pharmaceuticals Division, and in June 1985, 75 different users actually accessed the system. Over the company as a whole, several hundred users have been trained and the sessions per month number in the thousands. The Sapphire databases at Alderley Park can be accessed from at least seven locations in the UK as well as from ICI Pharma in France. ICI Americas have established a link to the UK database, but also have their own copy of the Maccs software. The main users of Sapphire are Pharmaceuticals, Plant Protection and Organics Divisions.

DATABASES AVAILABLE UNDER SAPPHIRE

The following databases have been converted to Maccs:

a) 340 000 compounds from the Company Compound Centre (CCC) database with no related data at present other than a numerical key.
b) The Cambridge Crystallographic Database and the related data.
c) The Hansch (Pomona College) physical chemistry database of structures and related log P data.
d) An internal ICI Hansch-type database of compounds with measured log P values.
e) A database of compounds available commercially with supplier data and chemical names (CAOCI, the Commercially Available Organic Compounds Index, a misnomer since it also contains inorganics).
f) The structures and locations of chemicals on shelves in laboratories and stores (‘Labstock’).
g) A small number of more specialized databases.

The CCC database has always been the main object of the Sapphire project. This paper, however, discusses some of the ‘lesser’ databases. These have one thing in common which differentiates them from the CCC database: for databases b) to g), both chemical structures and related data are handled by Maccs, whereas for the CCC database a relational database management system will be used for sample and biological data.

Journal of Molecular Graphics, Volume 4 Number 3, September 1986. 0263-7855/86/030165-05 $03.00 © 1986 Butterworth & Co (Publishers) Ltd

CONVERSION FROM WLN

Apart from the Cambridge Crystallographic Database, and one or two minor specialized databases, all the chemical structural data was already available as WLNs at the start of the Sapphire project.

The main programs used in the conversion exercise were Ctgen (written by ICI) and Layout and Dataccs (supplied by MDL). Ctgen calls on the Daring routines to convert a WLN to a connection table. Ctgen also makes the connection table suitable for input to Layout, which generates structure coordinates. Dataccs is used to update a Maccs database. The functions of Ctgen are to:

a) Read WLNs, suffixes and reference numbers from the database to be converted.
b) Use the Daring subroutines to convert the WLNs to connection tables (and check molecular formulae).
c) Adjust the Daring connection table bonds for compounds with 5-bonded nitrogen (e.g. nitro groups).
d) Remove dative (→) bonds from chelates.
e) Remove the Daring special ferrocene bond.
f) Flag as errors compounds with more than 256 non-hydrogen atoms or bonds.
g) Convert the Daring connection table format to Layout connection table format.
h) Produce an SD file suitable for input to Layout.
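Two of the steps above, d) and f), can be sketched in a few lines. This is only an illustration of the kind of filtering described, not the Ctgen code (which is not given in this paper): the `ConnectionTable` class, the `"dative"` bond label and the function names are hypothetical, and the limit in step f) is read here as "more than 256 non-hydrogen atoms, or more than 256 bonds".

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_ATOMS_OR_BONDS = 256  # limit quoted in step f)


@dataclass
class ConnectionTable:
    ref_no: str                              # reference number from the source file
    atoms: List[str]                         # element symbols, e.g. ["C", "N", "O"]
    bonds: List[Tuple[int, int, str]] = field(default_factory=list)  # (atom, atom, kind)


def remove_dative_bonds(ct: ConnectionTable) -> ConnectionTable:
    """Step d): drop bonds labelled as dative (label is an assumption)."""
    kept = [b for b in ct.bonds if b[2] != "dative"]
    return ConnectionTable(ct.ref_no, list(ct.atoms), kept)


def exceeds_limits(ct: ConnectionTable) -> bool:
    """Step f): too many non-hydrogen atoms or too many bonds."""
    heavy_atoms = sum(1 for symbol in ct.atoms if symbol != "H")
    return heavy_atoms > MAX_ATOMS_OR_BONDS or len(ct.bonds) > MAX_ATOMS_OR_BONDS


def convert_batch(tables: List[ConnectionTable]):
    """Return (tables ready for the next step, reference numbers flagged as errors)."""
    converted, errors = [], []
    for ct in tables:
        ct = remove_dative_bonds(ct)
        if exceeds_limits(ct):
            errors.append(ct.ref_no)
        else:
            converted.append(ct)
    return converted, errors


# Invented example records, for illustration only.
chelate = ConnectionTable("A1", ["Fe", "N", "C"], [(0, 1, "dative"), (1, 2, "single")])
too_big = ConnectionTable("A2", ["C"] * 257)
ok, flagged = convert_batch([chelate, too_big])
```

Keeping the flagged reference numbers in a separate list mirrors the text's point that oversize compounds are reported as errors rather than silently dropped.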

After conversion was complete, all Maccs structures were scanned by information scientists. Very approximately, 5% of structures needed parts flipped or rotated. A very much smaller percentage had to be either largely or completely redrawn. Structures in the latter category were checked by a second information scientist. On some databases, compounds which have been drawn manually have a flag set in a special datatype, so that they need not be drawn again when the database is recreated or updated.

Because all structures are scanned and careful checks carried out, and also because the initial WLN databases were highly accurate, it is possible to be confident about the quality of the Sapphire databases.

CAMBRIDGE CRYSTALLOGRAPHIC DATABASE

This database was established in 1965 and is a collection of results of X-ray and neutron diffraction studies of compounds published since 1935 [11]. The producer of the database is the Crystallographic Data Centre of Cambridge University in the UK. The machine-readable version consists of three files containing structures, bibliographies and crystal data. The three files were converted into a single Maccs-type database. The structural data was in the form of connection tables, not WLNs, and a modified form of Ctgen had to be written to convert the Cambridge connection table to Maccs form. Programs were also written to format the data into the required Maccs SD files and datatypes.


All the information relating to a single X-ray determination was stored together under the one structure. Some compounds have been reported in the literature more than once and their structures are duplicated in the database in order to keep the data separate.

The Cambridge database contains approximately 32 000 compounds. The Maccs version used by the authors contains approximately 21 000. At the time of conversion Maccs could not handle atoms with more than 5 connections or compounds with metal π-bonds, or various inorganics and metal complexes. Some of these restrictions have now been removed.
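The figures above mean that roughly two thirds of the Cambridge entries (21 000 of 32 000) were carried over. One of the stated restrictions, atoms with more than 5 connections, is easy to test for on a bond list; the sketch below shows such a check. The function name and the bond-pair representation are assumptions for illustration, and the other restrictions (metal π-bonds, various inorganics and metal complexes) are not modelled.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_CONNECTIONS = 5  # restriction reported at the time of conversion


def over_connected_atoms(bonds: Iterable[Tuple[int, int]],
                         limit: int = MAX_CONNECTIONS) -> List[int]:
    """Return the atoms that take part in more than `limit` bonds."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    return sorted(atom for atom, count in degree.items() if count > limit)


# Hypothetical six-coordinate centre (atom 0) and a four-coordinate one.
six_coordinate = [(0, n) for n in range(1, 7)]
four_coordinate = [(0, n) for n in range(1, 5)]

# Approximate fraction of the Cambridge file present in the Maccs version.
coverage = 21_000 / 32_000
```

A structure for which `over_connected_atoms` returns a non-empty list would have been left out under the restriction as stated.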

A Maccs display of a typical structure is shown in Figure 1 and the related data in Figures 2-6. Note how the full name of the compound is given in datatype 4.

Figure 1. Structure from the Cambridge Crystallographic Database in the Maccs Exec mode

Figure 2. Maccs data display from the Cambridge Crystallographic Database, first screen

Chemists prefer to see structures and data together on the same screen, so some software has been written within ICI to make this possible. Figure 7 shows such a display. The data on the left of the screen can be scrolled alongside the structure to which it relates.

LOG P DATABASE

The experimental log P and pKa values of a large number of chemical compounds have been collected together by Hansch and Leo [12] (at Pomona College, USA). The database is supplied as a set of printable files on magnetic tape. It contains details of about 25 000 experimental values for about 11 000 compounds. The volume of data is currently growing at the rate of about 900 determinations every 6 months.

Figure 3. Maccs data display from the Cambridge Crystallographic Database, second screen

Figure 4. Maccs data display from the Cambridge Crystallographic Database, third screen

Figure 5. Maccs data display from the Cambridge Crystallographic Database, fourth screen

The structures are supplied as WLNs, which ICI converts to Maccs connection tables and structure coordinates. Stereochemistry is added manually to the Maccs database where Pomona College has supplied absolute stereochemistry which can be handled by the Maccs software.

Figure 6. Maccs data display from the Cambridge Crystallographic Database, fifth screen

Figure 7. Sapphire display of structures and data from the Cambridge Crystallographic Database

Again, ICI has written software to assign the data to Maccs datatypes. The log P and pKa values are held exactly as the ‘print’ record on the original tape, together with the solvent used in the determination. Reference and footnote keys for each entry are held alongside, preceded by a letter R or F, respectively. Entries for one compound are sorted into solvent order. Literature references and footnotes are stored in separate datatypes, together with the appropriate R or F number which relates them to the log P and pKa values. A typical structure is shown in Figure 8 and the Maccs data layout for the same compound in Figure 9.
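The layout just described can be pictured with a small data model. The sketch below is an assumption-laden illustration, not the ICI software: the class and field names are hypothetical, the sample values are invented, and "solvent order" is taken to mean alphabetical order of solvent name.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LogPEntry:
    print_record: str   # value text kept exactly as on the tape
    solvent: str        # solvent used in the determination
    keys: List[str]     # reference/footnote keys, e.g. ["R12", "F3"]


def sort_by_solvent(entries: List[LogPEntry]) -> List[LogPEntry]:
    """Order the entries of one compound by solvent name (assumed alphabetical)."""
    return sorted(entries, key=lambda e: e.solvent.lower())


def split_keys(entry: LogPEntry) -> Tuple[List[int], List[int]]:
    """Separate R (reference) numbers from F (footnote) numbers."""
    refs = [int(k[1:]) for k in entry.keys if k.startswith("R")]
    notes = [int(k[1:]) for k in entry.keys if k.startswith("F")]
    return refs, notes


# Invented sample entries for a single compound.
entries = [
    LogPEntry("1.46", "octanol", ["R12", "F3"]),
    LogPEntry("0.89", "chloroform", ["R7"]),
]
ordered = sort_by_solvent(entries)
```

The R and F numbers returned by `split_keys` are what would link an entry to the separately stored literature references and footnotes.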

If the user calls up ICI’s ‘Sapphire Phase II’ module (which accesses Maccs databases) rather than Maccs itself, he may obtain a display such as that shown in Figure 10, where data can be scrolled alongside a structure.

COMMERCIALLY AVAILABLE ORGANIC CHEMICALS INDEX (CAOCI)

Before a chemist starts looking in catalogues for the intermediate he requires, he will check whether a sample is already on site. This he can do using the Labstock database under Maccs. If he must buy in the compound, he need not look through multiple catalogues manually, exercising his chemical nomenclature skills. He can access CAOCI under Maccs. The CAOCI and Labstock databases are similar in structure and usage (although the systems differ considerably) so we will consider only CAOCI here.

Figure 8. Structure from the Hansch log P Database in Maccs Exec mode

Figure 9. Maccs data display from the Hansch log P Database

Figure 10. Sapphire display of structures and data from the Hansch log P Database

Figure 11. Maccs data display from the CAOCI Database

ICI still uses the term CAOCI instead of Fine Chemicals Directory, FCD, under which name Fraser-Williams and MDL market similar products. This is because there are significant differences between the products, and also CAOCI is geared to ICI’s needs. The development of CAOCI has been described by Walker [13]. Converting the WLN-based CAOCI database to one of a Maccs type and formatting the catalogue entries into a display pleasing to the end-user involved considerable planning and programming effort. A typical catalogue-entry datatype display in Maccs is shown in Figure 11. For compounds which have more than one supplier, the catalogue records are given in alphabetical order of supplier name and reference number. There is a spacer record, consisting of a single full stop. This indicates the end of an entry, which may be split across two display screens.
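The ordering and spacer convention can be sketched as follows. This is a minimal illustration under stated assumptions, not the ICI formatting program: the function name, the two-field record and the supplier names are invented, and reference numbers are compared as plain strings.

```python
from typing import List, Tuple

SPACER = "."  # single full stop marking the end of an entry


def format_entry(records: List[Tuple[str, str]]) -> List[str]:
    """Sort (supplier, reference number) records and append the spacer record."""
    # Tuples sort by supplier name first, then by reference number.
    lines = [f"{supplier}  {ref_no}" for supplier, ref_no in sorted(records)]
    lines.append(SPACER)
    return lines


# Invented suppliers and catalogue numbers for one compound.
display = format_entry([
    ("Beta Chem", "B100"),
    ("Alpha Labs", "A2"),
    ("Alpha Labs", "A1"),
])
```

Because the spacer is always the last record of an entry, a reader paging through the display can tell whether an entry continues on the next screen.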

CONCLUSIONS

This paper aims to show the variety of data which end-user chemists can access under Sapphire at ICI. Details of the system design and programming that were needed to produce all these facilities for users may be the subject of another publication.

REFERENCES

1 Warr, W A ‘Maccs - an ICI view’ Proc. 7th Int. Online Inf. Meeting London, UK (1983)

2 Adamson, G W et al. ‘Use of Maccs within ICI’ J. Chem. Inf. Comput. Sci. Vol 25 (1985) pp 90-92

3 Anderson, S ‘Graphical representation of molecules and substructure queries in Maccs’ J. Mol. Graph. Vol 2 (1984) pp 83-90

4 Hyde, E et al. ‘Conversion of Wiswesser notation to a connectivity matrix for organic compounds’ J. Chem. Doc. Vol 7 (1967) pp 200-204

5 Thomson, L H et al. ‘Organic search and display using a connectivity matrix derived from the Wiswesser notation’ J. Chem. Doc. Vol 7 (1967) pp 204-207

6 Hyde, E and Thomson, L H ‘Structure display’ J. Chem. Doc. Vol 8 (1968) pp 138-146

7 Eakin, D R ‘The ICI CROSSBOW system’ in Ash, J E and Hyde, E (eds) Chemical information systems Horwood, UK (1975)

8 Ash, J E ‘Connection tables and their role in a system’ in Ash, J E and Hyde, E (eds) Chemical information systems Horwood, UK (1975)

9 Eakin, D R et al. ‘The use of computers with chemical structural information: ICI Crossbow system’ Pestic. Sci. (1974) pp 319-326

10 Townsley, E E and Warr, W A ‘Chemical and biological data - an integrated online approach’ ACS Symp. Ser. (1978) No 84

11 Allen, F H et al. ‘The Cambridge Crystallographic Data Centre: computer-based search retrieval, analysis and display of information’ Acta Crystallogr. Vol B35 (1979) pp 2331-2339

12 Hansch, C ‘A quantitative approach to biological structure-activity relationships’ Acc. Chem. Res. Vol 2 (1969) pp 232-239

13 Walker, S B ‘Development of CAOCI and its use in ICI Plant Protection Division’ J. Chem. Inf. Comput. Sci. Vol 23 (1983) pp 3-5
