Capturing Chemistry in XML/CML

J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*

ACS March 2004

* Unilever Centre for Molecular Informatics, University of Cambridge

The Agony Of Publication - Loss

The World
Sad
The Scientist
The Lab
Journals
Web Pages

The Vision-1

Human-readable

mp 65-66 C

Machine-readable

<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
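As a rough illustration of the machine-readable form, the following Python sketch builds the same <scalar> element with the standard library. The function name melting_point_scalar is hypothetical, and the namespace declarations a real CML document would need are omitted.

```python
import xml.etree.ElementTree as ET


def melting_point_scalar(min_value, max_value):
    """Build a CML-style <scalar> element for a melting range in Celsius.

    Hypothetical helper: the attribute names and values are copied from the
    slide; namespace handling is left out for brevity.
    """
    return ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary entry for melting point
            "units": "units:c",     # units dictionary entry for Celsius
            "minValue": str(min_value),
            "maxValue": str(max_value),
        },
    )


element = melting_point_scalar(65, 66)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

The human-readable "mp 65-66 C" and the element carry the same information, but only the element names its dictionary entry and units explicitly.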

The Vision-2

•Chemists can carry on doing what they want

But also

•Reuse chemistry
•Archive data
•Ensure validity of data
•Create new sources of data / molecules

Our Approach

•Let chemists use familiar programs …
•…and document templates
•Focus on Journal Articles, Theses, CompChem
•Create data for knowledge-based discovery
•Let computers do the work
•Evolution…

Machine Parsing of Chemistry

Structured (CompChem)
Semi-Structured (Articles)
Unstructured (Discussion)

Structured documents and data in XML

MACHINE PARSING

?

Abstract
Discussion
Experimental

How?

Article (semi-structured)

Add Structure

Parse with Regular Expressions

Legacy to CML converters

Regular Expressions

[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C

Capital or lowercase ‘m’
Maybe ‘.’
Lowercase ‘p’
Any punctuation
Maybe whitespace
0 or more digits
Maybe degrees sign
Capital ‘C’

Melting point: two possible syntaxes

m.p. > 23.5 °C

mp 23.5 – 25 °C
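A minimal Python sketch of this kind of pattern follows. It is an adaptation, not the expression used in OSCAR. Python's built-in re module has no \p{Punct}, so a small punctuation class stands in for it. The upper temperature is made optional and an en dash is accepted so that both example syntaxes match. The names MP_PATTERN and parse_melting_point are hypothetical.

```python
import re

# Adapted melting-point pattern (an assumption, not the slide's exact regex).
MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?p[.:;,]?          # 'mp', 'm.p', 'Mp', then optional punctuation
    \s+                        # whitespace before the value
    (?P<gt>>\s?)?              # optional '>' meaning "greater than"
    (?P<low>\d+(?:\.\d+)?)     # first (or only) temperature
    (?:\s?[-\u2013]\s?(?P<high>\d+(?:\.\d+)?))?   # optional upper bound after a dash
    \s?°?\s?C                  # optional degree sign, then capital C
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return the melting point found in text as a dict, or None if absent."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "greater_than": match.group("gt") is not None,
        "low": float(match.group("low")),
        "high": float(high) if high is not None else None,
    }


for sample in ("m.p. > 23.5 °C", "mp 23.5 – 25 °C"):
    print(sample, "->", parse_melting_point(sample))
```

Named groups keep the extracted values separate from the matching logic, which makes it straightforward to pass the result on to an XML writer.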

CML - XML For Chemistry

•Based on W3C XML Schemas
•300+ components
•Customisable
•Extensible through dictionaries
•Openly available software

J. Chem. Inf. Comp. Sci., 2003, 43, 757

The CML Family

Controlled XML Namespaces:

CMLCore – compounds and properties
CMLReact – reactions
CMLSpect – spectra*
CMLComp – compChem
CMLCryst – crystallography and condensed matter

Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.

+spectra: ANSI/JCAMP
$thermochemistry: NIST

J. Chem. Inf. Comp. Sci., 2003, 43, 757

Case Studies

Parsing output from 750,000 MOPAC jobs

High-throughput parsing of journals

CompChem Logs

Coordinates

Molecular Formula

Calculation Type

Point Group

Dipole

Total Energy

Loss From CompChem

Coordinates
Molecular Formula
Calculation Type
Dipole
Total Energy
Ionisation Potential

CompChemOutput

Parsing Data

Coordinates

Energy Levels

Vibrations

Coordinates

Energy Level

Vibration

CML File

CMLCore

CMLCore

CMLComp

CMLSpect

Input/jobControl General

Parsers

Display Process 1

CompChem Log

Xindice

CML

XSLT

Display Process 2

CML File

CMLCore

CMLCore

CMLComp

CMLSpect

compChemOutput

3D structure, electronic properties

Coordinates

Energy Levels

Vibrations

Input/jobControl XSLT

Display

Normal modes

2D structure, thermodynamic properties

Parsing Data

Dictionary Entry:
The pointgroup of a molecule ...
The Schoenflies convention is normally used, but Hermann Mauguin is also allowed.

D [debye]
ParentSI: c.m
Multiplier: 3.335641E-30
CGS units for electric dipole

Dictionaries

<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />

Linked to CML schema

Accesses CCML namespace

Units dictionary
id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"

id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
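The units entry can be read as a conversion rule. The Python sketch below assumes the rule is SI value = value × multiplierToSI + constantToSI, which the attribute names suggest but the slides do not state. The class UnitEntry, the function scalar_range_to_si and the key "units:debye" are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitEntry:
    """One units-dictionary entry; fields mirror the attributes on the slide."""
    unit_id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value: float) -> float:
        # Assumed rule: SI value = value * multiplierToSI + constantToSI
        return value * self.multiplier_to_si + self.constant_to_si


UNITS = {
    # Values taken from the Celsius entry above
    "units:c": UnitEntry("celsius", "k", 1.0, 273.15),
    # Hypothetical key for the debye entry (multiplier 3.335641E-30, parent C.m)
    "units:debye": UnitEntry("debye", "c.m", 3.335641e-30),
}


def scalar_range_to_si(units_ref, min_value, max_value):
    """Convert a scalar's minValue/maxValue pair to its parent SI unit."""
    unit = UNITS[units_ref]
    return unit.to_si(min_value), unit.to_si(max_value)


low_k, high_k = scalar_range_to_si("units:c", 65, 66)
print(f"melting range: {low_k:.2f}-{high_k:.2f} K")
```

Because the scalar only names its units by reference, the conversion factors live in one place and can be corrected or extended without touching the data.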

OSCAR

Open Source Chemistry Analysis Routines

Sponsored by the Royal Society of Chemistry (Cambridge)

Mounted on http://www.rsc.org/

Article Structure

Article: Front Matter, Abstract, Introduction, Discussion, Experimental, References, Results

Experimental: Synthesis, Set up, Analysis, Compound Name

Information Checked / Extracted

•Chemical name

•Yield

•Boiling / Melting point

•Carbon NMR

•Hydrogen NMR

•Infra Red spectrometry

•Mass spectrometry

•Elemental Analysis

•Optical Rotation

•Refractive Index

•Rf value

•Ultra Violet spectrometry

•Nature (colour, state, modifiers, description, etc.)

OSCAR Parsing Data

H NMR

Nature

HRMS

OSCAR Data Found

Results from one paper

OSCAR Error Checking

Serious Error
Warning Type 1
Warning Type 2

~30 errors / warnings searched for

This article has:
4 errors
2 warnings (type 1)
30 warnings (type 2)

Elemental analysis, incorrect – calculations are for a different molecular formula
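One way such an elemental-analysis check could work is sketched below in Python: recompute the expected percentages from the stated molecular formula and flag any reported "calculated" value that disagrees. This is an illustration under stated assumptions, not OSCAR's actual routine. The function names, the 0.1 tolerance and the restriction to C, H, N and O with bracket-free formulae are all choices made for the example.

```python
import re

# Standard atomic masses in g/mol, rounded; only four elements for brevity.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def calculated_percentages(formula):
    """Return the mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_elemental_analysis(formula, reported_calc, tolerance=0.1):
    """Map each disagreeing element to (reported %, recomputed %).

    An empty dict means the reported values are consistent with the formula.
    """
    expected = calculated_percentages(formula)
    return {
        el: (reported, round(expected.get(el, 0.0), 2))
        for el, reported in reported_calc.items()
        if abs(reported - expected.get(el, 0.0)) > tolerance
    }


# Reported values that fit C6H12O6, then values that belong to C6H6 instead.
print(check_elemental_analysis("C6H12O6", {"C": 40.00, "H": 6.71}))
print(check_elemental_analysis("C6H12O6", {"C": 92.26, "H": 7.74}))
```

A mismatch on every element at once, as in the second call, is the signature of values calculated for a different molecular formula rather than a rounding slip.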

OSCAR Data Presentation

OSCAR Speed

A typical paper contains ca. 20 compounds

JOC (Feb 2004) contains ~600 compounds

OSCAR could extract and tabulate in under 5 minutes

OBC (Feb 2004) contains ~300 compounds

OSCAR could extract and tabulate in under 3 minutes

High throughput, high precision

OSCAR Accuracy

92 % of Data Correctly Identified

3 % incorrect author entry

5 % missed

437 items, ~10,000 data fields in test set, working with current Regular Expressions

False-positives: 3 %

XML-CML Databases

CML

Journals

Theses

CompChem

Xindice

XMLDb can support > 250,000 molecules

Millisecond retrieval on INChI, properties

Capturing Molecules

Encourage chemists to

•Autogenerate IUPAC INChI universal identifier
•Embed MDLMol or Chemdraw files in MSWord
•Autoconvert to CML connection table

•Next phase:
•Parse chemical names into CML using modern NLP+
•Learning-machine rather than rule-based

+Natural Language Processing

NLP & Parsing Names

KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride

Thank You

Unilever
RSC

Jonathan Goodman
Sam Adams
Fraser Norton
Chris Waudby
Yong Zhang