Page 1: Capturing Chemistry in XML/CML

J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*

ACS March 2004

* Unilever Centre for Molecular Informatics, University of Cambridge

Page 2: The Agony Of Publication - Loss

The World

Page 3: The Agony Of Publication - Loss

[Diagram labels: The World; Sad; The Scientist; The Lab; Journals; Web Pages]

Page 4: The Vision-1

Human-readable:

mp 65-66 C

Machine-readable:

<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
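The mapping from the human-readable melting range to the machine-readable element can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the attribute names and values are copied from the slide, and the element is written without the XML namespace a real CML document would normally declare.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_c: float, max_c: float) -> str:
    """Serialise a melting range in Celsius as a CML-style <scalar> element.

    Hypothetical helper: the attribute names follow the slide; no namespace
    or schema validation is applied here.
    """
    scalar = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",        # dictionary entry for melting point
            "units": "units:c",          # units dictionary entry for Celsius
            "minValue": f"{min_c:g}",    # lower end of the range
            "maxValue": f"{max_c:g}",    # upper end of the range
        },
    )
    return ET.tostring(scalar, encoding="unicode")


if __name__ == "__main__":
    # "mp 65-66 C" from the slide
    print(melting_range_to_scalar(65, 66))
```

Building the element through an XML library rather than string concatenation keeps the attribute values correctly quoted and escaped.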

Page 5: The Vision-2

•Chemists can carry on doing what they want

But also

•Reuse chemistry
•Archive data
•Ensure validity of data
•Create new sources of data / molecules

Page 6: Our Approach

•Let chemists use familiar programs …
•…and document templates
•Focus on Journal Articles, Theses, CompChem
•Create data for knowledge-based discovery
•Let computers do the work
•Evolution…

Page 7: Machine Parsing of Chemistry

Structured (CompChem)

Semi-Structured (Articles)

Unstructured (Discussion)

MACHINE PARSING

Structured documents and data in XML

?

Page 8: How?

Article (semi-structured): Abstract, Discussion, Experimental

Add Structure

Parse with Regular Expressions

Legacy to CML converters

Page 9: Regular Expressions

[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C

[Mm]: Capital or lowercase ‘m’
\.?: Maybe ‘.’
p: Lowercase ‘p’
\p{Punct}: Any punctuation
\s?: Maybe whitespace
\d*: 0 or more digits
°?: Maybe degrees sign
C: Capital ‘C’

Melting point: two possible syntaxes

m.p. > 23.5 °C

mp 23.5 – 25 °C
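A minimal Python sketch of this kind of melting-point pattern follows. It is not the pattern from the slide: as transcribed, that pattern appears to require a hyphen between two numbers, and `\p{Punct}` is not available in Python's standard `re` module. The version below therefore swaps in an ASCII punctuation class, makes the upper bound optional, and accepts either a hyphen or an en dash. The names `MP_PATTERN` and `parse_melting_point` are hypothetical.

```python
import re
import string

# Stand-in for \p{Punct}: any single ASCII punctuation character.
PUNCT = "[" + re.escape(string.punctuation) + "]"
# A number with an optional decimal part, e.g. "25" or "23.5".
NUMBER = r"\d+(?:\.\d+)?"

MP_PATTERN = re.compile(
    r"[Mm]\.?p" + PUNCT + r"?\s+"                    # "mp", "m.p.", "Mp." ...
    r"(?P<gt>>)?\s?"                                 # optional ">" sign
    r"(?P<low>" + NUMBER + r")"                      # first (or only) value
    r"(?:\s?[-–]\s?(?P<high>" + NUMBER + r"))?"      # optional "- upper value"
    r"\s?°?\s?C"                                     # optional degree sign, then "C"
)


def parse_melting_point(text: str):
    """Return the first melting point found in text as a dict, or None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "greater_than": match.group("gt") is not None,
        "low": float(match.group("low")),
        "high": float(high) if high is not None else None,
    }


if __name__ == "__main__":
    for sample in ("m.p. > 23.5 °C", "mp 23.5 – 25 °C", "mp 65-66 C"):
        print(sample, "->", parse_melting_point(sample))
```

Named groups keep the extracted values separate from the surrounding syntax, so the result can be passed on to whatever code writes the `minValue` and `maxValue` attributes.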

Page 10: CML - XML For Chemistry

•Based on W3C XML Schemas
•300+ components
•Customisable
•Extensible through dictionaries
•Openly available software

J. Chem. Inf. Comp. Sci., 2003, 43, 757

Page 11: The CML Family

Controlled XML Namespaces:

CMLCore – compounds and properties
CMLReact – reactions
CMLSpect – spectra*
CMLComp – compChem
CMLCryst – crystallography and condensed matter

Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.

+ spectra: ANSI/JCAMP
$ thermochemistry: NIST

J. Chem. Inf. Comp. Sci., 2003, 43, 757

Page 12: Case Studies

Parsing output from 750,000 MOPAC jobs

High-throughput parsing of journals

Page 13: CompChem Logs

[Screenshot labels: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy]

Page 14: Loss From CompChem

[Screenshot labels: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential]

Page 16: Parsing Data

[Diagram labels: CompChem Output; Coordinates, Energy Levels, Vibrations; Coordinates, Energy Level, Vibration; CML File; CMLCore, CMLCore, CMLComp, CMLSpect; Input/jobControl; General; Parsers]

Page 17: Display Process 1

[Diagram labels: CompChem Log; Xindice; CML; XSLT]

Page 18: Display Process 2

[Diagram labels: CML File; CMLCore, CMLCore, CMLComp, CMLSpect; compChemOutput; 3D structure, electronic properties; Coordinates, Energy Levels, Vibrations; Input/jobControl; XSLT; Display; Normal modes; 2D structure, thermodynamic properties]

Page 19: Parsing Data

Dictionary Entry: The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann Mauguin is also allowed.

D [debye]
ParentSI: c.m
Multiplier: 3.335641E-30
CGS units for electric dipole

Page 20: Dictionaries

<scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />

Linked to CML schema

Accesses CCML namespace

Units dictionary:
id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"

id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
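One plausible reading of the units dictionary entry is a linear conversion to the parent SI unit: SI value = value × multiplierToSI + constantToSI. The sketch below assumes that reading. The `Unit` class, the `UNITS` table and the `units:debye` key are hypothetical names used for illustration; only the numeric values for Celsius and debye come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One units-dictionary entry (field names adapted from the slide)."""
    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value: float) -> float:
        # Assumed rule: scale first, then add the constant offset.
        return value * self.multiplier_to_si + self.constant_to_si


# Hypothetical in-memory dictionary keyed by the reference used in `units=`.
UNITS = {
    "units:c": Unit("celsius", "k", 1.0, 273.15),
    "units:debye": Unit("debye", "c.m", 3.335641e-30),
}


def scalar_range_to_si(units_ref: str, min_value: float, max_value: float):
    """Convert a (min, max) range to the parent SI unit of `units_ref`."""
    unit = UNITS[units_ref]
    return unit.parent_si, unit.to_si(min_value), unit.to_si(max_value)


if __name__ == "__main__":
    # The melting range 65-66 C from the <scalar> element, expressed in kelvin.
    parent, low, high = scalar_range_to_si("units:c", 65, 66)
    print(parent, round(low, 2), round(high, 2))
```

Keeping the conversion factors in the dictionary rather than in the parser means a new unit only needs a new entry, not new code.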

Page 21: OSCAR

Open Source Chemistry Analysis Routines

Sponsored by the Royal Society of Chemistry (Cambridge)

Mounted on http://www.rsc.org/

Page 22: Article Structure

Article: Front Matter, Abstract, Introduction, Results, Discussion, Experimental, References

Page 27: Article Structure

Article: Front Matter, Abstract, Introduction, Results, Discussion, Experimental, References

Experimental: Synthesis, Set up, Analysis, Compound Name

Page 28: Information Checked / Extracted

•Chemical name
•Yield
•Boiling / Melting point
•Carbon NMR
•Hydrogen NMR
•Infra Red spectrometry
•Mass spectrometry
•Elemental Analysis
•Optical Rotation
•Refractive Index
•Rf value
•Ultra Violet spectrometry
•Nature (colour, state, modifiers, description, etc.)

Page 29: OSCAR Parsing Data

[Screenshot labels: H NMR; Nature; HRMS]

Page 30: OSCAR Parsing Data

Page 31: OSCAR Data Found

Results from one paper

Page 32: OSCAR Error Checking

Serious Error

Warning Type 1

Warning Type 2

Page 33: OSCAR Error Checking

~30 errors / warnings searched for

This article has:
4 errors
2 warnings (type 1)
30 warnings (type 2)

Elemental analysis incorrect – calculations are for a different molecular formula
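A check of this kind can be sketched as follows: compute the mass percentages expected for the stated molecular formula and flag any reported "calculated" value that disagrees. This is an illustration, not OSCAR's own routine. The function names, the small table of atomic masses, the flat-formula parser (no brackets or charges) and the 0.05 percentage-point tolerance are all assumptions.

```python
import re

# Standard atomic masses for a few common elements (assumed subset).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula: str) -> dict:
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def percent_composition(formula: str) -> dict:
    """Mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}


def check_elemental_analysis(formula: str, reported_calc: dict,
                             tolerance: float = 0.05) -> list:
    """Return the elements whose reported 'calculated' percentage does not
    match the stated formula within `tolerance` percentage points."""
    expected = percent_composition(formula)
    return [
        element
        for element, value in reported_calc.items()
        if abs(expected.get(element, 0.0) - value) > tolerance
    ]


if __name__ == "__main__":
    reported = {"C": 40.00, "H": 6.71}
    # Consistent with C6H12O6, inconsistent with C6H12O5.
    print(check_elemental_analysis("C6H12O6", reported))
    print(check_elemental_analysis("C6H12O5", reported))
```

An empty list means the reported values agree with the stated formula; a non-empty list is the sort of result that would be surfaced as an error for the author to review.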

Page 34: OSCAR Data Presentation

Page 35: OSCAR Speed

A typical paper contains ca. 20 compounds

JOC (Feb 2004) contains ~600 compounds

OSCAR could extract and tabulate in under 5 minutes

OBC (Feb 2004) contains ~300 compounds

OSCAR could extract and tabulate in under 3 minutes

High throughput, high precision

Page 36: OSCAR Accuracy

92 % of Data Correctly Identified

3 % incorrect author entry

5 % missed

437 items, ~10,000 data fields in test set, working with current Regular Expressions

False-positives: 3 %

Page 37: XML-CML Databases

[Diagram labels: CML; Journals; Theses; CompChem; Xindice]

XMLDb can support > 250,000 molecules

Millisecond retrieval on INChI, properties

Page 38: Capturing Molecules

Encourage chemists to

•Autogenerate IUPAC INChI universal identifier
•Embed MDLMol or Chemdraw files in MSWord
•Autoconvert to CML connection table

•Next phase:
•Parse chemical names into CML using modern NLP+
•Learning-machine rather than rule-based

+ Natural Language Processing

Page 39: NLP & Parsing Names

KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride

Page 40: Thank You

Unilever
RSC

Jonathan Goodman
Sam Adams
Fraser Norton
Chris Waudby
Yong Zhang