Capturing Chemistry in XML/CML
J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*
Capturing Chemistry in XML/CML
ACS March 2004
* Unilever Centre for Molecular Informatics, University of Cambridge
The Agony Of Publication - Loss
[Diagram labels: The World, Sad, The Scientist, The Lab, Journals, Web Pages]
The Vision-1
Human-readable:
mp 65-66 C
Machine-readable:
<scalar dictRef="ccml:mp"
units="units:c"
minValue="65"
maxValue="66" />
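As a minimal sketch of how the machine-readable form might be produced, the snippet below builds the same scalar element with Python's standard library. The function name melting_range_to_scalar is hypothetical and is not part of any CML toolkit; the attribute names and values are copied from the slide.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_value, max_value):
    """Return a CML-style <scalar> element (as text) for a melting range.

    Hypothetical helper for illustration only; the dictRef and units
    values are the ones shown on the slide.
    """
    scalar = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary entry for melting point
            "units": "units:c",     # degrees Celsius in the units dictionary
            "minValue": min_value,
            "maxValue": max_value,
        },
    )
    return ET.tostring(scalar, encoding="unicode")


# "mp 65-66 C" expressed as a machine-readable element
xml_text = melting_range_to_scalar("65", "66")
print(xml_text)
```

Because the range limits are stored as separate attributes, a program can compare or search on them without re-parsing the printed text.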
The Vision-2
•Chemists can carry on doing what they want
But also
•Reuse chemistry
•Archive data
•Ensure validity of data
•Create new sources of data / molecules
Our Approach
•Let chemists use familiar programs …
•…and document templates
•Focus on Journal Articles, Theses, CompChem
•Create data for knowledge-based discovery
•Let computers do the work
•Evolution…
Machine Parsing of Chemistry
Sources:
•Structured (CompChem)
•Semi-Structured (Articles)
•Unstructured (Discussion)
MACHINE PARSING ?
Structured documents and data in XML
[Diagram labels for the article: Abstract, Discussion, Experimental]
How?
Article (semi-structured)
•Add Structure
•Parse with Regular Expressions
•Legacy to CML converters
Regular Expressions
[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C
Annotations on the expression:
•[Mm]: capital or lowercase ‘m’
•\.?: maybe ‘.’
•p: lowercase ‘p’
•\p{Punct}?: any punctuation
•\s?: maybe whitespace
•\d*: 0 or more digits
•°?: maybe degrees sign
•C: capital ‘C’
Melting point: two possible syntaxes
m.p. > 23.5 °C
mp 23.5 – 25 °C
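The idea can be sketched in Python as follows. This is not the OSCAR expression: Python's re module has no \p{Punct} class, so an explicit character class stands in for it, and the pattern is rewritten (as an assumption) to accept both syntaxes above, with either a hyphen or an en dash in a range. The names MP_PATTERN and parse_melting_point are hypothetical.

```python
import re

# Illustrative pattern only; adjacent string literals are concatenated.
MP_PATTERN = re.compile(
    r"[Mm]\.?p\.?"                                  # 'mp', 'm.p.', 'Mp', ...
    r"[:,;]?\s*"                                    # maybe punctuation, maybe whitespace
    r"(?P<gt>>)?\s*"                                # maybe a 'greater than' sign
    r"(?P<low>\d+(?:\.\d+)?)"                       # first (or only) temperature
    r"(?:\s*[-\u2013]\s*(?P<high>\d+(?:\.\d+)?))?"  # maybe '- upper value'
    r"\s*\u00b0?\s*C"                               # maybe degrees sign, capital 'C'
)


def parse_melting_point(text):
    """Return the melting point found in text as a dict, or None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "low": float(match.group("low")),
        "high": float(high) if high is not None else None,
        "lower_bound_only": match.group("gt") is not None,
    }


for sample in ("m.p. > 23.5 \u00b0C", "mp 23.5 \u2013 25 \u00b0C", "mp 65-66 C"):
    print(sample, "->", parse_melting_point(sample))
```

Named groups keep the extracted values separate from the matching logic, so the result can be written straight into the minValue and maxValue attributes of a scalar element.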
CML - XML For Chemistry
•Based on W3C XML Schemas
•300+ components
•Customisable
•Extensible through dictionaries
•Openly available software
J. Chem. Inf. Comp. Sci., 2003, 43, 757
The CML Family
Controlled XML Namespaces:
•CMLCore – compounds and properties
•CMLReact – reactions
•CMLSpect – spectra*
•CMLComp – compChem
•CMLCryst – crystallography and condensed matter
Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.
+ spectra: ANSI/JCAMP
$ thermochemistry: NIST
J. Chem. Inf. Comp. Sci., 2003, 43, 757
Case Studies
Parsing output from 750,000 MOPAC jobs
High-throughput parsing of journals
CompChem Logs
[Annotated log file. Labels: Coordinates, Molecular Formula, Calculation Type, Point Group, Dipole, Total Energy]
Loss From CompChem
[Diagram labels: Coordinates, Molecular Formula, Calculation Type, Dipole, Total Energy, Ionisation Potential, CompChemOutput]
Parsing Data
[Diagram: Parsers turn the log sections (Input/jobControl, General, Coordinates, Energy Levels, Vibrations) into a CML File using CMLCore, CMLComp and CMLSpect]
Display Process 1
[Diagram labels: CompChem Log, Xindice, CML, XSLT]
Display Process 2
[Diagram: a CML File (CMLCore, CMLComp, CMLSpect) holding compChemOutput (Input/jobControl, Coordinates, Energy Levels, Vibrations) passes through XSLT to Display. Display labels: 3D structure, electronic properties; Normal modes; 2D structure, thermodynamic properties]
Parsing Data
Dictionary Entry: The pointgroup of a molecule ... The Schoenflies convention is normally used, but Hermann Mauguin is also allowed.
D [debye]
ParentSI: c.m
Multiplier: 3.335641E-30
CGS units for electric dipole
Dictionaries
<scalar dictRef="ccml:mp"
units="units:c"
minValue="65"
maxValue="66" />
Linked to CML schema
Accesses CCML namespace
Units dictionary:
id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"
id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
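A units dictionary of this kind lets software convert values to SI without hard-coded rules. The sketch below assumes the two attributes combine as SI value = value × multiplierToSI + constantToSI, which is consistent with the Celsius entry but is not stated on the slides. The function to_si and the dictionaries CELSIUS and DEBYE are hypothetical names; the numbers are taken from the entries shown.

```python
def to_si(value, multiplier_to_si, constant_to_si=0.0):
    """Convert a value to its parent SI unit.

    Assumed convention: SI = value * multiplierToSI + constantToSI.
    """
    return value * multiplier_to_si + constant_to_si


# Entries copied from the slides: Celsius -> kelvin, debye -> C.m
CELSIUS = {"multiplierToSI": 1.0, "constantToSI": 273.15}
DEBYE = {"multiplierToSI": 3.335641e-30, "constantToSI": 0.0}

# Melting range 65-66 C expressed in kelvin
melting_range_k = [
    to_si(t, CELSIUS["multiplierToSI"], CELSIUS["constantToSI"])
    for t in (65.0, 66.0)
]

# A dipole moment of 1 D expressed in coulomb metres
dipole_si = to_si(1.0, DEBYE["multiplierToSI"], DEBYE["constantToSI"])

print(melting_range_k, dipole_si)
```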
OSCAR
Open Source Chemistry Analysis Routines
Sponsored by the Royal Society of Chemistry (Cambridge)
Mounted on http://www.rsc.org/
Article Structure
Article:
•Front Matter
•Abstract
•Introduction
•Discussion
•Experimental
•References
•Results
Experimental:
•Synthesis
•Set up
•Analysis
•Compound Name
Information Checked / Extracted
•Chemical name
•Yield
•Boiling / Melting point
•Carbon NMR
•Hydrogen NMR
•Infra Red spectrometry
•Mass spectrometry
•Elemental Analysis
•Optical Rotation
•Refractive Index
•Rf value
•Ultra Violet spectrometry
•Nature (colour, state, modifiers, description, etc.)
OSCAR Parsing Data
[Screenshot labels: H NMR, Nature, HRMS]
OSCAR Data Found
Results from one paper
OSCAR Error Checking
•Serious Error
•Warning Type 1
•Warning Type 2
~30 errors / warnings searched for
This article has:
•4 errors
•2 warnings (type 1)
•30 warnings (type 2)
Elemental analysis, incorrect – calculations are for a different molecular formula
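A check of this kind can be sketched as follows: recompute the percentage composition from the stated molecular formula and compare it with the calculated values the authors report. This is an illustration under stated assumptions, not OSCAR's implementation. The function names, the rounded atomic masses, the restriction to simple formulae without brackets, and the 0.05 tolerance are all hypothetical choices.

```python
import re

# Rounded standard atomic masses; only a few elements for illustration
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def percent_composition(formula):
    """Return the mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {s: 100.0 * ATOMIC_MASS[s] * n / total for s, n in counts.items()}


def check_elemental_analysis(formula, reported_calc, tolerance=0.05):
    """Flag, per element, whether the reported calculated value fits the formula.

    The tolerance (percentage points) is a hypothetical allowance for rounding.
    """
    expected = percent_composition(formula)
    return {
        element: abs(expected[element] - value) <= tolerance
        for element, value in reported_calc.items()
    }


# Values consistent with C6H12O6 pass; values for another formula fail
ok = check_elemental_analysis("C6H12O6", {"C": 40.00, "H": 6.71})
bad = check_elemental_analysis("C6H12O6", {"C": 54.53, "H": 9.15})
print(ok, bad)
```

An element flagged False means the reported calculation does not fit the stated formula, which is the situation described on the slide.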
OSCAR Data Presentation
OSCAR Speed
A typical paper contains ca. 20 compounds
JOC (Feb 2004) contains ~600 compounds
OSCAR could extract and tabulate in under 5 minutes
OBC (Feb 2004) contains ~300 compounds
OSCAR could extract and tabulate in under 3 minutes
High throughput, high precision
OSCAR Accuracy
92 % of Data Correctly Identified
3 % incorrect author entry
5 % missed
437 items, ~10,000 data fields in test set, working with current Regular Expressions
False-positives: 3 %
XML-CML Databases
[Diagram labels: CML, Journals, Theses, CompChem, Xindice]
XMLDb can support > 250,000 molecules
Millisecond retrieval on INChI, properties
Capturing Molecules
Encourage chemists to
•Autogenerate IUPAC INChI universal identifier
•Embed MDLMol or Chemdraw files in MSWord
•Autoconvert to CML connection table
Next phase:
•Parse chemical names into CML using modern NLP+
•Learning-machine rather than rule-based
+ Natural Language Processing
NLP & Parsing Names
KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
Thank You
Unilever
RSC
Jonathan Goodman
Sam Adams
Fraser Norton
Chris Waudby
Yong Zhang