The collection, curation and modeling of Open Melting Point measurements
August 26, 2011
5th Meeting on U.S. Government Chemical Databases and Open Chemistry
Jean-Claude Bradley
Department of Chemistry, Drexel University
Andrew Lang
Department of Mathematics, Oral Roberts University
Antony Williams
ChemSpider, Royal Society of Chemistry
The Problem of Data Quality in Chemistry
• Lack of provenance
• Reliance on a system of “trusted sources”
  • CRC Handbook
  • Merck Index
  • Chemical Vendor Catalogs (e.g. Sigma-Aldrich)
  • Peer-Reviewed Journals
In the case of melting points:
Strategy for the curation of melting points
Using technology, we can begin to replace the “trusted source”
model with one based on transparency and provenance
1. Rely on redundancy when possible
2. Provide the maximum level of provenance when necessary (Open Notebook Science)
3. Adhere to Open Data, Open Descriptors and Open Algorithms for measurements and modeling
The Chemical Information Validation Sheet
567 curated and referenced measurements from Fall 2010 Chemical Information Retrieval course
EPA/PHYSPROP Structure Errors (Incorrect Valence): 2315 out of 43543 entries contained pentavalent nitrogens
EPA/PHYSPROP Errors: Structure displayed is for the neutral compound dopamine but the associated CAS
Number and chemical name in the file are for the hydrobromide salt.
Common errors in datasets
1. Multiple melting points for the same compound in the same database
2. Stereochemistry issues
3. Sign inversion
4. Conversion errors (Kelvin/Celsius, Fahrenheit/Celsius)
5. Bad SMILES (non-rendering)
6. Salts associated with SMILES for free base
7. Using boiling point for melting point
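Items 3 and 4 in the list above leave a recognizable numerical signature when two sources disagree, so they can be screened for automatically. The following is a minimal sketch of such a screen, not the procedure used by the authors. The function name `classify_discrepancy` and the 0.5 degree tolerance are assumptions chosen for illustration.

```python
def classify_discrepancy(a_c, b_c, tol=0.5):
    """Guess why two reported melting points (nominally in deg C) disagree.

    Hypothetical helper: the tolerance `tol` is an illustrative choice,
    not a value taken from the presentation.
    """
    def close(x, y):
        return abs(x - y) <= tol

    if close(a_c, b_c):
        return "consistent"
    # One value is the negative of the other.
    if close(a_c, -b_c):
        return "sign inversion"
    # One value looks like a Kelvin temperature stored as Celsius.
    if close(a_c, b_c - 273.15) or close(a_c - 273.15, b_c):
        return "Kelvin reported as Celsius"
    # One value looks like a Fahrenheit temperature stored as Celsius.
    if close(a_c, (b_c - 32) * 5 / 9) or close((a_c - 32) * 5 / 9, b_c):
        return "Fahrenheit reported as Celsius"
    return "unexplained"


if __name__ == "__main__":
    # Made-up value pairs, used only to exercise each branch.
    for pair in [(80.0, 80.2), (-20.0, 20.0), (80.0, 353.15), (100.0, 212.0)]:
        print(pair, classify_discrepancy(*pair))
```

A pair classified as "unexplained" still needs manual curation; the screen only suggests a likely cause for the simpler cases.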
Open melting point datasets
Double+ validated: 2706 compounds (7413 highly curated measurements; range: 0.01-5 °C. Compounds that had at least one chiral center, possessed cis/trans isomerism, or were inorganic or a salt were removed.)
Entire dataset: 19933 unique compounds (27684 measurements – no inorganics or salts)
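One way to read the "double+ validated" criterion is that a compound qualifies when it has at least two measurements whose spread falls between 0.01 and 5 °C. The sketch below applies that reading; it is an interpretation, not the authors' published procedure. The function name `double_validated`, the `(compound_id, mp_celsius)` input format, and the use of the mean as the representative value are all assumptions.

```python
from collections import defaultdict


def double_validated(measurements, min_range=0.01, max_range=5.0):
    """Select compounds whose repeated measurements agree closely.

    measurements: iterable of (compound_id, mp_celsius) pairs (assumed format).
    Returns {compound_id: mean melting point} for compounds with at least
    two measurements whose spread lies within [min_range, max_range].
    """
    by_compound = defaultdict(list)
    for compound_id, mp in measurements:
        by_compound[compound_id].append(mp)

    kept = {}
    for compound_id, values in by_compound.items():
        if len(values) < 2:
            continue  # single-source value: no redundancy to rely on
        spread = max(values) - min(values)
        if min_range <= spread <= max_range:
            kept[compound_id] = sum(values) / len(values)
    return kept


if __name__ == "__main__":
    # Made-up measurements for four hypothetical compounds A-D.
    data = [("A", 122.0), ("A", 123.0),   # spread 1.0  -> kept
            ("B", 10.0), ("B", 10.0),     # spread 0.0  -> dropped
            ("C", 50.0), ("C", 70.0),     # spread 20.0 -> dropped
            ("D", 5.0)]                   # single value -> dropped
    print(double_validated(data))
```

Under this reading, the nonzero lower bound would exclude identical values, which may simply have been copied from one source to another rather than measured independently.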
Modeling Results
Model  Training set  Test set (TS)  Descriptors  TS AAE  TS RMSE  TS R²
1      2205          500            132 2D       29.51   40.91    0.82
1      2204          500            170 2D/3D    29.52   40.79    0.83
2      16015         500            137 2D       26.62   36.35    0.86
3      16015         3500           137 2D       29.36   40.18    0.81
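For readers unfamiliar with the test-set (TS) statistics in the table, the sketch below shows how such values are commonly computed from observed and predicted melting points. It assumes AAE means average absolute error and computes R² as the coefficient of determination; the presentation does not state which R² definition was used (squared correlation is another common choice), and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return (AAE, RMSE, R2) for paired observed/predicted values.

    R2 here is the coefficient of determination, 1 - SS_res / SS_tot,
    which is an assumption about the definition used in the table.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]

    aae = sum(abs(e) for e in errors) / n      # average absolute error
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    rmse = math.sqrt(ss_res / n)               # root mean squared error

    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    return aae, rmse, r2


if __name__ == "__main__":
    # Made-up values, not data from the presentation.
    obs = [0.0, 10.0, 20.0, 30.0]
    pred = [2.0, 8.0, 22.0, 28.0]
    print(regression_metrics(obs, pred))
```

RMSE weights large errors more heavily than AAE, which is consistent with the RMSE column being larger than the AAE column in every row of the table.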
Straight chain carboxylic acids from 1 to 10 carbons
Straight chain alcohols from 1 to 10 carbons
Comparison of model with triple validated measurements
Cyclic primary amines from 3 to 6 carbons (cyclobutylamine flagged for validation; only a single source available)