2012 acs skolnik symposium - chemspotlight
Automated Molecular Data Extraction using Open Babel & ChemSpotlight:
The Semantic Desktop
Prof. Geoff Hutchison
Department of Chemistry
University of Pittsburgh
ACS CINF: Skolnik Symposium
21 August 2012
http://hutchison.chem.pitt.edu
“I can plug my iPod into any computer and it will recognize my music and give me all sorts of metadata: artist, title, type of music... Why can’t I read the chemical metadata off my chemistry files?”
- Prof. Henry S. Rzepa (Imperial College), Spring 2005 ACS Meeting, San Diego, CA
Pre-History: Chem://Dig
Index files, websites
Based on Chem MIME
Find files on extension
Perceive chemistry
Database Store
Search, Filter
Retrieval
H. Rzepa et al. New J. Chem (2002) 26 p. 656
Open Babel
Open Babel (Started 2001)
http://openbabel.org/
Free, open source chemical toolbox
Cross-platform: Win, Mac, Linux...
Both user-tools & C++ library
Interfaces in Python, Perl, Ruby, Java, C#
Supports chemistry, bioinformatics, solid-state…
100+ file formats and variants
O’Boyle et al. J. Cheminf. 2011, 3:33
Chemical Database?
1. Some way to store data (Organize it)
2. Index it
3. Search / filter
4. Visualize results
ChemSpotlight: Indexing Architecture
Spotlight + Open Babel + ~300 lines of code
http://chemspotlight.openmolecules.net/
ChemSpotlight: “Un” Database
Use the system-wide search database
No (Visible) Database!
Index files in-place
Includes textual data (e.g., chemical names, formulas, etc.)
Multiple retrieval and filtering interfaces (i.e., any third-party search tool works)
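The index-in-place idea can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not ChemSpotlight's code: a hand-written XYZ reader stands in for Open Babel, an in-memory dict stands in for the Spotlight index, and the names hill_formula, read_xyz_symbols, index_tree and demo are hypothetical.

```python
import os
import tempfile
from collections import Counter


def hill_formula(symbols):
    """Return a Hill-order formula: C, then H, then other elements alphabetically."""
    counts = Counter(symbols)
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(el for el in counts if el not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


def read_xyz_symbols(path):
    """Read element symbols from a simple XYZ file (atom count, comment, atom lines)."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    n_atoms = int(lines[0])
    return [line.split()[0] for line in lines[2:2 + n_atoms]]


def index_tree(root):
    """Map each .xyz path under root to its metadata; the files are never moved."""
    index = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".xyz"):
                path = os.path.join(folder, name)
                symbols = read_xyz_symbols(path)
                index[path] = {"formula": hill_formula(symbols), "atoms": len(symbols)}
    return index


def demo():
    """Index a throwaway directory holding one water molecule."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "water.xyz")
        with open(path, "w") as handle:
            handle.write("3\nwater\nO 0.0 0.0 0.0\nH 0.76 0.59 0.0\nH -0.76 0.59 0.0\n")
        return list(index_tree(root).values())
```

The data files stay where the user put them; only the separate index holds the perceived metadata, so deleting or rebuilding the index never touches the originals.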
So What’s Stored / Perceived
Formula, mass, SMILES, InChI
  net_sourceforge_openbabel_Formula = C21H36N7O8S
Fingerprints, number of atoms, bonds, residues
PDB, SDF keywords, properties
Calculation keywords:
  kMDItemComment = "Gaussian 09 #n B3LYP/6-31G(d) Opt"
Calculation results (HOMO, LUMO, Dipole Moment)
  net_sourceforge_chemspotlight_DipoleMoment = 3.5
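Attributes like these can be searched from a script by passing a Spotlight query expression to the macOS mdfind command. The Python sketch below only builds and runs such a query. The helper names are hypothetical, the attribute names are copied from the slide, and it assumes the dipole moment is indexed as a number rather than as text.

```python
import subprocess


def numeric_clause(attribute, operator, value):
    """Build a numeric comparison such as 'attr > 3.0'."""
    if operator not in ("==", "!=", "<", "<=", ">", ">="):
        raise ValueError("unsupported operator: " + operator)
    return "{} {} {}".format(attribute, operator, value)


def text_clause(attribute, fragment):
    """Build a case-insensitive substring match using Spotlight wildcards."""
    escaped = fragment.replace("\\", "\\\\").replace('"', '\\"')
    return '{} == "*{}*"c'.format(attribute, escaped)


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " && ".join("({})".format(clause) for clause in clauses)


def run_query(query):
    """Run the query with mdfind (macOS only) and return the matching paths."""
    result = subprocess.run(["mdfind", query], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Example: B3LYP calculations whose dipole moment exceeds 3.0 (threshold chosen arbitrarily).
QUERY = all_of(
    text_clause("kMDItemComment", "B3LYP"),
    numeric_clause("net_sourceforge_chemspotlight_DipoleMoment", ">", 3.0),
)
```

Calling run_query(QUERY) on a Mac with the importer installed would return file paths; on other systems only the query string is useful.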
How Do We Visualize?
“QuickLook” previews
New code ~800 lines
Generate SDF, PDB, CIF (if needed)
Pass off to ChemDoodle Web Components
Pseudo-3D, interactive JS + HTML5
… or SVG generation from Open Babel
http://web.chemdoodle.com/
Organic Heterojunction Solar Cells
[Figure: device schematic. Labels: light, Transparent Electrode, p-type material, n-type material, Reflective Electrode, +/- Circuit]
[Figure: energy-level diagram. Labels: Optical Excitation (hν), e-, h+, ΔE ≥ Exciton Binding Energy, Effective Heterojunction Bandgap, Anode, Cathode, Hole Conducting Polymer, Electron Conductor (Nanoparticle)]
Pipeline Model for Finding New Molecules
[Figure: pipeline from Monomers to >10^6 Possible Structures, followed by Electronic Properties, Optical Properties, and Synthetic Score stages. Other labels: Fast Screening, Slower, ~9 minutes]
J Phys Chem C 2011 vol. 115 pp. 16200
New Genetic Algorithm Approach
Rather than directly driving calculations and waiting for results:
Check Spotlight for new results
“What are top HOMO energies?”
Update GA, generate new candidates, submit new jobs
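One update step of such a loop can be sketched as follows. This is a hedged illustration, not the code used in the talk: a plain dict stands in for the results read back from Spotlight, candidates are tuples of hypothetical monomer labels, and "top" is assumed to mean the highest (least negative) HOMO energy.

```python
import random


def top_candidates(results, keep):
    """Return the `keep` candidates with the highest HOMO energy (an assumed ranking)."""
    ranked = sorted(results, key=results.get, reverse=True)
    return ranked[:keep]


def crossover(parent_a, parent_b, rng):
    """One-point crossover between two equal-length monomer sequences."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(candidate, monomers, rng, rate=0.2):
    """Replace each position with a random monomer with probability `rate`."""
    return tuple(rng.choice(monomers) if rng.random() < rate else unit for unit in candidate)


def next_generation(results, monomers, keep, size, rng, max_tries=1000):
    """Propose up to `size` candidates that have no indexed result yet."""
    parents = top_candidates(results, keep)
    proposals = set()
    for _ in range(max_tries):
        if len(proposals) >= size:
            break
        child = crossover(rng.choice(parents), rng.choice(parents), rng)
        child = mutate(child, monomers, rng)
        if child not in results:  # skip anything already computed and indexed
            proposals.add(child)
    return sorted(proposals)


# Made-up HOMO energies (eV), as if read back from the index.
EXAMPLE_RESULTS = {
    ("A", "B", "A", "B"): -7.1,
    ("C", "C", "A", "B"): -8.4,
    ("B", "B", "B", "B"): -6.9,
}
```

Because the step reads whatever results exist at that moment, the algorithm does not need to block on any single calculation; next_generation(EXAMPLE_RESULTS, ["A", "B", "C"], 2, 5, random.Random(0)) would return the next batch of jobs to submit.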
Scaling Up the Polymer Solar Search
[Plot: LUMO Energy (eV), −3 to 0, versus HOMO Energy (eV), −9.5 to −6.5]
2nd Gen. Search:
680 Monomers
2800+ Fragments
Search Space: 500+ million oligomers
~9 minutes per core
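A back-of-the-envelope calculation shows why this space cannot simply be enumerated. The sketch below assumes the ~9 minutes per core applies to every oligomer and uses 500 million as a lower bound; the 1,000-core figure is a hypothetical cluster size, not one from the talk.

```python
OLIGOMERS = 500_000_000   # "500+ million oligomers" (lower bound from the slide)
MINUTES_PER_CALC = 9      # "~9 minutes per core" (from the slide)


def core_years(n_calcs, minutes_each):
    """Convert a number of calculations into core-years of compute."""
    minutes_per_year = 60 * 24 * 365
    return n_calcs * minutes_each / minutes_per_year


def wall_clock_years(n_calcs, minutes_each, cores):
    """Elapsed years if the work is spread perfectly over `cores` cores."""
    return core_years(n_calcs, minutes_each) / cores


EXHAUSTIVE = core_years(OLIGOMERS, MINUTES_PER_CALC)                 # about 8,560 core-years
ON_CLUSTER = wall_clock_years(OLIGOMERS, MINUTES_PER_CALC, 1000)    # about 8.6 years
```

Under these assumptions an exhaustive scan costs roughly 8,500 core-years, which is the motivation for a guided search that evaluates only a small fraction of the candidates.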
Take-Home Messages
“Big Data” is a Big Headache
ChemSpotlight & Un-Databases Work!
Keep data as native files w/ separate index
Integrate into user-friendly tools
Sell to users: “What’s in it for me?”
Indexing, retrieval
Improved workflows
Dr. Noel O’Boyle, U.C. Cork, Ireland
Casey Campbell, Pitt (2010)
Marcus Hanwell, Pitt / Kitware