RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
DESCRIPTION
The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. The Royal Society of Chemistry hosts ChemSpider, a free access website built with the intention of building community for chemists (http://www.chemspider.com/). ChemSpider is an aggregator of chemistry-related information, at present covering over 20 million unique chemical entities linked out to over 300 separate data sources, and it has taken on the task of both robotically and manually curating publicly available data sources. It is also a public deposition platform where chemists can deposit their own data, including novel structures, analytical data and synthesis procedures, and host data associated with the growing activities of Open Notebook Science. This presentation will examine chemistry on the internet, the dubious quality of what is available, and how the ChemSpider crowdsourced curation platform is fast becoming one of the centralized hubs for resourcing information about chemical entities. We will also review our efforts to provide free resources for synthesis procedures, spectral data and structure-based searching of the chemistry literature, and how chemists can contribute directly to each of these projects.
TRANSCRIPT
Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Lawrence Berkeley National Laboratory, March 2010
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based searches across the internet
Data are dirty: sorting the wheat from the chaff. Who can you trust?
Too many searches required to resource data
The Final Search Strategy
All Those Names, One Structure
A problem to solve…
Trustworthy Chemistry?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Where Would You look? What Do You Trust?
Question Everything online: www.dhmo.org
Di-Hydrogen Monoxide
2H + 1O
H2O
Water
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
Drugs are REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from C&E Senior Editor
“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
Structural Data for Life Sciences: DailyMed
Lack of Stereochemistry
Incorrect Structures
Ugh…
Just “Public Compound” Databases
PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider
As few interfaces as possible
What do humans want?
A Pragmatic Vision: “Build a Structure Centric Community to Serve Chemists”
December 2006 – A hobby project initiated to connect chemistry on the web
Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
Answer Questions
Questions a student might ask…
What is the structure of levulinic acid?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
How can I synthesize 2,4-dichlorophenol?
What are the safety handling issues for Thymol Blue?
What is Levulinic Acid?
Basic Info
Wikipedia and External Links
External Links to Data
Linked across the internet
Kyoto Encyclopedia of Genes and Genomes
Google Patent Integration
Access to Articles
RSC Journals
RSC Books
PubMed
Google Scholar
Google Books
Microsoft Academic Search
Access to Articles
Google Scholar
Experimental and Predicted Properties
ChemSpider : Spectra Linked
Search “OEA”
Linked Patents for OEA
Statistics for Today
>25 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Searching Chemistry on the Internet
How complete a result set will we get if we search for “chemicals” by name?
Is there a better way to link chemistry databases? Linking by “names” is dangerous
Chemists want structure and SUBstructure searching
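To make the danger of name-based linking concrete, here is a minimal Python sketch. The two record lists, their field names, and the helper `link_by` are hypothetical toy data for illustration, not ChemSpider's schema; the InChIKeys are the standard keys for water and aspirin.

```python
# Toy records from two hypothetical sources; field names are assumptions.
source_a = [
    {"name": "Water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
    {"name": "Aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]
source_b = [
    {"name": "Dihydrogen monoxide", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
    {"name": "Acetylsalicylic acid", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]


def link_by(field, left, right):
    """Pair up records from two sources that share the same value of `field`."""
    index = {}
    for rec in right:
        # Case-insensitive comparison so trivial capitalization differences do not matter.
        index.setdefault(rec[field].lower(), []).append(rec)
    return [
        (l["name"], r["name"])
        for l in left
        for r in index.get(l[field].lower(), [])
    ]


by_name = link_by("name", source_a, source_b)      # synonyms never line up
by_key = link_by("inchikey", source_a, source_b)   # same structure, same key

print("linked by name:", by_name)
print("linked by InChIKey:", by_key)
```

In this toy case the name join finds nothing because each source uses a different synonym, while the structure-derived key links both compounds.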
The InChI Identifier
Multiple Layers
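As a rough illustration of the layered structure, the Python sketch below splits an InChI string on its "/" separators. The helper name `split_inchi` is hypothetical, and the sketch assumes a formula layer is present and that each layer prefix letter appears only once, which does not hold for every InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into its version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    # Simplifying assumption: first segment is the version, second the formula.
    version, formula, rest = parts[0], parts[1], parts[2:]
    # Remaining segments start with a one-letter prefix, e.g. 'c' (connections)
    # or 'h' (hydrogens).
    layers = {segment[0]: segment[1:] for segment in rest if segment}
    return {"version": version, "formula": formula, "layers": layers}


# Standard InChI for ethanol.
ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```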
InChI Strings Hash to InChIKeys
Link the Internet with InChIKeys!
Taken from: Rafael Sidis’ Blog
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
Vancomycin on ChemSpider 1 compound – 3 days
InChIKeys
RCINICONZNJXQF-MZXODVADSA-N
Make the internet searchable by adding InChIKeys
Publishers add InChIKeys to papers now…
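The skeleton and full-molecule searches shown earlier rely on the fixed layout of an InChIKey: a 14-letter first block hashed from the connectivity (the molecular skeleton), a second block covering the remaining layers such as stereochemistry, followed by a standard/non-standard flag and a version letter, and a final protonation letter. A minimal Python sketch, using the key quoted on this slide as sample input; `split_inchikey` and `search_terms` are hypothetical helper names, not a ChemSpider API.

```python
import re

# Shape of a 27-character InChIKey: 14 letters, then 8 + flag + version, then 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Break an InChIKey into its blocks; raise ValueError if the shape is wrong."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # connectivity hash
        "detail": detail,            # hash of the remaining layers
        "standard": flag == "S",     # 'S' marks a standard InChIKey
        "version": version,
        "protonation": protonation,
    }


def search_terms(key):
    """Text queries for an exact-molecule search and a looser skeleton search."""
    parts = split_inchikey(key)
    return {"full_molecule": key, "skeleton": parts["skeleton"]}


terms = search_terms("RCINICONZNJXQF-MZXODVADSA-N")
print(terms)
```

Pasting the `skeleton` term into a web search engine should match pages for any compound sharing the same connectivity, which is why a skeleton search returns more hits than a full-key search.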
RCINICONZNJXQF-MZXODVADSA-N is what???
The InChI “Resolver”
InChI Resolver to DOIs
Structure Search the Web
Most Chemistry is NOT Published
Only a fraction of chemistry is published
Only a tiny fraction of chemistry is patented
What of the “Lost Chemistry” - never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found
The CAS Registry
CAS Registry
Crowd-sourcing Curation and Deposition
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Building a Structure Centric Community for Chemists
Multi-level Curation and Approval
Entity-Extraction, Mark-up, Annotate
Semantic Markup: Project Prospect
Success Depends on Dictionaries
Link to a Structure or the Right Structure?
Name-Structure Pairs
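A minimal Python sketch of dictionary-based mark-up follows, to show why the quality of the name-structure pairs matters. The tiny dictionary and the function `mark_up` are hypothetical; this is not the Project Prospect pipeline, only an illustration of matching names in text and attaching a structure identifier.

```python
import re

# Hypothetical name-to-structure dictionary; a real one would be much larger and curated.
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
}


def mark_up(text, dictionary=NAME_TO_INCHIKEY):
    """Return (matched text, InChIKey, start offset) for every dictionary name found."""
    # Try longer names first so a multi-word name wins over a shorter overlap.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), dictionary[m.group(0).lower()], m.start())
        for m in pattern.finditer(text)
    ]


hits = mark_up("Aspirin (acetylsalicylic acid) was dissolved in water.")
for name, key, offset in hits:
    print(offset, name, key)
```

A wrong entry in the dictionary would link every occurrence of that name to the wrong structure, which is the concern raised by "Link to a Structure or the Right Structure?".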
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Org Prep Daily (Blog)
Micro- and Nano-publications
Blogs, wiki entries and even Amazon book reviews are micro/nano-publications
ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume
Structures and spectra are nano-publications (depositions, curations, etc.); these can also be tracked and referenced. Students participate in building one of the premier sources of chemistry data.
ChemSpider SyntheticPages
Submission process
Register as a user
Use the Submit button and fill in the fields…
Submission Process
Submissions reviewed by editorial board
Published as is or comments sent to author
Online Peer Review process
Data supported include web movies, images, live spectra etc.
ChemSpider : Spectra Linked
ChemSpider ID 24528095 H1 NMR
ChemSpider ID 24528095 C13 NMR
ChemSpider ID 24528095 HHCOSY
ChemSpider ID 24528095 HSQC
ChemSpider ID 24528095 HMBC
Full C13 assignment uploaded
Not Just NMR Data
Spectra on ChemSpider
Available Spectra http://www.chemspider.com/spectra.aspx
Sources of Spectra
Sourced from online sources with permission
Private collections
The MAJORITY deposited by ChemSpider users
How Could Students Help? Part 1
Students can help “curate” the data – check whether the spectra are consistent with the compound
If not then flag them, annotate them and provide feedback
OR…play the game
www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9
Spectral Game
Increasing Complexity
Spectral Game
True Curation of Data
How Could Students Help? Part 2
Add their own data to the database!
Spectra from:
research projects
lab sessions
supplementary data sections in publications
Spectral Uploading
Locate the structure of interest and deposit spectrum
Spectral Uploading
Various types of NMR spectra supported
Deposit spectra against new structure
If a NEW compound has spectral data then deposit the structure onto ChemSpider first
How Else Can Students Help?
Students can deposit single structures or thousands of structures – UNIQUE chemistry can be added and “claimed”
Data can be curated/edited and annotated – simply register and request the rights
25 million structures, >300 data sources…there are errors of course!
NMRShiftDB
NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/
NMR Prediction
Multinuclear NMR Prediction
NMRShiftDB Data Review
High quality NMR shift set of ca. 100,000 shifts
Multiple outliers identified
Removed following publication
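The slides do not say how the outliers were identified. One common approach, sketched below in Python as an assumption rather than the method actually used, is to compare each assigned experimental shift with a predicted value and flag large deviations. The shift values, the 10 ppm threshold, and the name `flag_outliers` are all made up for illustration.

```python
def flag_outliers(assignments, threshold_ppm=10.0):
    """Return assignments whose experimental shift deviates from the prediction
    by more than threshold_ppm (an arbitrary, illustrative cut-off)."""
    return [
        a for a in assignments
        if abs(a["experimental_ppm"] - a["predicted_ppm"]) > threshold_ppm
    ]


# Made-up 13C values for illustration only; not taken from NMRShiftDB.
assignments = [
    {"atom": 1, "experimental_ppm": 128.4, "predicted_ppm": 129.0},
    {"atom": 2, "experimental_ppm": 21.3, "predicted_ppm": 20.8},
    {"atom": 3, "experimental_ppm": 170.2, "predicted_ppm": 35.1},
]

suspects = flag_outliers(assignments)
print([a["atom"] for a in suspects])
```

Flagged entries would still need a human check, since a large deviation can come from a misassignment, a wrong structure, or a poor prediction.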
ChemSpider Integrated NMR Prediction
Initial integration in place
A Game Through Embedding Data
Embedding Structures
Do you write Wikipedia Articles?
ChemSpider Web Services
How Can You Help ChemSpider?
Deposit your data and share with the community
Structures – one or many
Spectra
Links
Syntheses into SyntheticPages
Curate data – most basic level…just add comments
Spread the word – ChemSpider is an untapped resource
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
Classical business models will have to morph
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams