rsc chemspider -- managing and integrating chemistry on the internet to build community for chemists

Managing and Integrating Chemistry on the Internet to Build Community for Chemists. Lawrence Berkeley National Laboratory, March 2010

Upload: orcid-0000-0002-2668-4821

Post on 11-May-2015

DESCRIPTION

The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. The Royal Society of Chemistry hosts ChemSpider, a free access website built with the intention of building community for chemists (http://www.chemspider.com/). ChemSpider is an aggregator of chemistry-related information, at present holding over 20 million unique chemical entities linked out to over 300 separate data sources, and it has taken on the task of both robotically and manually curating publicly available data sources. It is also a public deposition platform where chemists can deposit their own data, including novel structures, analytical data and synthesis procedures, and host data associated with the growing activities of Open Notebook Science. This presentation will examine chemistry on the internet, the dubious quality of what is available, and how the ChemSpider crowdsourced curation platform is fast becoming one of the centralized hubs for sourcing information about chemical entities. We will also review our efforts to provide free resources for synthesis procedures, spectral data and structure-based searching of the chemistry literature, and how chemists can contribute directly to each of these projects.

TRANSCRIPT

Page 1: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Lawrence Berkeley National Laboratory, March 2010

Page 2: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 3: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 4: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 5: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 6: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

The Final Search Strategy

Page 7: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

All Those Names, One Structure

A problem to solve…

Page 8: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 9: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Trustworthy Chemistry?

• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science

Page 10: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Where Would You look? What Do You Trust?

Page 11: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Question Everything online: www.dhmo.org

Page 12: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Di-Hydrogen Monoxide

2H

Page 13: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Di-Hydrogen Monoxide

2H + 1O

Page 14: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Di-Hydrogen Monoxide

H2O

Page 15: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Di-Hydrogen Monoxide

H2O

Water

Page 16: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

It’s all on Wikipedia…

Page 17: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on The Internet Is Messy

Page 18: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

It’s Methane…

Page 19: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

What’s Methane?

Page 20: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

What’s Methane?

Page 21: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

What ELSE is Methane???

Page 22: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Drugs are REALLY Messy

Page 23: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Page 24: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

The EXPERTS must get it right?!

Page 25: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Wikipedia, C&E News, PubChem C&E News (from ACS)

Page 26: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Feedback from C&E Senior Editor

“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

Page 27: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Structural Data for Life Sciences

DailyMed

Page 28: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Lack of Stereochemistry

Page 29: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Incorrect Structures

Page 30: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Ugh…

Page 31: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 32: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Just “Public Compound” Databases

• PubChem
• Drugbank
• ChEBI/ChEMBL
• KEGG
• LipidMAPs
• ChemIDPlus
• eMolecules
• ZINC
• Lots of chemical vendors
• ChemSpider

Page 33: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

media.obsessable.com

As few interfaces as possible

What do humans want?

Page 34: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

December 2006 – A hobby project initiated to connect chemistry on the web

• Integrate chemical structure data on the web
• Create a “structure-based hub” to information and data
• Provide access to structure-based “algorithms”
• Let chemists contribute their own data
• Allow the community to curate/correct data

Page 35: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Answer Questions

Questions a student might ask…

• What is the structure of levulinic acid?
• Chemically, what is phenolphthalein?
• What are the stereocenters of cholesterol?
• Where can I find publications about xylene?
• What are the different trade names for Ketoconazole?
• What is the NMR spectrum of Aspirin?
• How can I synthesize 2,4-dichlorophenol?
• What are the safety handling issues for Thymol Blue?

Page 36: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

What is Levulinic Acid?

Page 37: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

What is Levulinic Acid?

Page 38: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Basic Info

Page 39: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Wikipedia and External Links

Page 40: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

External Links to Data

Page 41: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Linked across the internet

Page 42: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Kyoto Encyclopedia of Genes and Genomes

Page 43: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Google Patent Integration

Page 44: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Access to Articles

• RSC Journals
• RSC Books
• PubMed
• Google Scholar
• Google Books
• Microsoft Academic Search

Page 45: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Access to Articles

Page 46: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Google Scholar

Page 47: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Experimental and Predicted Properties

Page 48: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider : Spectra Linked

Page 49: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 50: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Search “OEA”

Page 51: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Search OEA

Page 52: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Search OEA

Page 53: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Search OEA

Page 54: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Linked Patents for OEA

Page 55: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 56: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Statistics for Today

>25 million compounds from >300 data sources

About 7000 unique users per day and up to ½ million transactions per day

A crowdsourced deposition and curation platform

Grows daily – more depositions, more links, more data

Page 57: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Searching Chemistry on the Internet

How complete a result set will we get if we search for “chemicals” by name?

Is there a better way to link chemistry databases? Linking by “names” is dangerous

Chemists want structure and SUBstructure searching
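Why linking by name is fragile can be sketched in a few lines of Python. The example below assumes a tiny hand-built synonym table (the table contents and the helper names `resolve_name` and `names_by_structure` are illustrative, not ChemSpider's data model) and keys every name on a structure identifier, here the standard InChIKeys for water and methane, the identifier the talk introduces on the following slides.

```python
from collections import defaultdict

# Illustrative name -> standard InChIKey pairs (a hand-built toy dictionary).
NAME_TO_KEY = {
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "dihydrogen monoxide": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "di-hydrogen monoxide": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "oxidane": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "methane": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
    "marsh gas": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def resolve_name(name):
    """Look up a name after lower-casing it and collapsing whitespace.

    Returns the structure key, or None when the name is not in the table.
    """
    normalized = " ".join(name.lower().split())
    return NAME_TO_KEY.get(normalized)


def names_by_structure(table):
    """Invert the table: one structure key -> every name that points at it."""
    groups = defaultdict(list)
    for name, key in table.items():
        groups[key].append(name)
    return dict(groups)


if __name__ == "__main__":
    for key, names in names_by_structure(NAME_TO_KEY).items():
        print(key, "<-", ", ".join(sorted(names)))
```

A text search for "water" would miss every record filed under "oxidane" or "Di-Hydrogen Monoxide"; a lookup through the structure key finds them all in one step.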

Page 58: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

The InChI Identifier

Page 59: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Multiple Layers
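The layered structure of an InChI string can be seen by splitting it on its "/" separators. The Python sketch below is a simplified illustration, not a full InChI parser: it assumes the string contains a formula layer directly after the version, and it keeps only the first occurrence of each one-letter layer prefix. The function name `parse_inchi` is hypothetical.

```python
def parse_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers.

    Simplifying assumptions: a formula layer is present, and a layer
    prefix that occurs more than once is recorded only the first time.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len(prefix):].split("/")
    version = parts[0]
    formula = parts[1] if len(parts) > 1 else ""
    layers = {}
    for segment in parts[2:]:
        if segment:
            # The first character names the layer, e.g. c = connections,
            # h = hydrogens, q = charge, t/m/s = stereochemistry.
            layers.setdefault(segment[0], segment[1:])
    return {"version": version, "formula": formula, "layers": layers}


if __name__ == "__main__":
    # Standard InChI strings for ethanol and methane.
    print(parse_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
    print(parse_inchi("InChI=1S/CH4/h1H4"))
```

Methane has no connection layer at all, because a single heavy atom has nothing to connect to; only the hydrogen layer follows the formula.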

Page 60: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

InChI Strings Hash to InChIKeys

Page 61: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Link the Internet with InChIKeys!

Taken from: Rafael Sidi’s Blog

Page 62: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Vancomycin – Search the Internet

Page 63: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Page 64: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Full Molecule Search: 4 Hits

Page 65: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Full Skeleton Search: 104 Hits
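The difference between a full-molecule and a skeleton search follows from the shape of the InChIKey: the first 14-letter block is derived from the connectivity, while the second block covers stereochemistry and the other finer layers. A minimal Python sketch of grouping keys by that first block follows; the function names are hypothetical, and the second key in the demo is an assumption, formed by pairing the same first block with `UHFFFAOYSA`, the second block that standard InChIKeys carry when no stereo or isotope layer is present.

```python
from collections import defaultdict


def split_inchikey(key):
    """Return (skeleton_block, second_block, protonation_flag)."""
    skeleton, second, protonation = key.split("-")
    return skeleton, second, protonation


def group_by_skeleton(keys):
    """Group InChIKeys that share the same 14-letter connectivity block."""
    groups = defaultdict(list)
    for key in keys:
        groups[split_inchikey(key)[0]].append(key)
    return dict(groups)


if __name__ == "__main__":
    keys = [
        "RCINICONZNJXQF-MZXODVADSA-N",  # key shown in the talk
        "RCINICONZNJXQF-UHFFFAOYSA-N",  # assumed: same skeleton, no stereo layer
        "XLYOFNOQVPJJNP-UHFFFAOYSA-N",  # water
    ]
    for skeleton, members in group_by_skeleton(keys).items():
        print(skeleton, len(members))
```

Matching on the whole key corresponds to the full-molecule search; matching on the first block alone corresponds to the skeleton search, which is why it returns more hits.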

Page 66: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 67: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 68: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 69: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Vancomycin on ChemSpider

1 compound – 3 days

Page 70: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…
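Because an InChIKey has a fixed shape (14 capital letters, a hyphen, 10 capital letters, a hyphen, one capital letter), keys embedded in a page or paper are easy to pick out of plain text. The Python sketch below shows one way an indexer might do that; the pattern checks the shape only and does not verify that a match is a real key, and the names `INCHIKEY_PATTERN` and `find_inchikeys` are illustrative.

```python
import re

# Shape of an InChIKey: 14 letters - 10 letters - 1 letter, all upper case.
INCHIKEY_PATTERN = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return every InChIKey-shaped token in the text, in order of appearance."""
    return INCHIKEY_PATTERN.findall(text)


if __name__ == "__main__":
    sample = (
        "The compound (InChIKey RCINICONZNJXQF-MZXODVADSA-N) was "
        "suspended in water (XLYOFNOQVPJJNP-UHFFFAOYSA-N)."
    )
    print(find_inchikeys(sample))
```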

Page 71: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…

is what???

Page 72: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

The InChI “Resolver”

Page 73: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

InChI Resolver to DOIs

Structure Search the Web

Page 74: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry”: never published and cannot be abstracted

• Reactions performed
• Structures made and studied
• Spectra acquired and then disposed of
• Available chemicals never found

Page 75: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

The CAS Registry

Page 76: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

CAS Registry

Page 77: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Crowd-sourcing Curation and Deposition

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 78: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval

Page 79: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Entity-Extraction, Mark-up, Annotate

Page 80: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Semantic Markup: Project Prospect

Page 81: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Success Depends on Dictionaries

Link to a Structure or the Right Structure?

Page 82: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Name-Structure Pairs

Page 83: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Semantic Linking of Structures

What would you want to link off a structure?

• Chemical suppliers
• Other publications
• Analytical Data
• Related Reactions
• Wikipedia
• Patents
• “Everything”

Page 84: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Org Prep Daily (Blog)

Page 85: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Micro- and Nano-publications

Blogs, wiki entries and even Amazon book reviews are micro/nano-publications

ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume

Structures and spectra are nano-publications – these can be tracked and referenced also (depositions, curations etc.). Students participate in building one of the premier sources of chemistry data.

Page 86: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider SyntheticPages

Page 87: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Submission process

• Register as a user
• Use the Submit button and fill in the fields…

Page 88: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Submission Process

Submissions reviewed by editorial board

Published as is or comments sent to author

Online Peer Review process

Data supported include web movies, images, live spectra etc.

Page 89: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider : Spectra Linked

Page 90: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectra Linked

Page 91: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectra Linked

Page 92: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider ID 24528095 H1 NMR

Page 93: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider ID 24528095 C13 NMR

Page 94: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider ID 24528095 HHCOSY

Page 95: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider ID 24528095 HSQC

Page 96: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider ID 24528095 HMBC

Page 97: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Full C13 assignment uploaded

Page 98: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Not Just NMR Data

Page 99: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectra on ChemSpider

Page 100: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Available Spectra http://www.chemspider.com/spectra.aspx

Page 101: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Sources of Spectra

Sourced from online sources with permission

Private collections

The MAJORITY deposited by ChemSpider users

Page 102: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

How Could Students Help? Part 1

Students can help “curate” the data – check whether the spectra are consistent with the compound

If not then flag them, annotate them and provide feedback

OR…play the game

Page 103: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Page 104: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectral Game

Page 105: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Increasing Complexity

Page 106: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectral Game

Page 107: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

True Curation of Data

Page 108: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

How Could Students Help? Part 2

Add their own data to the database!

Spectra from:

• research projects
• lab sessions
• supplementary data sections in publications

Page 109: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectral Uploading

Locate the structure of interest and deposit spectrum

Page 110: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Spectral Uploading

Various types of NMR spectra supported

Page 111: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Deposit spectra against new structure

If a NEW compound has spectral data then deposit the structure onto ChemSpider first

Page 112: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

How Else Can Students Help?

Students can deposit single structures or thousands of structures – UNIQUE chemistry can be added and “claimed”

Data can be curated/edited and annotated – simply register and request the rights

25 million structures, >300 data sources…there are errors of course!

Page 113: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

NMRShiftDB

Page 114: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/

Page 115: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists
Page 116: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

NMR Prediction

Page 117: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Multinuclear NMR Prediction

Page 118: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

NMRShiftDB Data Review

• High quality NMR shift set of ca. 100,000 shifts
• Multiple outliers identified
• Removed following publication
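The slide does not say how the outliers were found. One plausible approach, sketched below in Python purely as an assumption, is to compare each experimental shift with a predicted value and flag large deviations for human review. The record layout, the 10 ppm threshold, the function name `flag_outliers` and the demo numbers are all hypothetical.

```python
def flag_outliers(records, threshold_ppm=10.0):
    """Return records whose experimental and predicted shifts disagree.

    Each record is a dict with 'atom', 'experimental_ppm' and
    'predicted_ppm'. Flagged records gain a 'deviation_ppm' entry and
    are sorted with the largest deviation first.
    """
    flagged = []
    for record in records:
        deviation = abs(record["experimental_ppm"] - record["predicted_ppm"])
        if deviation > threshold_ppm:
            flagged.append({**record, "deviation_ppm": deviation})
    return sorted(flagged, key=lambda r: r["deviation_ppm"], reverse=True)


if __name__ == "__main__":
    # Made-up 13C shifts; the second looks like a data-entry slip (17.0 vs 170.2).
    demo = [
        {"atom": "C1", "experimental_ppm": 128.5, "predicted_ppm": 129.1},
        {"atom": "C2", "experimental_ppm": 17.0, "predicted_ppm": 170.2},
    ]
    for hit in flag_outliers(demo):
        print(hit["atom"], round(hit["deviation_ppm"], 1))
```

A flag is only a prompt for a curator to look at the record; a large deviation can also mean the prediction is wrong.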

Page 119: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider Integrated NMR Prediction

Initial integration in place

Page 120: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

A Game Through Embedding Data

Page 121: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Embedding Structures

Page 122: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Do you write Wikipedia Articles?

Page 123: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Do you write Wikipedia Articles?

Page 124: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

ChemSpider Web Services

Page 125: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

How Can You Help ChemSpider?

Deposit your data and share with the community

• Structures – one or many
• Spectra
• Links
• Syntheses into SyntheticPages

Curate data – most basic level…just add comments

Spread the word – ChemSpider is an untapped resource

Page 126: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Chemistry on the Internet FUTURE

• The semantic web for chemistry is in place
• Crowdsourced contributions are commonplace
• Chemists will search by structure/substructure
• Chemistry articles indexed and searchable
• Reduced number of searches to find data
• Data are integrated – compounds, vendors, syntheses, data, publications and patents
• A world of Open Access and Open Data

Classical business models will have to morph

Page 127: RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build Community for Chemists

Thank you

[email protected]: ChemSpiderman

www.chemspider.com/blog

SLIDES: www.slideshare.net/AntonyWilliams