ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Post on 10-May-2015

DESCRIPTION

This is the presentation I gave at OpenSciNY 2010. It was a great gathering of librarians and people interested in Open Science. Sharing the stage with Beth Brown, Jean-Claude Bradley and Heather Joseph was, as usual, a good opportunity to discuss how openness and online data sharing are changing the way we access and share data. We live in interesting and exciting times.

TRANSCRIPT

ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community
OpenSciNY, New York, May 2010

Once Upon a Time Over a “Coffee”

Which is better for Plants? Vodka, Sprite or Viagra?

It Works – Viagra Wins the Day

Now Which is Better?

Viagra or Cialis?

Images sourced from Wikipedia

Cialis

I want…
The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…

Cialis on Google?

What is Cialis?

What is Cialis? Can we trust Wikipedia?

6 hits on PubChem

Search by Trade Name

Are there other names???

PubMed hits: 736 for Tadalafil, 744 for Cialis

IC351 on PubChem?

5 HITS for IC351

ZERO HITS for IC 351

Chemistry on the Web

Text searching the web is far from optimal

The quality of data on the web is a problem

It may be hard to find but it is “out there”

What was once locked up behind an expensive license can generally be found

Structure searching the web is already possible!

Text Searching the Web

Text searching the web for chemical compounds is an enormous challenge

RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?

The RSC Publishing Platform (Beta)

2+2 = 4 Articles?

CAS Number Search

Text Searching the Web

Disambiguation dictionaries of name-structure relationships would be very enabling.

IC351 = IC 351 = Tadalafil = Cialis = …

Creating validated dictionaries is an enormous challenge to cover chemistry
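A minimal sketch of what such a disambiguation dictionary could look like, assuming a tiny hand-built synonym table. The table, the `normalize` rule and the `resolve` helper are all hypothetical illustrations, not ChemSpider's implementation. The point is only that "IC351" and "IC 351" should land on the same record.

```python
import re

# Hypothetical, hand-built synonym table. A real dictionary would be curated
# and validated, which is exactly the hard part the slides describe.
SYNONYMS = {
    "tadalafil": ["Tadalafil", "Cialis", "IC351", "IC 351"],
}


def normalize(name: str) -> str:
    """Lowercase and drop spaces and hyphens so 'IC 351' and 'IC351' collide."""
    return re.sub(r"[\s\-]+", "", name).lower()


# Invert the table: every normalized synonym points at one canonical record.
LOOKUP = {
    normalize(synonym): canonical
    for canonical, names in SYNONYMS.items()
    for synonym in names
}


def resolve(name: str):
    """Return the canonical record key for a name, or None if it is unknown."""
    return LOOKUP.get(normalize(name))
```

Dropping spaces and hyphens is a deliberately crude rule that suits code names like IC351. It would be unsafe for systematic chemical names, where punctuation carries meaning, so a real dictionary needs per-name validation rather than one blanket normalization.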

CAS Registry – LOTS of Chemicals!

The Final Search Strategy: A “Disambiguation Query!”
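One way to picture a disambiguation query is a single search string that ORs together every known name for the compound. The sketch below assumes the target search engine accepts quoted phrases joined by `OR`. That syntax and the `disambiguation_query` helper are assumptions for illustration, since the slides do not show the actual query.

```python
def disambiguation_query(names):
    """Join every distinct name into one quoted OR query.

    Names are de-duplicated case-insensitively, keeping the first spelling.
    """
    seen = set()
    terms = []
    for name in names:
        key = name.casefold()
        if key in seen:
            continue
        seen.add(key)
        # Strip embedded double quotes so each term stays one quoted phrase.
        terms.append('"' + name.replace('"', "") + '"')
    return " OR ".join(terms)


query = disambiguation_query(["Tadalafil", "Cialis", "IC351", "IC 351", "cialis"])
```

The query is only as good as the name list behind it, which is why the slides stress validated dictionaries.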

All Those Names, One Structure: A problem to solve…

ChemSpider - A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

Aggregate and integrate chemical structure data on the web – names, structures, links

Create a “structure-based hub” to information, data and algorithmic predictions

Let chemists contribute their own data

Allow the community to curate/correct data

As few interfaces as possible

What do humans want?

Aggregating Data – Who to Trust???
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Just “Public Compound” Databases

PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors

Question Everything online: www.dhmo.org

Di-Hydrogen Monoxide: 2H + 1O = H2O = Water

It’s all on Wikipedia…

What About Gases? Methane…

What’s Methane?

What ELSE is Methane???

Structural Data for Life Sciences: DailyMed

Lack of Stereochemistry

Incorrect Structures

Pragmatic Vision Delivered…

Aggregate, integrate and link data from across the internet

Almost 25 million structures from >300 data sources

Linked to vendors, literature, online databases (open and commercial), open notebook science, patents and….

Robotic and Crowdsourced Curation

Search “OEA”

Linked Patents for OEA

Answering Questions…

Questions a student might ask…
What is the structure of levulinic acid?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
How can I synthesize 2,4-dichlorophenol?
What are the safety handling issues for Thymol Blue?

Back to Cialis…

Cialis on ChemSpider: 1 hit

Chemicals are curated/validated on ChemSpider by ourselves and the community

Based on assertions from various sources. Iterative, time-consuming and exacting!

We believe we know the structure now

What is linked and available?

Google Patents

ChemSpider – Patents Linked

SURECHEM PATENTS GOOGLE

Google Books

Microsoft Academic Search

Google Scholar – Articles were found by CAS Number!

Identifiers for Tadalafil

How Many Articles in RSC Journals?

Based on the CAS Number 171596-29-5, there are 13 articles in RSC journals
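CAS Registry Numbers carry a check digit, so one small, mechanical part of identifier validation can be automated before a number is used as a search key. The sketch below is an illustration under that assumption only. The function names are made up here, and a passing check digit says nothing about whether the number is attached to the right structure.

```python
import re


def normalize_cas(text: str) -> str:
    """Remove stray whitespace, e.g. '171596-29 -5' becomes '171596-29-5'."""
    return re.sub(r"\s+", "", text)


def is_valid_cas(text: str) -> bool:
    """Check the CAS Registry Number format and its final check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", normalize_cas(text))
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... counting from the right-most one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))
```

For the number on the slide, the weighted sum is 155, and 155 mod 10 gives the final digit 5.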

What about if we VALIDATE identifiers?

Validated Dictionaries Hit APIs: This is data curation...

Does this generate more results?

RSC Journals

REMEMBER 2+2 = 4

PubMed

Google Scholar – Expanded Hit Set

Microsoft Academic Search

Be careful! More mussels than drugs…

Searching Chemistry on the Internet

Do we get a complete result set if we search for “chemicals” only by name?

Is there a better way to link chemistry databases? Linking by “names” is dangerous

Chemists want structure and SUBstructure searching

Structure Searching the Web

We have resources about Tadalafil actively linked to ChemSpider

What about searching the web for Tadalafil by structure…not based on the various identifiers

How?

Link the Internet with InChIKeys!

Taken from: Rafael Sidis’ Blog

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

Cialis – Searching the Web by InChI

Search Molecular SKELETON

Search Full Molecule

InChI Search the Web by Skeleton: 78 Hits by Skeleton

InChI Search the Web Exact Match: 32 Hits by InChIKey

InChI Search the Web Exact Match: 6 Hits by Standard InChIKey

InChifying the Web

There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?

Our judgment…MISTAKES

Vancomycin – Search the Internet

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N
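The skeleton and full-molecule searches map onto the layout of an InChIKey. The first 14-letter block hashes the connectivity (the molecular skeleton). The second block holds 8 letters hashing the remaining layers such as stereochemistry, followed by a standard/non-standard flag (`S` or `N`) and a version letter. The final letter encodes protonation. The sketch below only splits and compares keys. It does not compute them from structures, and the helper names and the second key are invented for illustration.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str       # 14 letters: hash of the connectivity layers
    other_layers: str   # 8 letters: hash of stereo, isotopes, etc.
    is_standard: bool   # True when the flag letter is 'S'
    version: str        # InChI version letter
    protonation: str    # protonation indicator, 'N' for neutral


_INCHIKEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split a 27-character InChIKey into its blocks, or raise ValueError."""
    match = _INCHIKEY.fullmatch(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, other, flag == "S", version, protonation)


def same_skeleton(a: str, b: str) -> bool:
    """Skeleton search: compare only the first block."""
    return parse_inchikey(a).skeleton == parse_inchikey(b).skeleton


def exact_match(a: str, b: str) -> bool:
    """Full-molecule search: every block must agree."""
    return parse_inchikey(a) == parse_inchikey(b)


slide_key = "RCINICONZNJXQF-MZXODVADSA-N"
# Made-up key sharing the first block, standing in for a record that has
# different (or missing) stereochemistry. It is not a real compound's key.
made_up_variant = "RCINICONZNJXQF-AAAAAAAASA-N"
```

Here `same_skeleton(slide_key, made_up_variant)` is true while `exact_match` is false, which is the kind of gap that produces many more skeleton hits than exact hits.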

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…

But what is the structure???

We need an InChI “Resolver”

InChI Resolver to DOIs: Structure Search the Web

Semantic Markup: Project Prospect

Depends on Validated Dictionaries

Link to a Structure or the Right Structure?

Name-Structure Pairs

Semantic Linking of Structures

What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything” Through ChemSpider!

Unpublished Chemistry

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry” – never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found

Org Prep Daily (Blog)

ChemSpider SyntheticPages

Submission process
Register as a user
Use the Submit button and fill in the fields…

Submission Process

Submissions reviewed by editorial board

Published as is or comments sent to author

Online Peer Review process

Data supported include web movies, images, live spectra etc.

Micro- and Nano-publications

Blogs, wiki entries and even Amazon book reviews are micro/nano-publications

ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume

Structures and spectra are nano-publications – these (depositions, curations, etc.) can also be tracked and referenced. Students participate in building one of the premier sources of chemistry data.

ChemSpider : Spectra Linked

Not Just NMR Data

www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9

Spectral Game

Increasing Complexity

ChemSpider Content

ChemSpider is a container… supports multimedia
Spectra
Crystal structures
Images
MP3s
Videos

Roses’ Crystal Image Collection

MP3s and Videos : Titanium

Periodic Table Images

How Can You Help ChemSpider?

Deposit your data and share with the community
Structures – one or many
Spectra
Links
Syntheses into SyntheticPages

Curate data – most basic level…just add comments

Spread the word – ChemSpider is an untapped resource

Community Contribution

We can make a bigger contribution to the community if the community shares via ChemSpider

Don’t underestimate what others will find of value

ChemSpider wins “Community contribution” best practice award

Chemistry on the Internet: FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Thank you

antony.williams@chemspider.com
Twitter: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams