chemspider - building a crowdsourced chemical database for the chemistry community

ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community OpenSciNY, New York, May 2010,

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015


DESCRIPTION

This is the presentation I gave at OpenSciNY 2010. It was a great gathering of librarians and people interested in Open Science. Sharing the stage with Beth Brown, Jean-Claude Bradley and Heather Joseph was, as usual, a good opportunity to discuss how openness and online data sharing are changing the way we access and share data. We live in interesting and exciting times.

TRANSCRIPT

Page 1: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community OpenSciNY, New York, May 2010,

Page 2: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Once Upon a Time Over a “Coffee”

Page 3: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Which is better for Plants? Vodka, Sprite or Viagra?

Page 4: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

It Works – Viagra Wins the Day

Page 5: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Now Which is Better?

Viagra or Cialis?

Images sourced from Wikipedia

Page 6: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Cialis

I want…

The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…

Page 7: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Cialis on Google?

Page 8: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What is Cialis?

Page 9: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What is Cialis? Can we trust Wikipedia?

Page 10: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What is Cialis?

6 hits on PubChem

Page 11: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What is Cialis?

Page 12: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Search by Trade Name

Page 13: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Are there other names???

Page 14: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Are there other names???

PubMed hits: 736 Tadalafil 744 Cialis

Page 15: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Are there other names???

Page 16: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Are There Other Names?

Page 17: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

IC351 on PubChem?

5 HITS for IC351

ZERO HITS for IC 351

Page 18: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Chemistry on the Web

Text searching the web is far from optimal

The quality of data on the web is a problem

It may be hard to find but it is “out there”

What was once locked up behind an expensive license can generally be found

Structure searching the web is already possible!

Page 19: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Text Searching the Web

Text searching the web for chemical compounds is an enormous challenge

RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?

Page 20: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

The RSC Publishing Platform (Beta)

Page 21: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

2+2 = 4 Articles?

Page 22: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

CAS Number Search

Page 23: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Text Searching the Web

Disambiguation dictionaries of name-structure relationships would be very enabling.

IC351 = IC 351 = Tadalafil = Cialis = …

Creating validated dictionaries that cover chemistry is an enormous challenge
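The name equivalence on this slide (IC351 = IC 351 = Tadalafil = Cialis) can be sketched as a small synonym dictionary. This is only an illustration, not how ChemSpider stores its dictionaries: the `SYNONYMS` table, the function names and the normalization rule are all assumptions, and stripping spaces and hyphens would be too aggressive for systematic chemical names.

```python
import re

# Hypothetical synonym table: every known name for one compound, keyed by a
# single record identifier (here the CAS number quoted later in the talk).
SYNONYMS = {
    "171596-29-5": ["Tadalafil", "Cialis", "IC351"],
}


def normalize(name: str) -> str:
    """Lowercase and drop spaces/hyphens so 'IC 351' and 'IC351' collide."""
    return re.sub(r"[\s\-]+", "", name).lower()


# Reverse index: normalized name -> record identifier.
INDEX = {normalize(n): rec for rec, names in SYNONYMS.items() for n in names}


def resolve(name: str):
    """Return the record identifier for a name, or None if it is unknown."""
    return INDEX.get(normalize(name))


def expand_query(name: str):
    """Return every known synonym for a name, or just the name itself."""
    rec = resolve(name)
    if rec is None:
        return [name]
    return sorted(SYNONYMS[rec])
```

With a table like this, a text search for "IC 351" can be expanded to all three names before it is sent to a search engine, which is the idea behind the "disambiguation query" later in the talk.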

Page 24: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

CAS Registry – LOTS of Chemicals!

Page 25: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community
Page 26: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community
Page 27: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

The Final Search Strategy

A “Disambiguation Query!”

Page 28: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

All Those Names, One Structure

A problem to solve…

Page 29: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider - A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

Aggregate and integrate chemical structure data on the web – names, structures, links

Create a “structure-based hub” to information, data and algorithmic predictions

Let chemists contribute their own data

Allow the community to curate/correct data

Page 30: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

media.obsessable.com

As few interfaces as possible

What do humans want?

Page 31: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Aggregating Data – Who to Trust???

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 32: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Just “Public Compound” Databases

PubChem
Drugbank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors

Page 33: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Question Everything online: www.dhmo.org

Page 34: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Di-Hydrogen Monoxide

2H

Page 35: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Di-Hydrogen Monoxide

2H + 1O

Page 36: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Di-Hydrogen Monoxide

H2O

Page 37: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Di-Hydrogen Monoxide

H2O

Water

Page 38: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

It’s all on Wikipedia…

Page 39: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What About Gases? Methane…

Page 40: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What’s Methane?

Page 41: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What’s Methane?

Page 42: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

What ELSE is Methane???

Page 43: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Structural Data for Life Sciences

DailyMed

Page 44: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Lack of Stereochemistry

Page 45: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Incorrect Structures

Page 46: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Pragmatic Vision Delivered…

Aggregate, integrate and link data from across the internet

Almost 25 million structures from >300 data sources

Linked to vendors, literature, online databases (open and commercial), open notebook science, patents and….

Robotic and Crowdsourced Curation

Page 47: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Search “OEA”

Page 48: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Search OEA

Page 49: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Search OEA

Page 50: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Search OEA

Page 51: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Linked Patents for OEA

Page 52: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Answering Questions…

Questions a student might ask…

What is the structure of levulinic acid?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
How can I synthesize 2,4-dichlorophenol?
What are the safety handling issues for Thymol Blue?

Page 53: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Back to Cialis…

Page 54: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Cialis on ChemSpider : 1 hit

Chemicals are curated/validated on ChemSpider by ourselves and the community

Based on assertions from various sources. Iterative, time-consuming and exacting!

We believe we know the structure now

What is linked and available?

Page 55: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Google Patents

Page 56: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider – Patents Linked

SURECHEM PATENTS GOOGLE

Page 57: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Google Books

Page 58: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Microsoft Academic Search

Page 59: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Google Scholar – Articles were found by CAS Number!

Page 60: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Identifiers for Tadalafil

Page 61: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

How Many Articles in RSC Journals?

Based on 171596-29-5 there are 13 articles in RSC journals

What about if we VALIDATE identifiers?
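One narrow, mechanical kind of identifier validation is the CAS Registry Number check digit. This is not the curated name-structure validation the talk goes on to describe; it only catches malformed or mistyped numbers. A minimal sketch, with `cas_is_valid` as a hypothetical helper name:

```python
import re


def cas_is_valid(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number.

    The last digit must equal the weighted sum of the other digits modulo 10,
    where the rightmost of those digits has weight 1, the next weight 2, etc.
    """
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if m is None:
        return False  # wrong layout, e.g. stray spaces inside the number
    digits = m.group(1) + m.group(2)
    check = int(m.group(3))
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check
```

For the number on this slide, the weighted sum of 1-7-1-5-9-6-2-9 is 155, so the expected check digit is 5 and `cas_is_valid("171596-29-5")` returns `True`. A passing check digit says nothing about whether the number is attached to the right structure.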

Page 62: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Validated Dictionaries Hit APIs

This is data curation...

Page 63: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Does this generate more results?

Page 64: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

RSC Journals

Page 65: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

RSC Journals

REMEMBER 2+2 = 4

Page 66: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

PubMed

Page 67: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Google Scholar – Expanded Hit Set

Page 68: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Microsoft Academic Search

Page 69: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Microsoft Academic Search

Be careful! More mussels than drugs…

Page 70: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Searching Chemistry on the Internet

Do we get a complete result set if we search for “chemicals” only by name?

Is there a better way to link chemistry databases? Linking by “names” is dangerous

Chemists want structure and SUBstructure searching

Page 71: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Structure Searching the Web

We have resources about Tadalafil actively linked to ChemSpider

What about searching the web for Tadalafil by structure…not based on the various identifiers

How?

Page 72: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Link the Internet with InChIKeys!

Taken from: Rafael Sidis’ Blog

Page 73: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

The InChI Identifier

Page 74: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Multiple Layers

Page 75: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChIStrings Hash to InChIKeys

Page 76: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Cialis – Searching the Web by InChI

Search Molecular SKELETON

Search Full Molecule
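The two searches on this slide map onto the layout of an InChIKey: 14 letters hashed from the connectivity (the skeleton), then 8 letters hashed from the remaining layers such as stereochemistry, a standard/non-standard flag, a version letter and a final protonation letter. The sketch below only splits a key into those parts; it does not compute the hash, and the type and function names are assumptions made for illustration.

```python
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str      # 14 letters hashed from the connectivity
    detail: str        # 8 letters hashed from the remaining layers
    is_standard: bool  # flag letter 'S' marks a Standard InChIKey
    version: str       # version letter, 'A' for InChI version 1
    protonation: str   # single trailing letter


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a 27-character InChIKey into its documented blocks."""
    blocks = key.strip().upper().split("-")
    if [len(b) for b in blocks] != [14, 10, 1] or not all(b.isalpha() for b in blocks):
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    first, second, third = blocks
    return InChIKeyParts(first, second[:8], second[8] == "S", second[9], third)


def same_skeleton(a: str, b: str) -> bool:
    """Skeleton search: only the first block has to match."""
    return split_inchikey(a).skeleton == split_inchikey(b).skeleton


def same_molecule(a: str, b: str) -> bool:
    """Full-molecule search: the whole key has to match."""
    return a.strip().upper() == b.strip().upper()
```

Applied to the key shown later in the talk, `split_inchikey("RCINICONZNJXQF-MZXODVADSA-N").skeleton` is `"RCINICONZNJXQF"`. Searching the web for that first block alone is the "skeleton" search; searching for the full string is the exact match.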

Page 77: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChI Search the Web by Skeleton

78 Hits by Skeleton

Page 78: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChI Search the Web Exact Match

32 Hits by InChIKey

Page 79: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChI Search the Web Exact Match

6 Hits by Standard InChIKey

Page 80: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChIfying the Web

There are more than twice as many “skeleton” hits for Cialis as exact matches (78 vs. 32) – different stereo? Mistakes?

Our judgment…MISTAKES

Page 81: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Vancomycin – Search the Internet

Page 82: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Full Molecule Search: 4 Hits

Page 83: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Full Skeleton Search: 104 Hits

Page 84: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…

But what is the structure???

Page 85: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

We need an InChI “Resolver”

Page 86: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

InChI Resolver to DOIs

Structure Search the Web
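An InChIKey is a hash, so the structure cannot be recovered from the key itself; something has to store the key next to the full InChI and whatever is linked to it. A toy in-memory sketch of that idea follows. The class, its methods and the DOI string used in the usage note are hypothetical placeholders and do not describe the ChemSpider service.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    inchi: str                                 # full InChI string (the structure)
    dois: list = field(default_factory=list)   # articles linked to the structure


class InChIResolver:
    """Toy lookup table from InChIKey to InChI string and linked DOIs."""

    def __init__(self):
        self._by_key = {}

    def register(self, inchikey: str, inchi: str, dois=()):
        """Store a key/InChI pair and attach any DOIs not already recorded."""
        rec = self._by_key.setdefault(inchikey, Record(inchi))
        if rec.inchi != inchi:
            raise ValueError(f"{inchikey} is already registered with another InChI")
        for doi in dois:
            if doi not in rec.dois:
                rec.dois.append(doi)

    def resolve(self, inchikey: str):
        """Exact-match lookup; returns None for an unknown key."""
        return self._by_key.get(inchikey)

    def by_skeleton(self, first_block: str):
        """Return every registered key that shares the 14-letter first block."""
        return sorted(k for k in self._by_key if k.split("-")[0] == first_block)
```

For example, after `resolver.register("XLYOFNOQVPJJNP-UHFFFAOYSA-N", "InChI=1S/H2O/h1H2", ["10.0000/placeholder"])`, a call to `resolver.resolve("XLYOFNOQVPJJNP-UHFFFAOYSA-N")` returns the record for water. The key and InChI are the real ones for water; the DOI is a made-up placeholder.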

Page 87: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Semantic Markup: Project Prospect

Page 88: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Depends on Validated Dictionaries

Link to a Structure or the Right Structure?

Page 89: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Name-Structure Pairs

Page 90: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Through ChemSpider!

Page 91: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Unpublished Chemistry

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry” - never published and cannot be abstracted

Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found

Page 92: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Org Prep Daily (Blog)

Page 93: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider SyntheticPages

Page 94: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Submission process

Register as a user
Use the Submit button and fill in the fields…

Page 95: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Submission Process

Submissions reviewed by editorial board

Published as is or comments sent to author

Online Peer Review process

Data supported include web movies, images, live spectra etc.

Page 96: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Micro- and Nano-publications

Blogs, wiki entries and even Amazon book reviews are micro/nano-publications

ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume

Structures and spectra are nano-publications – these can also be tracked and referenced (depositions, curations etc.). Students participate in building one of the premier sources of chemistry data.

Page 97: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider : Spectra Linked

Page 98: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Spectra Linked

Page 99: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Spectra Linked

Page 100: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Not Just NMR Data

Page 101: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Page 102: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Spectral Game

Page 103: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Increasing Complexity

Page 104: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Spectral Game

Page 105: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

ChemSpider Content

ChemSpider is a container…supports multimedia

Spectra
Crystal structures
Images
MP3s
Videos

Page 106: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Roses’ Crystal Image Collection

Page 107: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

MP3s and Videos : Titanium

Page 108: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Periodic Table Images

Page 109: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

How Can You Help ChemSpider?

Deposit your data and share with the community

Structures – one or many
Spectra
Links
Syntheses into SyntheticPages

Curate data – most basic level…just add comments

Spread the word – ChemSpider is an untapped resource

Page 110: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Community Contribution

We can make a bigger contribution to the community if the community shares via ChemSpider

Don’t underestimate what others will find of value

ChemSpider wins “Community contribution” best practice award

Page 111: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Chemistry on the Internet

FUTURE

The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Page 112: ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community

Thank you

[email protected]: ChemSpiderman

www.chemspider.com/blog

SLIDES: www.slideshare.net/AntonyWilliams