ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Antony Williams, ACS Denver, August 30th 2011

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015

DESCRIPTION

With the intention of providing a high-quality, free internet resource of chemistry-related data for the community, ChemSpider has aggregated almost 25 million compounds linked out to over 400 data sources and provided a platform for the community to both deposit and curate data. This experiment in crowdsourcing for chemistry has now been running for over three years. This presentation will review a number of aspects of the project including (a) the level of community participation in depositing and curating data; (b) the nature of data and content supplied by the community; (c) how ChemSpider is used by the community; (d) using game-based systems to assist in data curation; (e) algorithmic-based approaches to data validation and filtering; and (f) sharing data curation efforts with other online databases.

TRANSCRIPT

Page 1: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Antony Williams, ACS Denver, August 30th 2011

Page 2: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What’s said on the web is true…

“We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding, currently in Rome (Italy).”

“This was identified as the new protein Wai So Dim (WSD).”

Page 5: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Who is Sandy Lawson? Ask Google

Page 6: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Who is Sandy..to me?

Mentor in computer-generated nomenclature Educational Technologist Innovator Ethical

“Gentleman Sandy”

Page 7: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What is the Structure of Vitamin K1?

Page 8: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider

The Free Chemical Database

A central hub for chemists to source information

>26 million unique chemical records

Aggregated from >400 data sources

Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions

A central hub for chemists to deposit & curate data

Page 9: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider general statements

ChemSpider: one of many important resources

The “Google and Wikipedia of Chemistry”

A vision of “Linking all chemistry on the internet”

Most people in this room probably know about it

New people discover us regularly

Our distinct roles are:

Hosting and exposing data for the community

Curating and validating chemistry-related data

Page 10: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

I want to know about “Vincristine”

If all algorithms work then everything on the page is correct by default except the name!

Page 12: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Vincristine: Identifiers and Properties

Page 14: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Vincristine: Vendors and Sources

Page 15: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Vincristine: Patents

Page 16: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Vincristine: Articles

Page 17: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Searches: The INTERNET

All ChemSpider and Internet searches are “simply algorithms” but synonym searching is based on an assertion

Page 18: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

InChIs

Page 19: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Validated Names for Searching…

Page 20: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What you might not know about Chemistry Databases on the Internet

Data-sharing between the databases is cyclic, proliferating errors: “Linked Data”

Page 21: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What you might not know about Chemistry Databases on the Internet

Some public databases are “trusted” as primary sources

Trust is granted without investigation or understanding of the content

Page 22: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.

Page 24: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What do we know about some of the online resources?

Page 25: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

PHYSPROP Database

The freely downloadable database under the EPI Suite prediction software

Very basic filters suggest data quality issues

Page 26: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

The Stereochemistry challenge

12,500 chemicals with “missed” stereo

Page 27: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

NIST Webbook

Page 28: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

PubChem

Page 29: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What you might not know about Chemistry Databases on the Internet

Make sure you blame the database hosts!!! (???)

Errors are primarily deposited and inherited by the data suppliers

Chemistry databases depend enormously on structure representations…

Page 33: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

What you might not know about Chemistry Databases on the Internet

Despite all of the blog posts, lectures, presentations and pleas it’s not improving

Page 34: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

NPC Browser http://tripod.nih.gov/npc/

Page 38: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Patents

Page 40: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

WYSIWYG compounds

Page 42: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

But ChemSpider is curated, right?

Page 43: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Originally 15 compounds “called” Yohimbine

54 skeletons for Yohimbine

Page 44: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

All aggregators suffer dilution!

Page 45: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points, spectra

Page 46: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Curating Melting Point Data

http://tinyurl.com/3e44vbx

Page 47: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Melting Point Validation Work

Page 48: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Some melting points can’t be resolved only with literature: 4-benzyltoluene

Page 49: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra

The crowd in crowdsourcing is… generally small

Which of the large databases are doing careful curation? How can we share the workload?

Page 50: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider can “do it” for us

ChemSpider provides a curation interface

All curation activities are available for review, online immediately, iteratively checked

Curators have different abilities based on their profile: There are only a few “Master Curators”.

Can we “share” the curation workload?

Page 51: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database already has a compound, then use InChIKeys to match the “suggested” validation against the compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.
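The exchange described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ChemSpider's implementation: the feed record layout (`inchikey`, `action`, `synonym`), the function name `apply_synonym_feed`, and the placeholder keys such as `KEY-A` are all hypothetical.

```python
from collections import defaultdict


def apply_synonym_feed(local_db, feed):
    """Apply "added"/"removed" synonym records to compounds we already hold.

    local_db: dict mapping InChIKey -> set of synonyms (hypothetical layout).
    feed: iterable of dicts with keys "inchikey", "action", "synonym".
    Returns a dict listing which records were applied and which were skipped.
    """
    report = defaultdict(list)
    for record in feed:
        key = record["inchikey"]
        if key not in local_db:
            # Only compounds already present in the local database are touched.
            report["skipped"].append(record)
            continue
        synonyms = local_db[key]
        if record["action"] == "added":
            synonyms.add(record["synonym"])
        elif record["action"] == "removed":
            synonyms.discard(record["synonym"])
        else:
            # Unknown action: leave the compound unchanged.
            report["skipped"].append(record)
            continue
        report["applied"].append(record)
    return dict(report)


# Placeholder keys stand in for real InChIKeys.
local_db = {"KEY-A": {"aspirin", "wrong name"}, "KEY-B": {"caffeine"}}
feed = [
    {"inchikey": "KEY-A", "action": "removed", "synonym": "wrong name"},
    {"inchikey": "KEY-A", "action": "added", "synonym": "acetylsalicylic acid"},
    {"inchikey": "KEY-C", "action": "added", "synonym": "unknown compound"},
]
report = apply_synonym_feed(local_db, feed)
```

In practice a receiving database might queue these records as suggestions for a curator rather than apply them directly; the sketch applies them immediately only to keep the example short.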

Page 52: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Proof of Concept Data Curation Sharing

Page 53: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Structure Validation using feed

Look for approved synonyms

Compare feed InChIKey with database InChIKey

If different, flag for inspection
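The three steps above might look something like this in code. It is a minimal sketch: the function name `flag_structure_mismatches`, the `(synonym, InChIKey)` feed layout, and the placeholder keys are assumptions made for illustration.

```python
def flag_structure_mismatches(feed, name_index):
    """Return feed entries whose InChIKey disagrees with ours for the same name.

    feed: iterable of (approved_synonym, feed_inchikey) pairs.
    name_index: dict mapping lower-cased name -> InChIKey in the local database.
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = name_index.get(synonym.lower())
        if db_key is None:
            # Name is not in the local database, so there is nothing to compare.
            continue
        if db_key != feed_key:
            flagged.append({"name": synonym, "feed": feed_key, "database": db_key})
    return flagged


# Placeholder keys stand in for real InChIKeys.
name_index = {"vincristine": "KEY-1", "yohimbine": "KEY-2"}
feed = [("Vincristine", "KEY-1"), ("Yohimbine", "KEY-9"), ("Taxol", "KEY-3")]
flagged = flag_structure_mismatches(feed, name_index)
```

Flagged entries are only candidates for inspection: a mismatch says the two sources disagree, not which one is right.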

Page 54: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Who will participate???

Page 55: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Batch Validation Also Works!

Batch validation of name-structure relationships

“Background Processing framework”

Hexamethylchickenwire Chloride = C12H23O5

Define a set of synonym filters and process the entire backfile. We will also use synonym filters at deposition.
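One kind of synonym filter could check whether a name implies an element that the molecular formula does not contain, as in the "Chloride" example with formula C12H23O5. The sketch below is an assumed, deliberately crude filter, not the one ChemSpider uses; the fragment table and function names are hypothetical, and substring matching will produce false positives on real names.

```python
import re

# Hypothetical name fragments and the element symbol each one implies.
NAME_IMPLIES_ELEMENT = {
    "chlor": "Cl",
    "brom": "Br",
    "fluor": "F",
    "iod": "I",
}


def elements_in_formula(formula):
    """Return the set of element symbols in a simple molecular formula."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def name_conflicts_with_formula(name, formula):
    """List elements that the name implies but the formula lacks."""
    present = elements_in_formula(formula)
    lowered = name.lower()
    return sorted(
        element
        for fragment, element in NAME_IMPLIES_ELEMENT.items()
        if fragment in lowered and element not in present
    )


conflicts = name_conflicts_with_formula("Hexamethylchickenwire Chloride", "C12H23O5")
```

A non-empty result would mark the name-structure pair for review; the same function could run over the backfile in batches and again at deposition time.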

Page 58: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Community Contribution to ChemSpider

ChemSpider as a host for community contributions

Curation and validation input

Structures

Movies

Images

Analytical data, especially spectra

Page 59: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Spectra

Page 60: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Page 61: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Spectral Game

Page 62: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Data Curation

Page 63: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Reversed Spectrum

Page 64: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Download, reprocess, redeposit

Page 65: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

True Curation of Data

Page 66: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Batch-wise validation of NMR data

Page 67: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Automated C13 Verification

Page 68: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Mixture Identified

Page 69: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

NMR Verification

H1 NMR: 77% of spectra consistent

C13 NMR: 67% of spectra consistent

Algorithms NOT perfect but did identify:

Misreferenced data

Reversed spectra

22 mixtures

Poor signal-to-noise (missing peaks)
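As a rough illustration of how an automated check might separate a consistent spectrum from a misreferenced one, the sketch below compares predicted and observed chemical shifts. Everything in it is an assumption made for the example: the function names, the 3 ppm tolerance, the 0.8 threshold, and the simple constant-offset test. It is not the verification algorithm used in the work described.

```python
from statistics import median


def match_fraction(predicted, observed, tolerance=3.0):
    """Fraction of predicted shifts (ppm) matched by an unused observed peak."""
    remaining = sorted(observed)
    matched = 0
    for shift in sorted(predicted):
        candidates = [p for p in remaining if abs(p - shift) <= tolerance]
        if candidates:
            # Use the closest peak once, then remove it from the pool.
            best = min(candidates, key=lambda p: abs(p - shift))
            remaining.remove(best)
            matched += 1
    return matched / len(predicted) if predicted else 0.0


def diagnose(predicted, observed, tolerance=3.0, threshold=0.8):
    """Classify a spectrum as consistent, possibly misreferenced, or inconsistent."""
    if match_fraction(predicted, observed, tolerance) >= threshold:
        return "consistent"
    if predicted and len(predicted) == len(observed):
        # A constant offset that restores agreement suggests misreferencing.
        offset = median(o - p for p, o in zip(sorted(predicted), sorted(observed)))
        shifted = [o - offset for o in observed]
        if match_fraction(predicted, shifted, tolerance) >= threshold:
            return "possibly misreferenced"
    return "inconsistent"


predicted = [20.0, 128.0, 170.0]
good = diagnose(predicted, [21.0, 127.5, 171.0])
offset_by_ten = diagnose(predicted, [30.0, 138.0, 180.5])
unrelated = diagnose(predicted, [55.0, 60.0, 200.0])
```

Reversed spectra, mixtures and missing peaks would each need their own tests; the sketch covers only the misreferencing case.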

What about 2D NMR verification?

Page 70: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider ID 24528095 HHCOSY

Page 71: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider ID 24528095 HSQC

Page 72: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Crowdsourced Spectral Data

Spectral data available at http://www.chemspider.com/spectra.aspx

Regular data depositions

Generally licensed as Open Data

Chemical vendors now contributing spectral data: up to 800 spectra presently being acquired

All data welcomed. Who will they benefit? www.SpectralGame.com and http://spectraschool.rsc.org/

Page 73: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

SpectraSchool

Page 75: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Community Contribution to ChemSpider

ChemSpider as a host for community contributions

Curation and validation input

Analytical data, especially spectra

Movies, images

Is it just structures?

ChemSpider SyntheticPages as a host for reaction syntheses

Page 76: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider SyntheticPages

Page 77: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

ChemSpider SyntheticPages

Page 78: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Submission Process

Simple template-based submission process

Submissions reviewed by editorial board. Published as is or comments sent to author

Online Peer Review process

Supported data include web movies, images, live spectra, etc.

DOI issued to author

Page 79: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Is it working? Show of hands…

How many of you know CSSP? Have any of you submitted to CSSP?

Low submissions but some dedicated authors

It is NOT a technology issue

Students need permission to publish

Publishing syntheses might prevent publication

CSSP would grow if we abstracted supp. info: templated supp. info submissions could help.

Page 81: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Crowdsourcing – does it work?

Only 131 people have EVER either deposited or curated data on ChemSpider

ChemSpider SyntheticPages has a small group of dedicated authors

Database hosts and vendors make the largest contributions of data

ChemSpider staff do the most curation

Page 82: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

If it was not just about me…

We might have a community built encyclopedia

I might know where the best restaurants are

I might get good advice on books to read

I might know which movies to watch

I might know which plumber to call

Data might just be Open

Page 84: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

How will it improve?

Participation and contribution

Page 85: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

RSC’s LearnChemistry:Share

Page 86: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Improved Quality of data is essential

Open PHACTS: partnership between European Community and EFPIA

Freely accessible for knowledge discovery and verification.

Data on small molecules

Pharmacological profiles

ADMET data

Biological targets and pathways

Proprietary and public data sources.

Page 87: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Conclusions

ChemSpider has an important role in quality data

Crowdsourced deposition, validation and curation works but low engagement to date

Primary challenge – engaging the community to help create what they want. Rewards and recognition?

MORE collaboration can benefit us all

All indicators are good for continued growth

Page 88: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Acknowledgments

The ChemSpider team

Craig Knox, DrugBank

Our data providers, depositors, collaborators and curators

Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Page 89: ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Thank you

Email: [email protected]

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams