Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs

Antony Williams, ACS Denver, September 2011

Upload: antony-williams-chemconnector-orcid-0000-0002-2668-4821

Post on 30-Jun-2015

DESCRIPTION

Internet-based public domain databases containing chemical compounds have grown in number, capability and content in recent years. There are now many databases containing millions of chemical compounds associated with different types of data including chemical names, properties, analytical data, and with associated mapping to proteins, assay data, clinical information and so on. These disparate data sources suffer from one common issue – quality of data. This presentation will provide an overview of our efforts to source the appropriate structural representations for 200 top-selling drugs from public domain sources. This intra- and inter-laboratory comparison of approaches, processes and necessary agreements exposed the challenges associated with aggregating structure-based data. The project also provided data regarding the distribution of quality issues associated with many of the community’s popular databases.

TRANSCRIPT

Page 1

Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs

Antony Williams

ACS Denver

September 2011

Page 2

Upfront Acknowledgment - All Authors…

Royal Society of Chemistry – Antony Williams, David Sharpe

University of North Carolina, Chapel Hill – Alex Tropsha, Denis Fourches, Eugene Muratov, Andrew Fant

Chemotargets SL – Ricard Garcia-Serna

IMIM-Hospital del Mar Research Institute and Universitat Pompeu Fabra – Jordi Mestres

AstraZeneca – Sorel Muresan, Christopher Southan

ACD/Labs – Andrey Erin

Page 3

Internet-Based Chemistry

Internet-based chemistry resources are:

  Diverse in quality
  Confusing
  Uncoordinated
  Fixable – with a lot of effort

Page 4
Page 5

Open PHACTS: partnership between the European Community and EFPIA

Freely accessible for knowledge discovery and verification.

Data on small molecules
  Pharmacological profiles
  Pharmacokinetics
  ADMET data
  Biological targets and pathways

Proprietary and public data sources.

Page 6

Stop Whining – Fix it

Page 7

What needs to happen?

Standards
  Standardization of structures
  ChEBI/PubChem sharing
  InChI adoption

Collaboration
  Stop reinventing the wheel
  Share data, share efforts and speed the process

Vision is not good enough – Execute!

Page 8

Standards: Structure Standardization

Page 9

Standards: Structure Standardization

Page 10

Standards: Structure Standardization

Page 11

Collaboration

Page 12

Then this won’t happen…

Page 13
Page 14

Top 200 Drugs on Wikipedia

http://en.wikipedia.org/wiki/List_of_bestselling_drugs

Page 15

The Project Challenge PART ONE

Agree on the set of chemical names to work with

Independently create an SDF file in each “lab”

Compare differences and agree on final structures

Issue “Gold Standard” SDF file to team
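
The compare-and-agree step can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual tooling: it assumes each lab's SDF file has already been reduced to a mapping from generic drug name to a structure identifier (for example a standard InChIKey). The lab names, drug names and identifiers below are hypothetical placeholders.

```python
from collections import defaultdict


def compare_labs(submissions):
    """Group labs by the identifier each one assigned to a drug name.

    submissions maps lab name -> {drug name -> structure identifier}.
    Returns {drug name -> {identifier -> sorted list of labs}}.
    """
    by_drug = defaultdict(lambda: defaultdict(list))
    for lab, records in submissions.items():
        for drug, identifier in records.items():
            by_drug[drug][identifier].append(lab)
    return {
        drug: {ident: sorted(labs) for ident, labs in groups.items()}
        for drug, groups in by_drug.items()
    }


def disputed(submissions):
    """Return the drug names that need discussion.

    A name is disputed when the labs reported more than one identifier
    for it, or when at least one lab did not report it at all.
    """
    n_labs = len(submissions)
    result = []
    for drug, groups in compare_labs(submissions).items():
        unanimous = len(groups) == 1 and len(next(iter(groups.values()))) == n_labs
        if not unanimous:
            result.append(drug)
    return sorted(result)


# Hypothetical input: placeholder names and identifiers, not project data.
submissions = {
    "lab_a": {"drug_1": "KEY-AAA", "drug_2": "KEY-BBB"},
    "lab_b": {"drug_1": "KEY-AAA", "drug_2": "KEY-CCC"},
    "lab_c": {"drug_1": "KEY-AAA"},
}
print(disputed(submissions))
```

Only the disputed names then need manual review before a consensus file is issued.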

Page 16

The Project Challenge PART TWO

Use the Gold Standard SDF file to investigate data quality for these compounds in Internet databases

Two checks:

  Search by chemical name – does it return the correct compound? If not, how is it different?
  Search by “structure” – SMILES, Molfile, InChI string or InChIKey
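
One way to automate the "how is it different?" question is to compare standard InChIKeys block by block. The sketch below is an assumption about how such a check could look, not the procedure used in the project. It relies on the layout of a standard InChIKey: the first 14-character block hashes the molecular skeleton (connectivity), and the second block covers the remaining layers such as stereochemistry and isotopic substitution. Only the aspirin key is a real identifier; the other keys are made-up strings for illustration, and `classify_hit` is a hypothetical helper name.

```python
def classify_hit(gold_key, found_key):
    """Classify a database hit against the gold-standard InChIKey."""
    if found_key is None:
        return "not found"
    if found_key == gold_key:
        return "exact match"
    # The first hyphen-separated block encodes the connectivity hash.
    if found_key.split("-")[0] == gold_key.split("-")[0]:
        return "same skeleton, other layers differ"
    return "different compound"


# Real standard InChIKey for aspirin.
GOLD = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

# Made-up keys, used only to exercise each branch.
SAME_SKELETON = "BSYNRYMUTXBXSQ-ZZZZZZZZSA-N"
OTHER = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"

for key in (GOLD, SAME_SKELETON, OTHER, None):
    print(key, "->", classify_hit(GOLD, key))
```

A "same skeleton" result is the case most relevant to drugs with stereocentres, where a database may hold the right connectivity with missing or wrong stereochemistry.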

Page 17

200 Top-Selling Drugs (2006)

Biologicals removed immediately

Single compounds versus mixtures identified

Decision to NOT exclude racemates

List of 152 drugs to analyze

Generic names used

Page 18

Different Approaches

ACD/Labs – Curated commercial dictionary

RSC|ChemSpider and UNC Chapel Hill – manual curation

Chemotargets/IMIM – lookup against database

AstraZeneca – lookup against database

Page 19

Different Approaches

Page 20

Different Approaches

Page 21

Different Approaches

Page 22

Different Approaches

Page 23

Choose a Starting Point

Page 24

Comparisons

Page 25

Observations

Manual curation – slow and imperfect process
  A loop of assertions
  Software tool issues

Lookup – fast and imperfect
  Totally dependent on initial investment in time

InChIs
  Very useful for comparison
  Imperfect
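
Because an InChI string is built from slash-separated layers, a comparison can report which layers differ rather than only whether two strings match. The following is a minimal sketch under simplifying assumptions: it handles the common case where each layer prefix appears once, and a later repeat of a prefix (possible in isotopic or fixed-hydrogen layers) overwrites the earlier one. The function names are hypothetical, and the two alanine strings are illustrative inputs, one without and one with stereo layers.

```python
def inchi_layers(inchi):
    """Split an InChI string into a {layer prefix: layer text} mapping."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Simplification: the second field is treated as the formula.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # The first character names the layer (c, h, t, m, s, ...).
            layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted layer names whose content differs or is missing."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# Alanine without stereo layers, and with a defined stereocentre.
flat = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

print(differing_layers(flat, stereo))
```

Here the two strings agree on formula, connectivity and hydrogens and differ only in the stereo layers (`t`, `m`, `s`), which is the kind of disagreement a plain string comparison would hide.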

Page 26

Structure Representations

Page 27

Representing Racemates

Page 28

Representing Racemates - Formoterol

Page 29

Racemic Mixtures

Page 30

Racemic Mixtures

X

Page 31

“The First 10”

Page 32

Collaboration on Curation

If we could collaborate on curation… share through standards and open interfaces

Page 33

Proof of Concept Data Curation Sharing

Page 34

SciDBs.com (Coming soon)

Page 35

Conclusions

It is DIFFICULT to aggregate high-quality structure datasets of even common drugs!
InChI is very enabling, but enhanced stereo is necessary
Is there a need to be “right”?

Publication will provide:
  Recommendations for structure standardization
  Rank ordering of resources
  Suggestions for InChI enhancement
  SDF file
  Curation feed of structures and synonyms

Page 36

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams