structure representations in public chemistry databases: the challenges of validating the chemical...
DESCRIPTION
Internet-based public domain databases containing chemical compounds have grown in number, capability and content in recent years. There are now many databases containing millions of chemical compounds associated with different types of data including chemical names, properties, analytical data, and with associated mapping to proteins, assay data, clinical information and so on. These disparate data sources suffer from one common issue – quality of data. This presentation will provide an overview of our efforts to source the appropriate structural representations for 200 top-selling drugs from public domain sources. This intra- and inter-laboratory comparison of approaches, processes and necessary agreements exposed the challenges associated with aggregating structure-based data. The project also provided data regarding the distribution of quality issues associated with many of the community’s popular databases.
TRANSCRIPT
Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs
Antony Williams
ACS Denver
September 2011
Upfront Acknowledgment - All Authors…
Royal Society of Chemistry – Antony Williams, David Sharpe
University of North Carolina, Chapel Hill – Alex Tropsha, Denis Fourches, Eugene Muratov, Andrew Fant
Chemotargets SL – Ricard Garcia-Serna
IMIM-Hospital del Mar Research Institute and Universitat Pompeu Fabra – Jordi Mestres
AstraZeneca – Sorel Muresan, Christopher Southan
ACD/Labs – Andrey Erin
Internet-Based Chemistry
Internet-based chemistry resources are:
Diverse in quality
Confusing
Uncoordinated
Fixable – with a lot of effort
Open PHACTS : partnership between European Community and EFPIA
Freely accessible for knowledge discovery and verification.
Data on small molecules
Pharmacological profiles
Pharmacokinetics
ADMET data
Biological targets and pathways
Proprietary and public data sources.
Stop Whining – Fix it
What needs to happen?
Standards
Standardization of structures
ChEBI/PubChem sharing
InChI adoption
Collaboration
Stop reinventing the wheel
Share data, share efforts and speed the process
Vision is not good enough – Execute!
Standards : Structure Standardization
Collaboration
Then this won’t happen…
Top 200 Drugs on Wikipedia
http://en.wikipedia.org/wiki/List_of_bestselling_drugs
The Project Challenge PART ONE
Agree on the set of chemical names to work with
Independently create an SDF file in each “lab”
Compare differences and agree on final structures
Issue “Gold Standard” SDF file to team
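The compare-and-agree step above can be sketched in Python. This is only an illustration, not the workflow the team actually used: it assumes each lab's SDF file has already been reduced to a mapping from generic drug name to a structure identifier such as an InChIKey. The function name, lab names, drug names and key strings are all hypothetical placeholders.

```python
def compare_labs(lab_keys):
    """Split drugs into those every lab agrees on and those needing review.

    lab_keys: {lab_name: {drug_name: structure_key}}
    Returns (agreed, disputed):
      agreed   maps drug -> the single key all labs reported
      disputed maps drug -> {lab: key, or None if the lab has no entry}
    """
    # Collect every drug name mentioned by any lab.
    drugs = set()
    for keys in lab_keys.values():
        drugs.update(keys)

    agreed, disputed = {}, {}
    for drug in sorted(drugs):
        votes = {lab: keys.get(drug) for lab, keys in lab_keys.items()}
        distinct = set(votes.values())
        if len(distinct) == 1 and None not in distinct:
            agreed[drug] = distinct.pop()
        else:
            # Either the labs disagree or at least one lab has no structure.
            disputed[drug] = votes
    return agreed, disputed


# Placeholder data: three hypothetical labs, two hypothetical drugs.
labs = {
    "lab_a": {"drug_one": "KEY-1", "drug_two": "KEY-2"},
    "lab_b": {"drug_one": "KEY-1", "drug_two": "KEY-2X"},
    "lab_c": {"drug_one": "KEY-1"},
}
agreed, disputed = compare_labs(labs)
```

Only the entries in `agreed` would go straight into a gold-standard file. Everything in `disputed` still needs the manual discussion described above.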
The Project Challenge PART TWO
Use Gold Standard SDF File to investigate data quality on these compounds in Internet Databases
Two checks
Search chemical name – does it return the correct compound? If not correct, how is it different?
Search “structure” – SMILES, Molfile, InChI string or InChIKey
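One way to approach the "how is it different?" question is to compare InChIKeys block by block. The sketch below relies on the standard InChIKey layout, in which the first 14-letter block is a hash of the connectivity and the second block covers the remaining layers, such as stereochemistry and isotopes. The function name and the example keys are hypothetical placeholders, not real compounds. Because the blocks are hashes, a matching first block suggests the same skeleton but does not prove it.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_hit(reference_key, hit_key):
    """Describe how a database hit relates to the gold-standard InChIKey."""
    for key in (reference_key, hit_key):
        if not INCHIKEY_RE.match(key):
            raise ValueError(f"not a well-formed InChIKey: {key!r}")
    if reference_key == hit_key:
        return "exact match"
    # Same first block: same connectivity hash, but another layer
    # (for example stereochemistry) or the protonation flag differs.
    if reference_key[:14] == hit_key[:14]:
        return "same connectivity, differs in another layer"
    return "different compound"


# Invented keys for illustration only; they do not identify real structures.
REFERENCE = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
STEREO_VARIANT = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
UNRELATED = "DDDDDDDDDDDDDD-BBBBBBBBSA-N"

results = [
    classify_hit(REFERENCE, REFERENCE),
    classify_hit(REFERENCE, STEREO_VARIANT),
    classify_hit(REFERENCE, UNRELATED),
]
```

A hit in the middle category is the kind of case the name search can turn up: the right skeleton, but drawn with missing or different stereochemistry.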
200 Top-Selling Drugs (2006)
Biologicals removed immediately
Single compounds versus mixtures identified
Decision to NOT exclude racemates
List of 152 drugs to analyze
Generic names used
Different Approaches
ACD/Labs – Curated commercial dictionary
RSC|ChemSpider and UNC Chapel Hill – manual curation
ChemoTargets/IMIM – lookup against database
AstraZeneca – lookup against database
Choose a Starting Point
Comparisons
Observations
Manual curation – slow and imperfect process
A loop of assertions
Software tool issues
Lookup – fast and imperfect
Totally dependent on initial investment in time
InChIs
Very useful for comparison
Imperfect
Structure Representations
Representing Racemates
Representing Racemates - Formoterol
Racemic Mixtures
“The First 10”
Collaboration on Curation
If we could collaborate on curation…share through standards and open interfaces
Proof of Concept Data Curation Sharing
SciDBs.com (Coming soon)
Conclusions
It is DIFFICULT to aggregate high quality structure datasets of even common drugs!
InChI is very enabling but enhanced stereo necessary
Is there a need to be “right”?
Publication will provide:
Recommendations for structure standardization
Rank ordering of resources
Suggestions for InChI enhancement
SDF file
Curation feed of structures and synonyms
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams