
Mining public domain data as a basis for drug repurposing

Antony J Williams, Sean Ekins and Valery Tkachenko

ACS Philadelphia August 2012

http://tinyurl.com/d6wodsl

Drug Repurposing

Drug repurposing commonly means data reexamination also!

Lots of data mining occurs

Then more screening, which creates more data...

LOTS of public databases used to examine repurposing…

A LOT of data coming online

http://4.bp.blogspot.com/-TO45ti6gll4/Ts9q3RyqSOI/AAAAAAAABa0/1wUhBa164K8/s1600/unichem.png

Interlinked on the semantic web

Where do you get your data?

Databases?

Patents?

Papers?

Your own lab?

Collaborators?

All of the above?

What is likely common to all sources? Data quality issues. There is no perfect database.

Public Domain Databases

Our databases are a mess…

Non-curated databases are proliferating errors

We source and deposit data between databases

Original sources of errors hard to determine

Curation is time-consuming and challenging

Availability of libraries of FDA drugs

Johns Hopkins Clinical Compound Library: made compounds available at cost

The FDA Drug Database

The DailyMed Database

Government Databases Should Come With a Health Warning

Williams and Ekins, DDT, 16: 747-750 (2011)

What is Neomycin?

Not this…

Substructure | # of hits | # of correct hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane | 34 | 5 | 8 | 21 | 0
Gon-4-ene | 55 | 12 | 3 | 33 | 7
Gon-1,4-diene | 60 | 17 | 10 | 23 | 10

Williams, Ekins and Tkachenko, Drug Disc Today 17: 685-701 (2012)

Data Errors in the NPC Browser: Analysis of Steroids
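The counts in the steroid table above can be turned into error rates with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the `summarize` helper are assumptions introduced here, and the numbers are copied from the table.

```python
# Counts transcribed from the steroid table above.
# Tuple order: (hits, correct, no stereo, incomplete stereo, incorrect stereo)
STEROID_HITS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def summarize(counts):
    """Return flawed-hit count and percent correct for one substructure row."""
    hits, correct, no_stereo, incomplete, incorrect = counts
    flawed = no_stereo + incomplete + incorrect
    # The four categories should account for every hit in the row.
    if correct + flawed != hits:
        raise ValueError("category counts do not add up to the number of hits")
    return {
        "hits": hits,
        "correct": correct,
        "flawed": flawed,
        "percent_correct": round(100 * correct / hits, 1),
    }


SUMMARY = {name: summarize(counts) for name, counts in STEROID_HITS.items()}

for name, row in SUMMARY.items():
    print(f"{name}: {row['correct']}/{row['hits']} correct "
          f"({row['percent_correct']}%), {row['flawed']} with stereochemistry problems")
```

Each row's categories sum to its hit count, and fewer than a third of the hits in any row are correct.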

Drug Disambiguation Project

NCATS Discovering “New Therapeutic Uses for Existing Molecules”

58 molecule names and identifiers. Where are the “structures”?

NCATS dataset

• Several groups tried to collate molecules
• Chris Lipinski provided approximately 30 unique molecules
• Simple molecule descriptors show no difference between compounds classified as discontinued (n = 15) or those in clinical trials (n = 14)
• Where is the definitive set of publicly accessible molecules for computational repurposing and analysis?

Drug structure quality is important...

Many groups ARE doing in silico repositioning

Integrating or using sets of FDA drugs... and if structures are incorrect, predictions will be too

Where is the definitive set of FDA approved drugs with correct structures?

Ideally we need linkage between in vitro data and clinical data

We have a problem…

Lots of data available but quality is suspect

Errors proliferate database to database

Data continues to flow in unabated

When errors are identified, they are hard to get fixed!

Data licensing is confusing – “Open Data”

We are “takers” not “givers” mostly…

Standards are lacking:

Data licensing

Data processing – structure standardization
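One simple way to surface errors that have spread from database to database is to group records by drug name and flag any name that maps to more than one structure. The sketch below assumes hypothetical inputs throughout: the source names (`db_a`, `db_b`, `db_c`) and the structure identifiers (`STRUCT-001` and so on) are placeholders, not real database entries, and a real comparison would use a standardized identifier produced after structure standardization.

```python
from collections import defaultdict

# Hypothetical records: (source database, drug name, structure identifier).
# All values are placeholders for illustration only.
RECORDS = [
    ("db_a", "Neomycin", "STRUCT-001"),
    ("db_b", "neomycin", "STRUCT-001"),
    ("db_c", "Neomycin ", "STRUCT-002"),
    ("db_a", "Paclitaxel", "STRUCT-010"),
    ("db_b", "Paclitaxel", "STRUCT-010"),
]


def find_conflicts(records):
    """Return names whose sources disagree on the structure identifier."""
    by_name = defaultdict(lambda: defaultdict(set))
    for source, name, structure_id in records:
        # Normalize the name so trivial case/whitespace differences do not hide a match.
        key = name.strip().lower()
        by_name[key][structure_id].add(source)
    return {
        name: {sid: sorted(sources) for sid, sources in structures.items()}
        for name, structures in by_name.items()
        if len(structures) > 1
    }


CONFLICTS = find_conflicts(RECORDS)

for name, structures in CONFLICTS.items():
    print(f"{name}: {len(structures)} different structures")
    for structure_id, sources in structures.items():
        print(f"  {structure_id} reported by {', '.join(sources)}")
```

A report like this does not say which structure is right; it only shows where curation effort is needed and which sources to contact.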

• Let’s agree that collaboration and crowdsourcing can help
• Provide SIMPLE ways to provide feedback
• Contribute when possible – databases should provide feedback mechanisms
• Adopt standards for structure handling and representation
• Adopt standards for data interchange
• Allow machine handling of data – use the power of the semantic web

So what needs to happen to improve?

Williams, Ekins and Tkachenko, Drug Disc Today 17: 685-701 (2012)

Collaboration on Curation

Collaborate on curation…share through standards and open interfaces

All DBs should take comments!

Standardize

Use the SRS as guidance for standardization

“Appify” curation and collaboration

• The data network is complex
• “Appify” collaboration and curation networks
• Increasing crowdsourcing role for data analysis

Ekins & Williams, Pharm Res, 27: 393-395, 2010.

Mobile Apps for Drug Discovery

Open Drug Discovery Teams

Free iOS app used to expose repurposing data

All of this data has been tweeted http://tinyurl.com/6l9qy4f

Ekins, Clark and Williams, Mol Informatics, in Press 2012

Open Drug Discovery Teams

Gather stakeholders. Decide if goals are primarily scientific, commercial or mixed.

Explore benefits of open licensing and drawbacks of enclosure. Hold closely to open definitions and standards. Do not write your own IP licenses!

Provide simple explanations for terms of use. Use metadata to indicate licensing terms explicitly - the Creative Commons Rights Expression Language is a good tool.

Do not lock up metadata. If you can’t make the data public domain, make the metadata public domain.
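As a rough illustration of indicating licensing terms explicitly in metadata, the sketch below generates a small HTML fragment that marks up a license link with `rel="license"`, the core pattern used by the Creative Commons Rights Expression Language. The helper name `license_snippet` and the dataset title are hypothetical; the CC0 URL is the standard Creative Commons public domain dedication.

```python
from html import escape

# Standard URL of the CC0 1.0 public domain dedication.
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def license_snippet(title, license_url, license_name):
    """Build an HTML fragment that states the license in machine-readable form."""
    return (
        '<div xmlns:dct="http://purl.org/dc/terms/">'
        f'<span property="dct:title">{escape(title)}</span> '
        "is made available under "
        f'<a rel="license" href="{escape(license_url, quote=True)}">'
        f"{escape(license_name)}</a>."
        "</div>"
    )


# "Example compound set" is a hypothetical dataset title.
SNIPPET = license_snippet("Example compound set", CC0_URL, "CC0 1.0")
print(SNIPPET)
```

Because the license is a link with a known relation rather than free text, software that harvests the page can determine the terms of use without a human reading them.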

Simple Rules for licensing “open” data

Williams, Wilbanks and Ekins, PLoS Comput. Biol., in press, Sept. 2012

Open PHACTS Project

Develop a set of robust standards…

Implement the standards in a semantic integration hub

Deliver services to support drug discovery programs in pharma and public domain

22 partners, 8 pharmaceutical companies, 3 biotechs

36-month project

Guiding principle is open access, open usage, open source

- Key to standards adoption -

To facilitate THIS process!

What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known pathways?
Working on now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?

It’s not JUST structures of course…

Taxol: Paclitaxel Bioassay Data

Most bioassay data is associated with a structure with one ambiguous stereocenter

Dispensing process | Hydrophobic features (HPF) | Hydrogen bond acceptor (HBA) | Hydrogen bond donor (HBD) | Observed vs. predicted IC50 (r)
Acoustic mediated process | 2 | 1 | 1 | 0.92
Disposable tip mediated process | 0 | 2 | 1 | 0.80

Data from 2 AstraZeneca patents: Ephrin pharmacophores were developed using data for 14 compounds with IC50 values. Different dispensing methods give different results. This affects the hypotheses and could impact drug discovery.

Ekins, Olechno and Williams, Submitted 2012

[Figure: pharmacophore models, panels "Acoustic" and "Disposable tip"]

Measuring data: dispensing dependencies

Acoustically-derived IC50 values were 1.5 to 276.5-fold lower than for tip-based dispensing
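The fold difference quoted above is a ratio of paired IC50 values for the same compound measured with the two dispensing methods. The sketch below shows that arithmetic only: the compound names and IC50 values are hypothetical placeholders, not data from the patents, and the helper name `fold_lower` is introduced here for illustration.

```python
def fold_lower(tip_ic50, acoustic_ic50):
    """Return how many times lower the acoustic IC50 is than the tip-based IC50.

    Both values must be positive and in the same units.
    """
    if tip_ic50 <= 0 or acoustic_ic50 <= 0:
        raise ValueError("IC50 values must be positive")
    return tip_ic50 / acoustic_ic50


# Hypothetical paired measurements: compound -> (tip-based IC50, acoustic IC50).
# These numbers are placeholders and are NOT taken from the patents.
PAIRED_IC50 = {
    "compound_A": (30.0, 20.0),
    "compound_B": (100.0, 4.0),
    "compound_C": (8.0, 0.5),
}

FOLDS = {name: fold_lower(tip, acoustic) for name, (tip, acoustic) in PAIRED_IC50.items()}

for name, fold in FOLDS.items():
    print(f"{name}: acoustic IC50 is {fold:.1f}-fold lower")

print(f"range: {min(FOLDS.values()):.1f} to {max(FOLDS.values()):.1f}-fold")
```

A ratio above 1 means the compound appears more potent with acoustic dispensing. Recording the dispensing method alongside each IC50 is what makes this comparison possible at all.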

• Pharmacophores and other computational models are used to guide medicinal chemistry.
• Non-tip-based methods may improve HTS results and avoid misleading computational and statistical models.
• There has been no analysis of the influence of dispensing processes on data.
• Public databases should annotate metadata to create larger datasets for comparing different computational methods.

How much data is reproducible, accurate, valid? The challenge of high-throughput science.

Measuring data: dispensing dependencies

Conclusions

Acknowledgments

Sean Ekins

Christopher Lipinski

Joe Olechno

John Wilbanks

Drug Disambiguation project team

RSC Cheminformatics Team

Thank you

Email: [email protected]

Twitter: @chemconnector

Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams

Email: [email protected]

Twitter: collabchem

Blog: http://www.collabchem.com/

