Post on 06-Aug-2020


Big Data, Wider Mindset

AN EXPANDED PERSPECTIVE FOR MEDICINAL CHEMISTRY Systems-based approaches to pharmaceutical development may uncover new drugs, but require a paradigm shift in how data are used.

PHARMA & LIFE SCIENCES

WHITEPAPER


Pharmacogenomic information ultimately serves to assist physicians in matching a therapy to a patient.

THE ROLE OF PHARMACOGENOMIC INFORMATION IN MEDICINE

As of May 2015, the U.S. Food and Drug Administration website listed 139 FDA-approved drugs as having pharmacogenomic information in their labeling (Figure 1). This information is provided for different reasons. In some instances, it highlights variants of proteins in the body that impact how a compound is metabolized, thereby reducing drug efficacy or leading to adverse effects. In other instances, the information is a measurable characteristic that identifies patients who are especially responsive to the drug’s mechanism of action. Regardless of indication, the pharmacogenomic information ultimately serves to assist physicians in matching a therapy to a patient.

Time to think big: an expansion of the exploratory space for drug discovery must accompany the use of big data in drug development.


Incorporation of genomic, proteomic or other ‘omics data—big data—into the development and indication of a drug is evidence of the expanding knowledge informing the creation of pharmaceuticals with the goal of maximizing their therapeutic impact. Coupled with new, streamlined approval processes, the result has been a small but noticeable surge in effective drugs entering the clinical setting and welcome therapy alternatives in especially difficult medical areas. This movement in pharmaceutical development is in its infancy, but it hints at the mostly untapped potential latent in the diversity of ‘omics data accumulating exponentially from the work of scientists in academia, research institutes and the pharmaceutical industry. The question is how to leverage this information to enhance productivity and success in pharmaceutical research and development.

Integrating more effective use of big data into drug development will necessitate an overhaul of methodologies and techniques designed to process, visualize, analyze and interpret large amounts of highly complex data. This is a large challenge in and of itself. Dr. Herbert Köppen, a veteran of the pharmaceutical industry, explains that “instigating change in a highly industrialized process with standardized workflows, deterministic task assignment, and narrowly defined outcome requirements is quite difficult.” Nevertheless, although implementation will require intensive work, the technologies and methodologies exist. Equally important, however, will be a conceptual overhaul. The use of big data must be accompanied by an expansion of the exploratory space for drug discovery, design and optimization, and this will in turn demand a shift in the way developers interact with data and apply what is learned.

Figure 1. Number of FDA-approved drugs in different medical areas that include pharmacogenomic biomarkers in their labeling. The list excludes biologics and biomarkers used exclusively for diagnostics. Some drugs are used in more than one medical area (1).

Medical area                     Number of FDA-approved drugs
Anesthesiology and analgesics    2
Cardiology and Hematology        14
Dental                           1
Dermatology                      2
Endocrinology                    7
Gastroenterology                 8
Genitourinary                    1
Gynecology                       1
Inborn errors of metabolism      3
Infectious disease               16
Neurology                        9
Oncology                         41
Pulmonary disease                3
Psychiatry                       24
Rheumatology                     5
Toxicology                       1
Transplantation                  1


A focused view on one interaction

To date, therapeutic compounds continue to emerge from target-oriented development approaches that quarantine lead discovery, design and optimization from the plethora of molecular interactions of the physiological theater where a drug exerts its effect. Focused on the isolated interaction between a compound and its molecular target, these approaches are often blind to the dynamics that arise from a therapeutic drug interacting with several biological molecules, or from biological molecules interacting with one another, or from variable microenvironmental conditions, all of which can have repercussions for the action of a drug.

Target-oriented approaches in drug development contributed to a detailed understanding of drug–target binding, but had little impact on pharmaceutical innovation.

[Figure 2 chart not reproduced. Left axis: number of approved new molecular entities (NME), 10 to 60. Right axis: billion US dollars, 10 to 60. Horizontal axis: 1980, 1990, 2000, 2010, 2014. Annotated time points: recombinant DNA; mining DNA sequence libraries; human genome completed.]

Figure 2. Number of new molecular entities (and new biological entities starting in 2004) approved by the U.S. FDA each year from 1980 to 2014. Superimposed on the graph is the annual R&D expenditure reported by US PhRMA company members (orange line) and time points when technologies supporting target-oriented approaches to drug development became available (arrows). The outlier peak in NME approvals observed in 1996 stems from the review of backlogged FDA submissions after an additional 600 new drug reviewers and support staff were hired, funded by the Prescription Drug User Fee Act. Data extracted from the U.S. FDA website, DiMasi et al. 1991, and Statista (4-6).


Percentage of new molecular entities (NME)

                                   First in class drugs    Follower drugs
From target-oriented screenings    23% (17)                51% (83)
From target-agnostic screenings    37% (28)                18% (30)
Total                              75                      157

Analyses of historical pharmaceutical output often argue that the application of target-oriented strategies in drug development has not led to increased productivity in the pharmaceutical industry (2). The number of new molecular entities (NME) and new biological entities (NBE) approved every year by the U.S. FDA has remained relatively flat since 1980, with the exception of a brief peak after 1996, when the FDA hired 600 new drug reviewers and support staff to process a large backlog of review submissions (3). The introduction of technologies and knowledge to support target-oriented lead generation, such as recombinant DNA technologies in the 1980s, the mining of DNA sequence libraries in the 1990s, and the sequence of the human genome completed in 2003, had little impact on this steady output (2) (Figure 2).

Others argue that target-focused approaches make fewer contributions to pharmaceutical innovation. A study reported by Forbes (7) highlights that over 20% of circa 1000 active oncology drug programs in 2012 concentrated on the same 8 targets: mTOR, c-MET, VEGF, c-Kit, PDGF, PI3K, HER2 and EGFR. Each of these targets had at least 24 clinical trials underway and several preclinical programs. The problem may lie in the difficulty of validating a target, which is a requirement for target-oriented approaches (2). As a result, focusing on well-known targets to develop next-generation or “best in class” drugs is a preferred strategy over exploring a new, uncharacterized target. In fact, over the last decade, most “first in class” small molecule drugs emerged from compound screening strategies that were target-agnostic whereas “follower” drugs tended to come from target-oriented drug discovery (8) (Figure 3).

This is not to say that target-oriented approaches have been fruitless; quite the contrary. Examination of single drug–target pairs has fleshed out a detailed understanding of how compounds interact with molecular targets and has generated a list of structural features that impact both the binding of a compound to a target and the way a compound behaves in a physiological milieu. Also, details about the activity and role of molecules within the context of cellular function have been elucidated through similarly focused approaches that isolate two molecules and investigate their structural and functional interactions.

Nevertheless, this focused examination of single interactions has a limitation: it cannot capture all the information that impacts the effectiveness of a drug operating in a complex biological system. Consequently, the therapeutic effect of a drug may be limited by unforeseen molecular interactions beyond the drug–target pair. More importantly, the opportunities to uncover new molecular entities and innovative modes of action may be limited by the narrow focus on just one interaction. A trend toward an expanded exploration scope has emerged with two relatively young disciplines, systems biology and systems chemistry. Applied to drug discovery and development, systems-based approaches use large amounts of high-quality data to predict interactions among biological molecules and compounds that may be relevant for the development of new therapies.

Figure 3. Percentage of FDA-approved new molecular entities (NME) that are first-in-class and follower drugs resulting from target-agnostic phenotypic screenings versus target-based screening. Remaining numbers are accounted for by biologics and modified natural substances. Data extracted from Swinney and Anthony, 2011 (8).

“First in class” drugs are therapeutic compounds that are truly novel to the market; for example, a drug that uses a unique mechanism of action to exert its therapeutic effect. Such drugs grant the developer initial exclusivity in the market but carry the risk that the drug will not work in humans, will not be better than existing therapies, or has adverse effects that can only be uncovered with time.

“Best in class” drugs are “follower” drugs that build on a therapy that is already in the clinic and has been proven to work in patients, with the aim to deliver a better therapeutic effect than the original drug. The risk in developing these drugs is not that they may not work in patients but that they may not be an improvement.

Elsevier R&D Solutions

A disconnect of scope

Any one cell has thousands of networks of interacting molecules that regulate cellular metabolism, growth, reproduction and other critical survival functions. Normal cell phenotype—the sum of everything a cell does—depends on the meticulous orchestration of these so-called signaling pathways, which help a cell respond to its immediate environment. The sum of the phenotypes of all cells in a body determines the internal physiological state of that person: exact chemistry of the blood, function of tissues, coordinated activity of organs, etc. Imbalances at any level can lead to disease.

With an incomplete picture of the complex mechanisms that lead to disease, drug development often operates under the assumption that a drug that decommissions a molecule in a molecular network causing an aberrant phenotype can taper or eliminate the disease. Thus, a target-oriented development program looks for compounds that bind to a selected target molecule and works on altering the structure of a few winning choices to improve their affinity—the strength with which the compound binds the target—and their efficacy—the compound’s ability to initiate a response, such as inhibiting the action of the target. Assays are developed to specifically measure these characteristics of compounds. For example, recombinant biology techniques allow the target to be produced in sufficient quantities for standardized assays that measure how a compound alters the activity of the target. Also, the three-dimensional structure of the target or a relevant portion thereof can be constructed to conduct in silico simulations of binding. Through multiple iterations of the drug design cycle—test a candidate compound, make changes to the structure to improve its affinity and/or efficacy, test it again—a drug candidate emerges which is then assessed in preclinical models, such as cell cultures or animals, and then in clinical trials on humans. This latter step represents a disconnect of scope between the generation of a lead and its testing in organisms. The first operates at the level of a single molecular interaction; the latter tests the drug candidate in the context of a full-fledged organism with all the complexity of a biological system.

Does the isolated interaction between target and lead compound play out the same way within a complex and interlinked network of biological molecules? What environmental conditions will hinder the drug on its path from route of administration to target? Zooming out from the single drug–target interaction to include knowledge about relevant molecular networks and influential parameters at cellular, tissue, and even organism level can support new models that provide answers to these questions. There are still large gaps in the basic understanding of which and how molecules interact within a given network, cell, or tissue. Nevertheless, data are accumulating that can elucidate the nature of broad-scale molecular interactions and describe how those interactions contribute to health or disease. As knowledge grows, zooming out from the single compound–target pair to complement target-oriented approaches with an expanded exploration of complete systems can lead to new therapies.


Each interaction dimension of the exploratory space for medicinal chemistry has the potential to mitigate the disconnect between the development and utility of a drug and reveal innovative therapeutic approaches.

AN EXPANDED PERSPECTIVE FOR MEDICINAL CHEMISTRY

The exploratory space of medicinal chemistry encompasses the interaction between a biological landscape containing all druggable targets relevant to a therapeutic area and a chemical landscape containing all compounds that modulate the behavior of those targets. In an ideal world, the complete collection of targets and compounds would be known, but in reality only portions of both landscapes are characterized and even smaller subsets of compounds and targets are used in drug development.

Target-oriented approaches extract one target from the biological landscape and explore all the compounds that bind the target to identify and then optimize a lead with maximum specificity—the compound preferentially binds one target—and potency—the compound binds that target strongly and has a significant impact on target activity. In essence, this is a one-dimensional exploration of target promiscuity (highlighted portion of Figure 4). The practical advantage of this approach is that it pares down the exploratory space relevant to the development program to a simple, linear and tractable link between disease and therapeutic drug. This link, furthermore, can be broken down to a set of well-understood design criteria and well-understood measures of drug impact that are evaluated in a systematic and iterative drug design cycle. This tractability is at the core of the industrialized workflows characteristic of the competitive arena in which pharmaceutical research and development unfolds.

Figure 4. The exploratory space available to medicinal chemistry includes compounds and targets that interact in multiple dimensions. The most commonly explored dimension is target promiscuity: which compounds bind to a target? Equally valuable for exploration are the interaction dimensions encompassing compound promiscuity (which targets does a compound bind?), compound–compound interactions, and interactions among targets as well as other molecules present in the microenvironment of a target.

[Figure 4 diagram not reproduced. Labels: target promiscuity (exploratory space of target-oriented approaches); compound promiscuity; compound–compound interactions; target–target interactions (biological networks); interactions with microenvironment.]


There are, however, three other interaction dimensions of the exploratory space available to medicinal chemistry—compound promiscuity, functional correlations among targets, and functional correlations among compounds (Figure 4). Each of these dimensions has the potential to mitigate the disconnect between the development of a drug and the utility of that drug by better approximating the complexity of disease mechanisms and the internal physiological environment of a patient. Each of these dimensions also has the potential to reveal innovative therapies currently out of reach because target-oriented development approaches point to a narrow subset of drug action mechanisms.

To begin with, compound promiscuity is likely more prevalent than commonly believed. Despite being developed to bind and modulate a single target, FDA-approved drugs have been shown to interact with an average of six molecular targets (10). Capitalizing on that information may lead to much-needed multi-target drugs that can curb resistance development. Second, targets operate within the context of molecular networks and their response to a drug can be modulated by other components of the network. For example, molecular crosstalk between tumor cells and surrounding non-tumor cells contributes significantly to tumor growth, metastasis and resistance to therapy (11). This means that heterogeneity of cells within a tumor and in the environment immediately surrounding the tumor plays a potentially critical role in drug efficacy. Furthermore, novel therapeutic mechanisms may arise from modulating this crosstalk. Finally, complementing the goal to create multi-target drugs are combinatorial therapies where two or more compounds are combined to generate a polypharmacological effect that is more effective than hitting a single target. Creating such therapies requires examining interactions and correlated effects of compound combinations.

These three interaction dimensions—compound promiscuity, functional correlations among targets, and functional correlations among compounds—are taken into account in target-oriented drug design to some extent. For example, location of a target within a pathway is a relevant factor in target identification. Compound promiscuity is important for toxicology assessments. However, directly incorporating data and interpretation of these dimensions into the actual discovery, design and optimization of a lead itself has the potential to open new areas of exploration for effective therapies. Systems biology and systems chemistry are developing the tools and methods to leverage massive sets of ‘omics data to explore the complete informational space relevant to medicinal chemistry.

Target promiscuity is the susceptibility of a biological molecule to bind and be affected by more than one compound.

Compound promiscuity is the ability of a compound to specifically interact with more than one biological molecule.

Functional correlations among targets refer to any interaction among biological molecules that influences the effect of a compound; e.g., networks that create functional redundancy, or interactions with surrounding non-target molecules relevant to the microenvironment.

Functional correlations among compounds are compound–compound interactions that impact their potential therapeutic effect.


Systems biology and systems chemistry enable a holistic view of a biological system that is more than the sum of its molecular components.

LEVERAGING EMERGENT PROPERTIES TO EXPAND HORIZONS

Systems biology and systems chemistry strive to capture the complexity of biological entities and their interactions with chemicals, and then capitalize on the properties that emerge from these interconnected systems. In this way, they enable a holistic view of an organism, tissue, cell or network of biological molecules (systems) responding to perturbations triggered by one or more chemicals that is much more than the sum of its components (12). Both are data-driven sciences with methodologies to summarize, visualize and interpret data across multiple scales (e.g., molecules, networks of molecules, cells, immediate surroundings of cells, tissues, an organism) and concepts (e.g., proteomics, genomics, epigenomics, metabolomics).

Critical to the predictive power of systems biology and chemistry is the use of large amounts of high-quality data.

Systems-based approaches can complement conventional target-oriented approaches to decipher the complexity of molecular interactions within the context of heterogeneous microenvironments. In doing so, these methodologies offer a novel perspective that can lead to innovative therapeutic mechanisms (not just single targets), more sophisticated characterization of patients to better match them to therapies and combination therapies with greater efficacy or a means to bypass drug resistance (12).

Mechanistic models of biological signaling pathways

With prior knowledge of the identity and interactions between molecules that constitute a biological signaling pathway, a mechanistic model of the pathway can be used to simulate and understand dynamic properties that emerge from the interactions and the contribution of each component to overall pathway activity. Common mechanistic models describe each biological relationship between the components of a pathway with a differential equation that accounts for the components involved in the relationship and the kinetic law governing their interaction. The complete pathway is then represented by a set of equations that can be solved to examine behavior of the overall pathway over time and the functional interdependences of the components involved. Exploring the model computationally can highlight features of the pathway that are robust to perturbation (such as inhibition by a drug) as well as components that have a desired effect on the pathway if modulated. Dr. Köppen points out that good examples of such models are rare because their utility depends on “very detailed knowledge of network structure and empirically determined kinetic constants,” which are difficult to obtain. However, insights from elegant examples in the literature emphasize the benefits that efforts toward such modeling could have on drug design and optimization.
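To make the idea of "one differential equation per relationship" concrete, the following is a minimal sketch of a two-step activation cascade with mass-action kinetics. It is not taken from any study cited here: the species A and B, the rate constants, and the 90% inhibition are all hypothetical values chosen for illustration, and a fixed-step Runge-Kutta integrator stands in for whatever solver a real modeling project would use.

```python
# Toy mechanistic model of a two-step signaling cascade.
# All species names and rate constants are hypothetical, chosen for illustration only.

def derivatives(state, p):
    """Return (dA/dt, dB/dt) for the active fractions of kinases A and B."""
    a_act, b_act = state
    # A is activated by a constant stimulus and deactivated at a first-order rate.
    da = p["k_on_a"] * p["stimulus"] * (1.0 - a_act) - p["k_off_a"] * a_act
    # B is activated by active A and deactivated at a first-order rate.
    db = p["k_on_b"] * a_act * (1.0 - b_act) - p["k_off_b"] * b_act
    return (da, db)


def rk4_step(state, p, dt):
    """Advance the state by one fixed step of the classical Runge-Kutta scheme."""
    def shifted(base, slope, h):
        return tuple(x + h * s for x, s in zip(base, slope))

    k1 = derivatives(state, p)
    k2 = derivatives(shifted(state, k1, dt / 2), p)
    k3 = derivatives(shifted(state, k2, dt / 2), p)
    k4 = derivatives(shifted(state, k3, dt), p)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(p, t_end=50.0, dt=0.01):
    """Integrate from a fully inactive pathway and return the final (A, B) activities."""
    state = (0.0, 0.0)
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, p, dt)
    return state


baseline = {"stimulus": 1.0, "k_on_a": 1.0, "k_off_a": 1.0, "k_on_b": 2.0, "k_off_b": 1.0}
# Perturbation: a hypothetical drug that blocks 90% of the activation of A.
inhibited = dict(baseline, k_on_a=0.1)

a_base, b_base = simulate(baseline)
a_inh, b_inh = simulate(inhibited)
print(f"baseline:  A = {a_base:.3f}, B = {b_base:.3f}")
print(f"inhibited: A = {a_inh:.3f}, B = {b_inh:.3f}")
```

In this toy setting, blocking 90% of the upstream activation step lowers the downstream steady-state activity from 0.5 to about 0.15, a reduction of roughly 70% rather than 90%. That is a small instance of the kind of buffering against perturbation that computational exploration of a model can expose.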


One such example is a mechanistic model constructed by Bianconi and colleagues (13) to explore interactions between two pathways known to play a role in lung cancer. To this end, they laid out a schematic that summarizes the effect of both pathways on the activation of a protein called extracellular signal-regulated kinase, or ERK. This protein is involved in regulating the reproduction of cells and its increased activity has been implicated in multiple aspects of cancer development (Figure 5).

Using parameters derived from the literature, they formulated 18 differential equations to describe the interaction schematic. They then ran in silico experiments to observe the behavior of pathway components over time. The model predicted a rapid activation of ERK as a result of pathway stimulation that then tapered off to basal levels. The activation of ERK was significantly higher under the condition of simultaneous high expression of the genes that code for the receptors that stimulate the pathways (EGFR and IGF1R), suggesting an augmented signal whenever a large number of receptors are present and active in the network. Furthermore, tapering of activity to basal levels took nearly 3 times longer under this condition.

To test their in silico model, the authors collected clinical samples from lung cancer patients and examined expression of EGFR and IGF1R relative to disease-free survival (DFS). Those patients with heightened expression of both genes had significantly shorter DFS than other patient groups. Furthermore, patients with high co-expression and higher copy numbers of the two genes exhibited even shorter DFS. Both experimental results confirm the predictions of the model.

To examine the dynamics of the system, the authors created a second model based on two equations that described the aggregate signal of each pathway and incorporated crosstalk across pathways. That is, if the aggregate signal of one pathway exceeded a certain threshold, it stimulated the other pathway. In silico simulations based on these equations showed that the system had two steady-state points, one where the aggregate signal of both pathways was low and one where they were both high. This bistability was a qualitative property of the system, maintained over a range of parameter values as long as the crosstalk between the two pathways was high. When the threshold triggering crosstalk was increased, the upper equilibrium point disappeared. That is, reducing the crosstalk between the two pathways created a situation where perturbations to the system ultimately settled at the equilibrium point where both pathway signals are low. Although the authors do not discuss this in their analysis, this observation raises the possibility of an alternative therapeutic mechanism for lung cancer beyond inhibiting EGFR or IGF1R individually. It may be possible to mitigate the impact of EGFR and IGF1R overexpression by modulating crosstalk between the two pathways.
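The qualitative behavior described above (two steady states under strong crosstalk, one under weak crosstalk) can be reproduced with a very small toy system. The sketch below is not the model of Bianconi and colleagues: the two equations, the Hill-type switch used to represent the crosstalk threshold, and every parameter value are assumptions made purely for illustration.

```python
# Toy two-variable model of crosstalk between two pathways (hypothetical equations).
# x and y stand for the aggregate signals of pathway 1 and pathway 2.

def hill(u, threshold, n=4):
    """Smooth switch: close to 0 well below the threshold, close to 1 well above it."""
    return u**n / (threshold**n + u**n)


def settle(x, y, threshold, basal=0.1, gain=1.0, decay=1.0, dt=0.01, t_end=100.0):
    """Integrate with forward Euler and return the state reached at t_end."""
    for _ in range(int(round(t_end / dt))):
        # Each pathway has a basal input, is stimulated by the other pathway
        # once that pathway's signal passes the threshold, and decays linearly.
        dx = basal + gain * hill(y, threshold) - decay * x
        dy = basal + gain * hill(x, threshold) - decay * y
        x, y = x + dt * dx, y + dt * dy
    return x, y


results = {}
for threshold in (0.5, 1.5):
    from_low = settle(0.0, 0.0, threshold)[0]    # start with both signals off
    from_high = settle(2.0, 2.0, threshold)[0]   # start with both signals strongly on
    results[threshold] = (from_low, from_high)
    print(f"threshold {threshold}: low start settles at {from_low:.2f}, "
          f"high start settles at {from_high:.2f}")
```

With the lower threshold (easier crosstalk), the two starting conditions end at different steady states, one low and one high. With the higher threshold, both starting conditions end at the low state, mirroring the disappearance of the upper equilibrium point.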

Figure 5. Schematic of the interacting EGFR and IGF1R pathways as related to lung cancer. Activation of the receptors EGFR and IGF1R recruits SOS to the receptor, where it stimulates Ras. Activated Ras binds to Raf, which ultimately causes ERK to be activated via MEK. The activated ERK sets transcription factors into action to trigger the production of proteins needed for cell proliferation and survival. Through a negative feedback loop via p90’Rsk, activated ERK decreases activation of SOS (dashed arrow). Another important negative regulation mechanism occurs through PIK3, which is stimulated by Ras, EGFR and IGF1R and enables Akt to inhibit activity of Raf. Finally, PP2A also decreases the activation of ERK. Reproduced with permission from Bianconi et al. 2012 (13).

[Figure 5 diagram not reproduced. Nodes: EGFR, IGF1R, SOS, PIK3, Ras, RasGAP, Raf, p90’Rsk, P’tase, Akt, PP2A, MEK, ERK, transcription factors. Outcome: tumor cell survival, proliferation and invasion.]


Figure 6. Two hypothetical networks of genes described as a schematic and as adjacency matrices. A “1” in the adjacency matrix equals a connection between two nodes (genes) of the network. Thus, for example, the top line of the matrix on the right indicates that gene 1 is not connected to itself, 4, 6 and 7, but connected to 2, 3, and 5 (0 1 1 0 1 0 0). In WCNA, the adjacency matrix consists of measured correlations between pairs of genes. Genes 2 and 5 most likely belong in a cluster in the network on the right because they share many connections (1, 3 and 4). The opposite is true for the network on the left.

Summarizing big data: clustering genes

Development of a drug may begin with the exploration of gene expression profiles of paired tissue samples taken from healthy and sick patients. By comparing the two sets of profiles, it is possible to extract patterns of gene expression that may be relevant to disease mechanisms. So, if all sick patients show increased expression of a particular gene compared to healthy patients, this gene may play a causative role in the disease. Standard differential analysis of gene expression treats each gene in a dataset as an individual entity and the comparison of profiles of healthy versus sick patients calculates the statistical likelihood that the change observed in the expression of one gene is significant. That is, the output of the analysis is a list of p-values, one for each gene included in the profile.

Although insightful at the level of individual genes, this method fails to account for connections between genes. Chances are, the thousands of genes often included in these profiles can be organized into clusters with correlated expression patterns. Another drawback of standard differential gene expression analysis is that it suffers from the multiple testing problem (14). When thousands of statistical tests are performed on a dataset, it is likely that some comparisons will appear to be statistically significant by random chance alone (i.e., false positives).
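The scale of the multiple testing problem is easy to work out. The sketch below computes the expected number of false positives for a hypothetical profile of 10,000 genes and then applies two standard corrections, Bonferroni and Benjamini-Hochberg, to a short list of made-up p-values. Neither correction is discussed in the text above; they are included only as familiar examples, and the gene count, significance level and p-values are all illustrative assumptions.

```python
# Illustrative numbers only: the gene count and the p-values below are hypothetical.

def expected_false_positives(n_tests, alpha):
    """Expected count of false positives if no gene is truly different."""
    return n_tests * alpha


def family_wise_error_rate(n_tests, alpha):
    """Probability of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_hits(p_values, alpha):
    """Keep p-values below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p for p in p_values if p <= cutoff]


def benjamini_hochberg_hits(p_values, q):
    """Keep the k smallest p-values, where k is the largest rank with p <= rank/m * q."""
    ranked = sorted(p_values)
    m = len(ranked)
    n_hits = 0
    for rank, p in enumerate(ranked, start=1):
        if p <= rank / m * q:
            n_hits = rank
    return ranked[:n_hits]


n_genes, alpha = 10_000, 0.05
print("expected false positives:", expected_false_positives(n_genes, alpha))
print("chance of at least one:  ", family_wise_error_rate(n_genes, alpha))

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
unadjusted = [p for p in p_values if p <= alpha]
print("unadjusted hits:         ", len(unadjusted))
print("Bonferroni hits:         ", len(bonferroni_hits(p_values, alpha)))
print("Benjamini-Hochberg hits: ", len(benjamini_hochberg_hits(p_values, alpha)))
```

At a significance level of 0.05, a profile of 10,000 genes is expected to yield about 500 apparent hits even when nothing differs between the groups, which is why per-gene p-values alone are a weak basis for selecting candidates.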

Dr. Steven Horvath and his team at the University of California, Los Angeles, developed a mathematical strategy to highlight clusters of correlated genes—known as modules—and reveal patterns that reflect network-level activity rather than changes at the level of individual network components (15). Though initially developed for gene expression data, Horvath and others have successfully applied this methodology to a variety of other data, including methylation profiles, miRNA, functional MRI data and peptide count data. The method, called weighted correlation network analysis (WCNA), uses correlations between all genes in a dataset of expression profiles to arrange them into an adjacency matrix that reflects a coexpression network of all the genes (Figure 6). Every gene pair is then examined to assess how strongly correlated (or connected) they are to the same group of other genes, a measure that corresponds to the likelihood of the two genes belonging to the same module. In this way, the thousands of genes included in the expression profiles are arranged into a handful of modules and a summarizing parameter for each (called an eigengene) is used to compare the datasets from healthy and sick patients and assess the relevance of any given module to the disease. Furthermore, genes within a module can be annotated with known functions to identify cellular functions related to the disease, and genes with a high degree of connectivity—hub genes—within a module can be evaluated as predictive biomarkers of the disease or possible targets for therapies.
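The step of asking whether two genes are connected to the same group of other genes can be illustrated with the two small networks of Figure 6. The sketch below transcribes both adjacency matrices and computes, for genes 2 and 5, the shared neighbors and a topological overlap score of the form commonly used in this family of methods: shared connections plus the direct connection, divided by the smaller of the two connectivities plus one minus the direct connection. Using binary matrices rather than weighted correlations is a simplification for illustration, and the function names are hypothetical.

```python
# Binary adjacency matrices transcribed from Figure 6.
# Entry [i][j] is 1 when gene i+1 and gene j+1 are connected.
LEFT = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
RIGHT = [
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
]


def shared_neighbors(adj, i, j):
    """Genes (1-based labels) that are connected to both gene i and gene j."""
    return [u + 1 for u in range(len(adj)) if adj[i - 1][u] and adj[j - 1][u]]


def topological_overlap(adj, i, j):
    """Overlap score in [0, 1]; higher means the two genes share more of their neighbors."""
    a_ij = adj[i - 1][j - 1]
    n_shared = len(shared_neighbors(adj, i, j))
    k_i, k_j = sum(adj[i - 1]), sum(adj[j - 1])   # connectivity of each gene
    return (n_shared + a_ij) / (min(k_i, k_j) + 1 - a_ij)


for name, adj in (("left", LEFT), ("right", RIGHT)):
    print(name, "shared neighbors of genes 2 and 5:", shared_neighbors(adj, 2, 5),
          "overlap:", round(topological_overlap(adj, 2, 5), 3))
```

Genes 2 and 5 are directly connected in both networks, yet the overlap score is much higher in the right-hand network because they also share genes 1, 3 and 4 as neighbors. Clustering on a score of this kind, instead of on direct connections alone, is what groups genes into modules.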

Liu et al. (16) used WCNA to reveal modules of correlated genes in expression data from 58 samples of lung cancer tumor tissue and adjacent healthy tissue. They identified one module consisting of genes involved in wound response, cell proliferation, cell adhesion and response to inflammatory stimulus, and summarized that module with a network of 15 hub genes. They tested the hub genes in 6 other datasets as predictive markers of disease, demonstrating that they could classify healthy and tumor samples with high accuracy. Of the 15 “hub” genes, 8 had been previously linked to lung cancer, 2 had been associated with other cancers and tumor angiogenesis, but 5 were new and merit further investigation to characterize them functionally and assess their potential as drug targets.

Figure 6. Schematic of a network of genes and the corresponding adjacency matrix, shown for two example networks of genes 1 to 7 (rows and columns follow gene order).

First adjacency matrix:

0 1 0 0 0 0 0
1 0 1 0 1 0 1
0 1 0 0 0 0 1
0 0 0 0 1 1 0
0 1 0 1 0 1 0
0 0 0 1 1 0 0
0 1 1 0 0 0 0

Genes 2 and 5 share no connections to other genes. 2 and 5 are not in the same cluster.

Second adjacency matrix:

0 1 1 0 1 0 0
1 0 1 1 1 0 1
1 1 0 0 1 0 0
0 1 0 0 1 0 1
1 1 1 1 0 1 0
0 0 0 0 1 0 0
0 1 0 1 0 0 0

Genes 2 and 5 share connections to genes 1, 3 and 4. 2 and 5 are in the same cluster.
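The comparison of genes 2 and 5 in the two adjacency matrices of the figure can be reproduced with a short Python sketch. The shared-neighbor count follows directly from the matrices; the topological overlap score is a commonly used formula from the weighted network literature and is included as an assumption, since this whitepaper does not state the exact measure.

```python
# Adjacency matrices for the two example networks (genes numbered 1 to 7).
FIRST = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
SECOND = [
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
]


def shared_neighbors(adj, i, j):
    """Genes connected to both gene i and gene j (1-based labels)."""
    row_i, row_j = adj[i - 1], adj[j - 1]
    return [k + 1 for k in range(len(adj)) if row_i[k] and row_j[k]]


def topological_overlap(adj, i, j):
    """(shared + direct link) / (smaller degree + 1 - direct link)."""
    row_i, row_j = adj[i - 1], adj[j - 1]
    shared = sum(a * b for a, b in zip(row_i, row_j))
    direct = adj[i - 1][j - 1]
    return (shared + direct) / (min(sum(row_i), sum(row_j)) + 1 - direct)


print(shared_neighbors(FIRST, 2, 5))    # []
print(shared_neighbors(SECOND, 2, 5))   # [1, 3, 4]
print(round(topological_overlap(FIRST, 2, 5), 2))   # 0.33
print(round(topological_overlap(SECOND, 2, 5), 2))  # 0.8
```

Genes 2 and 5 are directly linked in both networks, yet only the second network gives them a high overlap score, because there they also share three neighbors. This is why shared connectivity, rather than a single direct link, drives module assignment.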

Summarizing big data: representing drug–protein interactions

In a similar vein, the discovery of novel compound–target interactions can benefit from mathematical strategies that summarize information describing known interactions between compounds and targets. Virtual screening of compounds is a common technique in lead identification that serves to narrow down large libraries of compounds to a more manageable number of candidates for further examination. Ligand-based virtual screening takes datasets of known compounds and builds a model that predicts binding based on physicochemical properties of compounds that bind a specific target versus those that do not bind it. This predictive model can then be used to evaluate the potential of novel compounds in binding that target. Structure-based virtual screening builds a three-dimensional rendition of a site on a target where compounds are known to bind, and then calculates the likelihood that a given novel compound will bind to that site based on structural characteristics. Although these screening methods have proven invaluable in streamlining lead identification, they have one shortcoming: the target used in the screening is static. As a result, outcomes of the screening are limited by the narrow or single-target scope.

Yabuuchi et al. (17) proposed a virtual screening concept that focuses on compound–target interactions as the building blocks of a predictive model. They constructed a database of known proteins and their known interactions with compounds, where each interaction was described by a vector representation of all relevant physicochemical properties of the compound and detailed amino acid sequence patterns of the target. Then they employed machine learning to create a predictive model capable of assessing the likelihood of a successful novel compound–target interaction. That is, the predictive model was no longer restricted to a single target or target group. Instead, it was able to extract compound properties and target sequence features that predicted successful binding between an untested target and compound pair. Applying the methodology to EGFR and CDK2, both proteins implicated in cancer, they demonstrated significantly higher hit rates in subsequent bioassays compared to screening entire chemical libraries. More importantly, many of the assay hits were compounds that were structurally different from known ligands for each target. Thus, the output of the virtual screening was not only a higher number of drug candidates but also a more diverse repertoire of candidates.
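The idea of describing each interaction as a single vector can be illustrated with a small Python sketch. It is not the model of Yabuuchi et al.: the two compound descriptors, the short peptide sequences, the use of amino acid composition as the sequence feature and the nearest-neighbor classifier are all assumptions chosen to keep the example compact.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def pair_vector(compound_descriptors, target_sequence):
    """Concatenate compound descriptors with target sequence features."""
    return list(compound_descriptors) + sequence_features(target_sequence)


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def predict_binding(query, training_pairs):
    """Return the label of the closest known compound-target pair."""
    nearest = min(training_pairs, key=lambda item: squared_distance(query, item[0]))
    return nearest[1]


# Hypothetical training data: (pair vector, binds?) with made-up descriptors
# scaled to the range 0 to 1 and made-up peptide sequences.
training_pairs = [
    (pair_vector([0.2, 0.8], "MKKLLA"), True),
    (pair_vector([0.9, 0.1], "MKKLLA"), False),
    (pair_vector([0.2, 0.8], "GGSSGG"), False),
]

# An untested pair: a similar compound and a target with a similar sequence.
query = pair_vector([0.25, 0.75], "MKKLLV")
print(len(query))                              # 22 features per pair
print(predict_binding(query, training_pairs))  # True
```

Because the compound and the target sit in the same vector, the classifier can draw on pairs involving other targets, which is what frees the screen from a single static target.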

Phenotypic screening: capturing microenvironment

Phenotypic screening of compounds, a standard method in pharmaceutical research before the 1980s when recombinant DNA techniques were introduced, has regained popularity in drug lead discovery. Taking a step back from the molecular theatre of drug–target interactions, phenotypic drug discovery tests compounds on cell cultures and small organisms (e.g., worms, flies, zebrafish) to identify hits that generate a desired biological effect. This biological effect may be measured as changes in the physical properties of cells or the localization of their components, changes in the production of RNA, proteins, metabolites and other molecules, or any other measure that can be linked to the disease of interest (2).

Unlike in target-directed drug discovery, the targets are not known in advance, and readouts measure the combined result of multiple targets and pathways that are interrogated simultaneously. An advantage of phenotypic drug discovery is that it assesses the capacity of one or more compounds to modulate disease rather than just the ability to bind a target. This is one reason for the renewed interest in phenotypic screening: it bridges the disconnect between the development and the utility of a drug already at the point of lead discovery. Other arguments are that phenotypic screening has a greater potential to lead to novel targets, increase the chemical diversity of compounds being evaluated as drug leads, and uncover novel mechanisms of action (18) (see Figure 3).

Cell-based assays have been increasingly applied in the industrialized workflows of high-throughput drug screening, though the development of such assays is challenging, time-consuming and project-specific (18). One area where phenotypic screening is particularly promising is the assessment of the microenvironment and its impact on drug action. Co-cultures of different cell types are used to evaluate whether the presence of cells surrounding those to be treated with a compound impacts responsiveness. For example, it has been shown that the anti-tumor effect of gefitinib, a targeted lung cancer medicine, is attenuated by the presence of fibroblasts co-cultured with lung cancer cells (19). In fact, culturing tumor cells with stromal cells, such as fibroblasts, can cause them to arrange differently on a two-dimensional plane and in a three-dimensional matrix, which can impact their uptake of a given compound (18). Furthermore, a systematic evaluation of co-cultures using different stromal and cancer cell types demonstrated that stromal cells commonly mediate resistance to therapeutic agents (20).

Once a phenotypic screening assay is in place, the next challenge is to translate the resulting data into guidance for the optimization of leads identified in the screen. Typically, a target or possible mechanism of action is sought, and a spectrum of methods is used to arrive at these. At one end of the spectrum are chromatography-based techniques where potential targets are isolated from cell extracts as they bind to the identified lead compound. This can be done by “capturing” the target as it binds to the compound immobilized on a surface or by detecting changes in thermodynamic stability that occur when the target binds the compound. Another option is to express a library of protein targets on the surface of phage particles (simple organisms that hijack the replication machinery of host cells) and isolate those phage particles that bind to the identified lead compound. At the other end of the spectrum are in silico approaches that take advantage of large databases to predict potential targets or mechanisms of action. For example, databases can be mined to predict targets based on structural similarities shared between the identified lead compound and well-characterized compounds (21). Another intriguing approach has been the construction of response profile databases for assay systems. That is, a system, such as an optimized cell culture assay, is exposed to a large library of known bioactive compounds for which interactions with biological molecules and/or mechanisms of action are known. Detailed response profiles are recorded and made searchable so that, by comparing experimental profiles to the database, hypotheses about target and action can be generated (18).
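Searching a response profile database amounts to ranking reference profiles by their similarity to an experimental profile. The Python sketch below uses cosine similarity as one plausible similarity measure; the reference names, mechanisms and readout values are hypothetical placeholders rather than entries from any real database.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two response profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def rank_matches(profile, database):
    """Sort reference compounds from most to least similar to the profile."""
    scored = [
        (cosine_similarity(profile, entry["profile"]), name, entry["mechanism"])
        for name, entry in database.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical reference database: each profile holds four assay readouts
# expressed as changes relative to an untreated control.
database = {
    "reference_A": {"mechanism": "kinase inhibition", "profile": [-1.2, 0.1, 0.9, -0.4]},
    "reference_B": {"mechanism": "tubulin disruption", "profile": [0.8, -1.1, 0.0, 0.5]},
    "reference_C": {"mechanism": "proteasome inhibition", "profile": [0.1, 0.9, -0.7, 1.0]},
}

# Response profile measured for a new screening hit (made-up values).
hit_profile = [-1.0, 0.2, 1.1, -0.3]

for score, name, mechanism in rank_matches(hit_profile, database):
    print(f"{name}: {mechanism} (similarity {score:.2f})")
```

The top-ranked reference suggests a mechanism hypothesis for the hit, which still has to be confirmed experimentally.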

Whether identifying targets for compound hits of phenotypic screening will be a necessity in the future remains to be seen. Currently, subsequent optimization of the compound (and approval to pursue a compound series) relies on target-oriented approaches. However, Dr. Köppen notes that phenotypic screening may also be a powerful tool for lead optimization, as the “system affords information about cytotoxicity, off-target effects, and more.” He continues, “Implementing phenotypic testing for drug optimization will require moving from structure–activity relationships to structure–readout relationships, where the readout is a set of relevant phenotypic parameters that reflect a meaningful manifestation of network [or pathway] function.” The aforementioned response profile databases could be a step in this direction.

DATA, DATA AND MORE DATA

Driving the interest in expanding the exploratory space of medicinal chemistry, so that it includes a comprehensive perspective of all factors that influence the therapeutic window of a drug, is the recognition that current attrition rates in pharmaceutical development (an estimated 80-90%; see 4, 22) are not sustainable and likely result from the disconnect in scope between how drugs are developed and how they are ultimately used. Every step of the drug discovery process has hurdles and entails risks. Minimizing these by narrowing this disconnect means developing drugs within the context of complex biological systems. This in turn means informing drug discovery, design and optimization with as much knowledge as possible about the various interaction dimensions that result from compounds exerting their effect on networks of biological molecules in a physiologically heterogeneous environment.

Network- or systems-based approaches can support a more comprehensive conceptual framework for drug development but they rely heavily on large amounts of high-quality data. Mechanistic models are informative only if constructed based on solid, empirically determined parameters, which must be extracted from the literature or defined. The model from Bianconi et al. (13) included 45 parameters for which they established values based on available literature. Furthermore, mechanistic models must be tested experimentally, improved based on results, and then tested again. Similarly data-intensive was the work by Liu et al. (16) who used over 1.5 Gb of raw data to construct and validate their gene module via WCNA, and by Yabuuchi et al. (17) who used over 15,000 pairs of kinases and kinase inhibitors to construct their virtual screening model and validate it with EGFR and CDK2.

Looking into the future, as the search for novel pharmacotherapies taps into this expanded exploratory space, the magnitude and diversity of data needed to elucidate interactions in complex biological systems, understand how those interactions contribute to health or disease, and ultimately identify potentially effective therapies will grow. Effective use of data-driven methodologies will require drug developers themselves to adapt to a new reality, one that calls for better management and use of information across a range of scientific domains. Dr. Scott Lusher examines data challenges in different scientific disciplines to apply lessons learned. He explains that as more data inform the drug design cycle, data evaluation and decision-making will take longer. In addition to new ways of managing, sharing and visualizing more complex data, he explains that all data will also need to be openly available at all times to team members of a drug development program, and traceable to their source. This will demand new strategies to incentivize members to enter their data into a shared knowledge management platform and to control data accuracy. “The nature of project meetings will also need to change,” he continues. “Solid evaluation and decision-making will require the input of everyone in the team, including those generating the data. Interpretation of the data generated by team members should be done as a group, with everyone looking at the same, most up-to-date data.” Truly understanding generated data and making meaningful connections between data points will prove to be the most fruitful analysis strategy, but that will require the time and flexibility to explore different ideas and to make mistakes.

With all its challenges, Dr. Köppen sees the incorporation of systems-based approaches to drug development as inevitable. “Quite honestly,” he says, “I see no way to escape this paradigm shift. The fact is, only drugs that provide a real therapeutic benefit will pay off research and development investment, and it is clear that the single-target approach no longer meets that demand.” In response to the question of what it will take for systems biology to be used routinely in drug development, he says, “It will take one research division head to have the foresight to make an investment along this line; it will take a lot of work to validate techniques and models; and it will take time for this field, which is in its infancy, to find strong footing.” However, exploring these uncharted waters may also usher in a new era in pharmaceutical innovation.

REFERENCES

1. U.S. Food and Drug Administration. Table of Pharmacogenomic Biomarkers in Drug Labeling. http://www.fda.gov/drugs/scienceresearch/researchareas/pharmacogenetics/ucm083378.htm. (Accessed 7 May 2015).

2. Lee, J.A. and Berg, E.L. (2013) Neoclassic drug discovery: the case for lead generation using phenotypic and functional approaches. J Biomol Screen 18, 1143-1155.

3. LaMattina, J. (2015) FDA approvals 1996 vs. 2014: The two most prolific years, but stark differences. Forbes Magazine. http://www.forbes.com/sites/johnlamattina/2015/01/07/fda-approvals-1996-vs-2014-the-two-most-prolific-years-but-stark-differences/ (Accessed 6 June 2015).

4. Hay, M., Thomas, D.W., Craighead, J.L., Economides, C. and Rosenthal, J. (2014) Clinical development success rates for investigational drugs. Nature Biotechnol 32, 40-51.

5. U.S. Food and Drug Administration. Summary of NDA Approvals & Receipts, 1938 to the present. http://www.fda.gov/AboutFDA/WhatWeDo/History/ProductRegulation/SummaryofNDAApprovalsReceipts1938tothepresent/default.htm

6. DiMasi, J.A., Hansen, R.W., Grabowski, H.G. and Lasagna, L. (1991) Cost of innovation in the pharmaceutical industry. J Health Economics 10, 107-142.

7. Statista. Spending of the U.S. pharmaceutical industry on research and development at home and abroad from 1990 to 2014 (in million U.S. dollars). http://www.statista.com/statistics/265090/us-pharmaceutical-industry-spending-on-research-and-development/ (Accessed 8 May 2015).

8. Booth, B. (2012) Cancer drug targets: the march of the lemmings. Forbes Magazine. http://www.forbes.com/sites/brucebooth/2012/06/07/cancer-drug-targets-the-march-of-the-lemmings/ (Accessed 24 March 2015).

9. Swinney, D.C. and Anthony, J. (2011) How were new medicines discovered? Nature Rev Drug Discov 10, 507-519.

10. Mestres, J., Gregori-Puigjané, E., Valverde, S. and Solé, R.V. (2009) The topology of drug-target interaction networks: implicit dependence on drug properties and target families.

11. Choi, H., Sheng, J., Gao, D., Li, F., Durrans, A., Ryu, S., Lee, S.B., Narula, N., Rafii, S., Elemento, O., Altorki, N.K., Wong, S.T.C. and Mittal, V. (2015) Transcriptome analysis of individual stromal cell populations identifies stroma-tumor crosstalk in mouse lung cancer model. Cell Reports 10, 1187-1201.

12. Werner, H.M.J., Mills, G.B. and Ram, P.T. (2014) Cancer systems biology: a peek into the future of patient care? Nature Rev Clin Oncol 11, 167-176.

13. Bianconi, F., Baldelli, E., Ludovini, V., Crinò, L., Flacco, A. and Valigi, P. (2012) Computational model of EGFR and IGF1R pathways in lung cancer: a systems biology approach to translational oncology. Biotechnology Advances 30, 142-153.

14. Cupples, L.A., Heeren, T., Schatzkin, A. and Colton, T. (1984) Multiple testing of hypotheses in comparing two groups. Ann Intern Med 100, 122-129.

15. Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4 (Article 17).

16. Liu, R., Cheng, Y., Yu, J., Lv, Q-L. and Zhou, H.-H. (2015) Identification and validation of gene module associated with lung cancer through coexpression network analysis. Gene 563, 56-62.

17. Yabuuchi, H., Niijima, S., Takematsu, H., Ida, T., Hirokawa, T., Hara, T., Ogawa, T., Minowa, Y., Tsujimoto, G. and Okuno, Y. (2011) Analysis of multiple compound–protein interactions reveals novel bioactive molecules. Mol Syst Biol 7, 472.

18. Berg, E.L., Hsu, Y.-C. and Lee, J.A. (2014) Consideration of the cellular microenvironment: physiologically relevant co-culture systems in drug discovery. Adv Drug Delivery Rev 67-70, 190-204.

19. Yong, X., Wang, P., Jiang, T., Yu, W., Shang, Y., Zhang, P. and Li, Q. (2014) Fibroblasts weaken the anti-tumor effect of gefitinib on co-cultured non-small cell lung cancer cells. Chin Med J 127, 2091-2096.

20. Straussman, R., Morikawa, T., Shee, K., Barzily-Rokni, M., Qian, Z.R., Du, J., Davis, A., Mongare, M.M., Gould, J., Frederick, D.T., Cooper, Z.A., Chapman, P.B., Solit, D.B., Ribas, A., Lo, R.S., Flaherty, K.T., Ogino, S., Wargo, J.A. and Golub, T.R. (2012) Tumour micro-environment elicits innate resistance to RAF inhibitors through HGF secretion. Nature 487, 505-509.

21. Lee, J. and Bogyo, M. (2013) Target deconvolution techniques in modern phenotypic profiling. Curr Opin Chem Biol 17, 118-126.

22. DiMasi, J.A., Grabowski, H.G. and Hansen, R.W. (2014) Innovation in the pharmaceutical industry: new estimates of R&D costs. Boston: Tufts Center for the Study of Drug Development.

Visit elsevier.com/rd-solutions or contact your nearest Elsevier office.

ASIA AND AUSTRALIA
Tel: +65 6349 0222

JAPAN
Tel: +81 3 5561 5034

KOREA AND TAIWAN
Tel: +82 2 6714 3000

EUROPE, MIDDLE EAST AND AFRICA
Tel: +31 20 485 3767

NORTH AMERICA, CENTRAL AMERICA AND CANADA
Tel: +1 888 615 4500

SOUTH AMERICA
Tel: +55 21 3970 9300

Copyright© 2015 Elsevier B.V. All rights reserved. July 2015