Estimating fitness landscapes

John Pinney


Author: duane-ball

Post on 12-Jan-2016





Genotype network

[Figure: genotype network, built up node by node across slides; 0 = wild type]

Genotype network + fitness values at every node = fitness landscape

With an accurate fitness landscape we could predict:

mutational trajectories e.g. under drug treatment.

rates of emergence of drug resistance.

optimal drug combinations to prevent emergence of drug resistance.
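As a minimal sketch of the first kind of prediction, assume a toy landscape in which genotypes are sets of mutation labels and every fitness value is invented for illustration. The strictly fitness-increasing trajectories from the wild type can then be enumerated as below. The names `FITNESS`, `neighbours` and `uphill_paths` are hypothetical, and a real analysis would also weight each step by mutation rate and selection strength.

```python
# Toy fitness landscape: genotypes are frozensets of mutation labels.
# All fitness values below are invented for illustration only.
WILD_TYPE = frozenset()
FITNESS = {
    frozenset(): 1.0,
    frozenset({"A"}): 1.2,
    frozenset({"B"}): 0.8,
    frozenset({"A", "B"}): 1.5,
}


def neighbours(genotype, mutations):
    """Genotypes that differ from `genotype` by gain or loss of one mutation."""
    return [genotype ^ {m} for m in mutations]


def uphill_paths(start, fitness, mutations):
    """Enumerate every strictly fitness-increasing path from `start` to a local peak."""
    better = [
        g for g in neighbours(start, mutations)
        if g in fitness and fitness[g] > fitness[start]
    ]
    if not better:
        return [[start]]  # `start` is a local peak
    paths = []
    for g in better:
        for tail in uphill_paths(g, fitness, mutations):
            paths.append([start] + tail)
    return paths


paths = uphill_paths(WILD_TYPE, FITNESS, ["A", "B"])
for path in paths:
    print(" -> ".join("".join(sorted(g)) or "wt" for g in path))
```

Genotypes missing from `FITNESS` are simply skipped, which is exactly the limitation raised next: trajectories can only be followed through genotypes whose fitness is known or estimated.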

At best, fitness data for only relatively few genotypes will be available.

How can we estimate unobserved values?

How can we tell if these estimates are good enough for real applications of fitness landscapes?

How can we estimate unobserved values?

Specific mutations are expected to contribute to fitness in different ways.

=> Machine learning based on mutations as features.

HIV-1 drug resistance database

A great resource for exploring genotype-phenotype relationships.

Includes a large amount of sequence data from clinical and lab studies from the early 1990s onwards.

In vitro data

Viruses with known sequence are assayed to assess their ability to reproduce in vitro in the presence of various drugs.

Most of these isolates were obtained from patients who may have been untreated or on any number of drug regimes.

=> some biases in sequence coverage

Genotypes are described using mutations relative to a particular consensus sequence (e.g. subtype B).
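A minimal sketch of this representation, assuming aligned amino-acid strings of equal length: each isolate becomes a list of substitutions relative to the consensus, and the lists are then turned into 0/1 indicator rows. The function names and the five-residue sequences are hypothetical toy inputs, not database records.

```python
def mutations_vs_consensus(sequence, consensus):
    """List substitutions such as 'I3V': consensus residue, 1-based position, observed residue."""
    if len(sequence) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{c}{pos}{s}"
        for pos, (c, s) in enumerate(zip(consensus, sequence), start=1)
        if c != s
    ]


def indicator_matrix(mutation_lists):
    """Return (sorted feature names, one 0/1 row per genotype)."""
    features = sorted({m for muts in mutation_lists for m in muts})
    rows = [[1 if f in muts else 0 for f in features] for muts in mutation_lists]
    return features, rows


# Toy aligned sequences, invented for illustration.
consensus = "PQITL"
isolates = ["PQITL", "PQVTL", "PKVTL"]

mutation_lists = [mutations_vs_consensus(seq, consensus) for seq in isolates]
features, rows = indicator_matrix(mutation_lists)
```

Each row of `rows` is one genotype and each column one mutation, which is the form most regression methods expect as input.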

Summary of PhenoSense results for a variety of protease inhibitors (PIs).

Machine learning from in vitro data

Using mutations relative to the consensus sequence as indicator variables, we can apply standard machine learning techniques to predict fitness under a given condition from the sequence.

Given the large number of uninformative features, LASSO and other techniques that include feature selection tend to do well.
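To illustrate why feature selection helps, here is a minimal LASSO sketch using coordinate descent in plain Python. It assumes the objective (1/2n)·||y − b − Xw||² + lam·||w||₁ with an unpenalised intercept b. The data are a toy example in which only the first mutation affects the measured value; in practice a tested library implementation would be used instead.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold`; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for LASSO with an unpenalised intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                pred_without_j = b + sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
        # Re-centre the intercept on the current residuals.
        b = sum(y[i] - sum(X[i][k] * w[k] for k in range(p)) for i in range(n)) / n
    return b, w


# Toy data (invented): two mutation indicators, only the first is informative.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0.0, 0.0, 2.0, 2.0]

intercept, weights = lasso_fit(X, y, lam=0.1)
```

The uninformative mutation receives a coefficient of zero and drops out of the model, while the informative coefficient is shrunk below its unpenalised value of 2. That shrinkage is the price paid for the selection.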

from Rhee et al. (2006)

using least-squares regression to obtain coefficients for the contribution of each mutation to resistance against a selection of PI drugs.

from Hinkley et al. (2011)

using generalised kernel ridge regression.

tested a model using only main effects (ME) against a model incorporating epistasis: inter-genic, intra-genic or both (MEEP)
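The main-effects versus epistasis comparison can be sketched with plain kernel ridge regression. This is not the generalised variant used by Hinkley et al.: it assumes a linear kernel for main effects and a degree-2 polynomial kernel for pairwise interactions. The toy fitness function, the three loci and the held-out genotype are all invented for illustration.

```python
from itertools import product


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def linear_kernel(x, z):
    """Main effects only: intercept plus one additive term per mutation."""
    return 1.0 + sum(a * b for a, b in zip(x, z))


def pairwise_kernel(x, z):
    """Degree-2 polynomial kernel: main effects plus pairwise interaction terms."""
    return linear_kernel(x, z) ** 2


def krr_fit(X, y, kernel, lam):
    """Dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, y)


def krr_predict(X, alpha, kernel, x_new):
    return sum(a * kernel(x, x_new) for a, x in zip(alpha, X))


def toy_fitness(g):
    """Invented landscape: additive effects plus one epistatic pair (loci 0 and 1)."""
    return 1.0 * g[0] + 0.5 * g[1] + 0.2 * g[2] + 1.5 * g[0] * g[1]


genotypes = list(product([0, 1], repeat=3))
held_out = (1, 1, 1)  # the unobserved genotype to predict
train = [g for g in genotypes if g != held_out]
y_train = [toy_fitness(g) for g in train]

LAM = 1e-6
pred_main = krr_predict(train, krr_fit(train, y_train, linear_kernel, LAM),
                        linear_kernel, held_out)
pred_epi = krr_predict(train, krr_fit(train, y_train, pairwise_kernel, LAM),
                       pairwise_kernel, held_out)
print(f"true {toy_fitness(held_out):.2f}  main effects {pred_main:.2f}  "
      f"with epistasis {pred_epi:.2f}")
```

On this toy landscape the main-effects model misjudges the unobserved triple mutant because it cannot represent the interaction between loci 0 and 1, whereas the pairwise kernel recovers it. Real data are noisy, so the gain from epistatic terms has to be measured on held-out genotypes rather than assumed.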

Hinkley et al. (2011) found an ~18% improvement in predictive power by including epistasis between mutations within the same gene, e.g. the HIV protease shown.

In vivo data

A drug resistance fitness landscape in vitro may not be the same as that experienced by the virus when exposed to the patient's immune system.

Another approach is to learn fitness landscapes by comparing the sequences of drug-naïve viruses against those obtained from patients on a specific drug regime.

Machine learning from in vivo data

Deforche et al. (2008) apply a Bayesian network.

Probability of a set of mutations (A1, A2, ..., An)

Fitness of a set of mutations (A1, A2, ..., An)

A phylogenetic guide tree is used to take sequence sampling bias into account.
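For the probability of a set of mutations, a Bayesian network factorises the joint probability into one conditional term per mutation given its parents in the network. The sketch below shows only that general factorisation. The three-node structure and every probability in it are hypothetical, and it does not attempt to reproduce how Deforche et al. derive fitness from these probabilities or how the guide tree enters the estimate.

```python
from itertools import product

# Hypothetical network: A1 has no parents, A2 depends on A1, A3 depends on A2.
PARENTS = {"A1": (), "A2": ("A1",), "A3": ("A2",)}

# CPT[node][parent values] = P(node present | parent values); values are invented.
CPT = {
    "A1": {(): 0.3},
    "A2": {(0,): 0.1, (1,): 0.6},
    "A3": {(0,): 0.05, (1,): 0.4},
}


def joint_probability(assignment):
    """P(A1, ..., An) as the product of P(Ai | parents(Ai)) over all nodes."""
    p = 1.0
    for node, parents in PARENTS.items():
        p_present = CPT[node][tuple(assignment[q] for q in parents)]
        p *= p_present if assignment[node] else 1.0 - p_present
    return p


# Probability that all three mutations are present together.
p_all = joint_probability({"A1": 1, "A2": 1, "A3": 1})

# Sanity check: the joint distribution sums to one over all genotypes.
total = sum(
    joint_probability(dict(zip(PARENTS, values)))
    for values in product([0, 1], repeat=len(PARENTS))
)
```

Estimating such a network separately from drug-naïve and treated sequences would give two probabilities for the same genotype, and the comparison between them is what carries information about selection under treatment.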

Predicting and validating mutational trajectories

Where next?