bayesian optimization for synthetic gene...

38
Bayesian Optimization for Synthetic Gene Design Javier Gonz´ alez DDHL workshop, Nottingham, UK, 2015

Upload: others

Post on 21-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Bayesian Optimization for Synthetic GeneDesign

Javier Gonzalez

DDHL workshop, Nottingham, UK, 2015

Page 2: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

At the University of Sheffield:

I Department of Computer Science.

I Department of Chemical and Biological Engineering.

Neil D. Lawrence, David James, Joseph Longworth, Paul Dobson,Josselin Noirel and Mark Dickman

Page 3: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

The big picture

I 8 of the 10 top-selling drugs in are biologics (monoclonalantibodies) used in rheumatology, dermatology, and varioustypes of Cancer.

I Huge market of $73 billion just in Europe.

I Growing interest in the availability of biosimilars.

Page 4: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

New drug production paradigm

I Use mammalian cells to make protein products.I Control the ability of the cell-factory to use synthetic DNA.

Cornerstone of modern biotechnology: Design DNA code that willbest enable the cell-factory to operate most efficiently.

Synthetic genes!

Page 5: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

How does a cell work?

A mammalian cell in numbers:

I approx. 20,000 genes able to produce 20,000 proteins.

I A few of them are of therapeutical interest.

I The average gene length is 7902 bases pairs (A,T G, C).

I Millions of molecules interactions.

Central dogma of systems biology

’Natural’ genes are not optimized to maximize protein production.

Page 6: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Why can we rewrite the genetic code?

Considerations

I Different gene sequences may encode the same protein...

I ...but the sequence affects the synthesis efficiency.

I The codon usage is the key (codon = triplet of bases).

The genetic code is redundant:

UUGACA = UUGACU

Both genes encode the same protein.

Page 7: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Key question

Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most

efficient way?

I Average mamalian gene: 7000 nucleotides.

I Consider a gene with coding region of 900 nucleotides: 300codons.

I Assume only pairs of synonymous codons.

I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).

Page 8: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Key question

Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most

efficient way?

I Average mamalian gene: 7000 nucleotides.

I Consider a gene with coding region of 900 nucleotides: 300codons.

I Assume only pairs of synonymous codons.

I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).

Page 9: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Key question

Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most

efficient way?

I Average mamalian gene: 7000 nucleotides.

I Consider a gene with coding region of 900 nucleotides: 300codons.

I Assume only pairs of synonymous codons.

I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).

Page 10: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Key question

Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most

efficient way?

I Average mamalian gene: 7000 nucleotides.

I Consider a gene with coding region of 900 nucleotides: 300codons.

I Assume only pairs of synonymous codons.

I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).

Page 11: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Key question

Given a protein of interest, which is the recombinant genesequence that will enable the cell to produce it in the most

efficient way?

I Average mamalian gene: 7000 nucleotides.

I Consider a gene with coding region of 900 nucleotides: 300codons.

I Assume only pairs of synonymous codons.

I ≈ 2300 ≈ 2× 1090 possible recombinant gene alternatives (inthe order of the number of atoms in the universe).

Page 12: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Machine Learning challenges

I Very complex cell behaviour. Limited prior knowledge.

I Multi-task optimization problem: increase cell efficiency,maintain cell survival, control protein and mRNA stability.

I Lab experiments are very expensive.

I Gene tests can be run in parallel.

I The design space is defined in terms of long string sequences.I Alternative: gene features, high dimensional problem.

Page 13: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Gaussian processes

Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.

0 5 10 15

t−4

−3

−2

−1

0

1

2

3

4

f

I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.

Page 14: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Gaussian processes

Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.

0 5 10 15

t−4

−3

−2

−1

0

1

2

3

4

f

I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.

Page 15: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Gaussian process

Gaussian Processes: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.

0 5 10 15

t−1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

f

I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.

Page 16: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Gaussian processes

Gaussian Process: Probability density over functions, such thateach linear finite dimensional restriction is multivariate Gaussian.

0 5 10 15

t−1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

2.0

2.5

I Fully parametrized by a covariance function K .I Close-form posterior under Gaussian likelihoods.

Page 17: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Page 18: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 19: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 20: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 21: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 22: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 23: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 24: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 25: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 26: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Tools: Bayesian Optimization

BO: Heuristic to reduce the number of evaluations in optimizationproblems [Mockus, 1978; Snoek et al., 2012].

Example: x∗ = arg min[0,1] f (x)?

Expected Improvement: Ex [max(0, f (xmin)− f (x))]

Page 27: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

How to design a synthetic gene?

A good model is crucial

Gene sequence features → protein production efficiency.

Bayesian Optimization principles for gene design

do:

1. Build a GP model as an emulator of the cell behavior.

2. Obtain a set of gene design rules (features optimization).

3. Design one/many new gene/s coherent with the design rules.

4. Test genes in the lab (get new data).

until the gene is optimized (or the budget is over...).

Page 28: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Model as an emulator of the cell behavior

Model inputsFeatures (xi ) extracted gene se-quences (si ): codon frequency, cai,gene length, folding energy, etc.

Model outputsTranslation and trasncriptions ratesf := (fα, fβ).

Model typeMulti-output Gaussian process f ≈GP(m,K) where K is a corregion-alization covariance for the two-output model (+ SE with ARD).

The correlation in the outputs help!

Page 29: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Obtaining optimal gene design rules

Maximize the averaged Expected improvement for both outputs[Swersky et al. 2013]

α(x) = σ(x)(−uΦ(−u) + φ(u))

where u = (ymax − m(x))/σ(x) and

m(x) =1

2

∑l=α,β

f∗(x), σ2(x) =1

22

∑l ,l ′=α,β

(K∗(x, x))l ,l ′ .

A batch method is used when several experiments can be run inparallel

Page 30: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Designing new genes coherent with the optimal design rules

Simulating-matching approach:

1. Simulate genes ‘coherent’ with the target (same aminoacids).

2. Extract features.

3. Rank synthetic genes according to their similarity with the‘optimal’ design rules.

Ranking criterion: eval(s|x?) = ∑pj=1 wj |xj − x?j |

I x?: optimal gene design rules.

I s, xj generated ‘synonyms sequence’ and its features.

I wj : weights of the p features (inverse lengthscales of themodel covariance).

Page 31: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Experiments

I Optimization gene designs in mammalian cells.

I Dataset in Schwanhausser et al. (2011) for 3810 genes rates.Sequences were extracted fromhttp:wet-labpic/www.ensembl.org.

I 250 features involving 5’UTR, 3’UTR and coding region.

I Gaussian process with ARD and coregionalized outputs.

I Selection of 10 random difficult-to-express genes (average logratio < 1.5).

I 10,000 random ‘synonyms sequences’ generated from eachgene.

Page 32: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Features importance

Page 33: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Predictions

The model is able to predict translation rates:

Page 34: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Results for 10 low-expressed genes

Results from simulation: currently testing the results in the lab!

Page 35: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Results for the CMV-pp65 Recombinant Protein

Alternative model: translation rates + mRNA half life.

s half-life CMV mRNA half-lifeCMV mRNA half-life

0 2000 4000 6000 8000 10000

Gene design

4

5

6

7

8

9

10

Tra

nsl

ati

on r

ate

(lo

g2

)

CMV - mRNA translation

Average prediction

1*stevd

Wild type

Predicted results for the 10,000 simulated sequences.

Page 36: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Results for the CMV-pp65 Recombinant Protein

2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8

Half-life(log2)

3

4

5

6

7

8

9

10

Tra

nsl

ati

on r

ate

(lo

g2)

Pareto frontier

CMV - Optimal genes

Optimal genes

In silico designed genes

Multi-objective optimization problem.

Page 37: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Prediction/design web tool

Web app for gene design based on

I GPy (https://github.com/SheffieldML/GPy).

I GPyOpt (https://github.com/SheffieldML/GPyOpt).

Page 38: Bayesian Optimization for Synthetic Gene Designjaviergonzalezh.github.io/presentations/slides_ddhl2015.pdfAt theUniversity of She eld: I Department of Computer Science. I Department

Final remarks

I Bayesian optimization is a promising technique to designsynthetic genes: reduces drastically the number of neededexperiments.

I Very important aspect of the problem → to have a goodsurrogate model for the cell behavior.

I Currently, working out a model with more outputs, such asthe protein stability and cell survival.

I Alternative approach: focus on the direct optimization of thesequences. Combinatorial problem.