the building blocks of life. built for you

Post on 05-Jan-2016



THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU

Putting Engineering back into Protein Engineering

Jun Liao, UC Santa Cruz
Manfred K. Warmuth, UC Santa Cruz

Jeremy Minshull, DNA 2.0

Protein Engineering: Current Paradigms

1. Mechanism-based (rational): detailed structural analysis

2. Empiricism-based (non-rational): library-based

Mechanism-Based Protein Engineering

Based on thermodynamic principles

• Calculations are approximate
– calculation cost
– structures are really not rigid (MDS)
• Calculations are primarily able to predict binding
– catalysis is a special case of binding to a transition state
• Changes in amino acids are designed based on these principles
– very small numbers (<5) of new proteins are synthesized and tested

Empiricism-Based Protein Engineering

• Uses similar principles to evolution
– make many variants
– screen to find those with the best properties
• No mechanistic understanding needed
• Produces large numbers of variants (>1,000), which are very difficult and expensive to screen for practically relevant properties

[Figure: proteins related to wild type → simulated crossover → new variants]

The Key Challenge in Protein Engineering

=Reality

What we need is not what we assay for.

• Molecular mechanistic models (do not model activity)
• High-throughput screens (surrogate assays)

What We Want in Protein Engineering

Wish list
• No need to develop a surrogate assay
• Variants are tested directly under application conditions
• Rapid process

Requirements
• Identification of appropriate amino acid substitutions
• Design and synthesis of information-rich variants
• Interpretation of quantitative functional data using machine learning techniques

Protein Engineering using Machine Learning

1. Starting point: select a protein with some correct initial properties.
2. Initial design: (a) choose substitutions; (b) design an initial variant set (<50) containing those substitutions.
3. Reality check: synthesize and test the variant set for function(s) of interest.
4. Machine learning: model the effect of sequence changes on function(s) of interest.
5. New design: propose a new variant set (<50) based on the model.
6. Iterate: return to the reality check with the new variant set.
7. End: select the best variant(s).
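The design, test, model, redesign cycle on this slide can be sketched as a small loop. This is only an illustration under stated assumptions: the `toy_assay` landscape, the random initial design, the "mean with minus mean without" effect estimate, and the single-site perturbation rule are all hypothetical stand-ins, not the methods used in the talk.

```python
import random
from statistics import mean


def flip(variant, site):
    """Return a copy of variant with one substitution switched on or off."""
    return tuple(1 - x if i == site else x for i, x in enumerate(variant))


def run_campaign(assay, n_sites, n_rounds=3, set_size=20, seed=0):
    rng = random.Random(seed)
    # Initial design: random on/off combinations stand in for a designed set.
    batch = [tuple(rng.randint(0, 1) for _ in range(n_sites))
             for _ in range(set_size)]
    data = {}
    for _ in range(n_rounds):
        # Reality check: measure every variant in the current set.
        for variant in batch:
            data[variant] = assay(variant)
        # Machine learning (toy model): the effect of a substitution is the
        # mean activity with it minus the mean activity without it.
        effects = []
        for site in range(n_sites):
            on = [y for v, y in data.items() if v[site] == 1]
            off = [y for v, y in data.items() if v[site] == 0]
            effects.append(mean(on) - mean(off) if on and off else 0.0)
        # New design: the predicted best combination plus single-site
        # perturbations of it.
        best_guess = tuple(1 if e > 0 else 0 for e in effects)
        batch = [best_guess] + [flip(best_guess, rng.randrange(n_sites))
                                for _ in range(set_size - 1)]
    # End: select the best variant that was actually measured.
    best = max(data, key=data.get)
    return best, data


# Hypothetical additive landscape, used only to exercise the loop.
TRUE_EFFECTS = [0.5, -0.3, 0.8, 0.2, -0.6, 0.4]


def toy_assay(variant):
    return 1.0 + sum(e * x for e, x in zip(TRUE_EFFECTS, variant))


best, data = run_campaign(toy_assay, n_sites=len(TRUE_EFFECTS))
print(best, round(data[best], 2), len(data))
```

Each round uses at most 20 variants here, in keeping with the "<50 per set" figure on the slide.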

Engineering of Proteinase K

• Long-term goal: engineer proteinase K to degrade polylactic acid
• Member of the serine protease family
– large amounts of phylogenetic and sequence information available
• Several different measurable activities available for optimization


Expert System for Substitution Selection

Expert system:
– Calculate 9 independent scores that measure changes that have succeeded elsewhere in nature
– Weight and combine the scores to pick the best changes

[Figure: proteins related to proteinase K]

19 switches = search space of 2^19 ≈ 500,000 combinations
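The size of the search space follows directly from treating each of the 19 substitutions as an on/off switch. A quick check of the arithmetic (the three-switch enumeration is only there to show the counting rule on a case small enough to list):

```python
from itertools import product

N_SWITCHES = 19  # from the slide: 19 candidate substitutions, each on or off

# Every switch doubles the number of combinations.
search_space = 2 ** N_SWITCHES
print(search_space)  # 524288, i.e. roughly 500,000

# The same rule on a case small enough to enumerate explicitly.
small_case = list(product([0, 1], repeat=3))
print(len(small_case))  # 8
```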

Finding Optima in Complex Landscapes: Design of Experiment

[Figure: two panels showing sampled variants (x) on axes Aa 1 and Aa 2. Panel 1: changing 1 amino acid at a time. Panel 2: making multiple changes simultaneously.]

…Now try to envision doing this not with 2, but 200 amino acids / dimensions

Design of Initial Proteinase K Variants

pos   95  97 107 123 132 138 145 151 167 180 194 199 208 236 237 265 267 273 293 299 310 332 337 355
wt     N   P   S   S   I   E   M   Y   V   L   Y   A   K   A   R   P   V   S   G   L   I   K   S   P
var    C   S   D   A   V   A   F   A   I   I   S   S   H   V   N   S   I   T   A   C   K   R   N   S

Substitutions carried by each variant, listed in column order (the column position of each entry is not preserved in this transcript):

1: S N T A R S
2: A A A K R S
3: C F I S N T
4: S A I S V I
5: D V H S C N
6: S D I V N K
7: A A S H S S
8: C S I A C R
9: C V A F I H
10: V N T A R S
11: S A S C K N
12: C D A I S N
13: A A I S H C
14: S F N T A K
15: V V S I R S
16: S A S V C S
17: C D I I A K
18: F N S I R N
19: A V A S H T
20: A H V I A C
21: D V A F N S
22: S I S S S K
23: C A I N T R
24: C
25: S C
26: S S C
27: V
28: D A F
29: F I I K S
30: F A S K R


Back to Proteinase K


First Proteinase K Dataset

[Figure: activity of the initial variants, expressed as factor increase relative to wt (axis range 0 to 3)]


Sequence-Activity Modeling: How Does it Work?

1. Represent the sequences as a matrix

Seq1 AGRWGIGAYHKLIMA
Seq2 AGRTGVGVYHKLIMA
Seq3 AGRWGIGVYHRLIMA
Seq4 AGRTGVGAYHRLIMA

becomes

       T   W   V   I   V   A   R   K
x      x1  x2  x3  x4  x5  x6  x7  x8
Seq1   0   1   0   1   0   1   0   1
Seq2   1   0   1   0   1   0   0   1
Seq3   0   1   0   1   1   0   1   0
Seq4   1   0   1   0   0   1   1   0

2. Measure the activity or activities of interest under the final application conditions

3. Model the activity y as a weighted sum of the matrix columns: y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
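The encoding in step 1 and the weighted sum in step 3 can be reproduced in a few lines. The four sequences and the feature order (T, W, V, I, V, A, R, K) come from the slide; the coefficient values are hypothetical placeholders, since the slide gives no fitted weights.

```python
SEQS = {
    "Seq1": "AGRWGIGAYHKLIMA",
    "Seq2": "AGRTGVGVYHKLIMA",
    "Seq3": "AGRWGIGVYHRLIMA",
    "Seq4": "AGRTGVGAYHRLIMA",
}


def variable_positions(seqs):
    """0-based alignment columns where the sequences differ."""
    seqs = list(seqs)
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def encode(seq, features):
    """One 0/1 entry per (position, residue) feature."""
    return [1 if seq[pos] == aa else 0 for pos, aa in features]


def predict(row, coeffs):
    """y = c1*x1 + c2*x2 + ... + ci*xi"""
    return sum(c * x for c, x in zip(coeffs, row))


# Feature order as on the slide: T W V I V A R K -> x1..x8.
FEATURES = [(3, "T"), (3, "W"), (5, "V"), (5, "I"),
            (7, "V"), (7, "A"), (10, "R"), (10, "K")]

X = {name: encode(seq, FEATURES) for name, seq in SEQS.items()}
for name, row in X.items():
    print(name, row)

# Hypothetical coefficients, for illustration only.
coeffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(predict(X["Seq1"], coeffs))
```

In practice the coefficients would be estimated from the measured activities of step 2 rather than written by hand.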

Assessing the Proteinase K Sequence-Activity Relationship

[Figure: scatter plot comparing predicted and measured activity, with wt marked]

y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi

Learning Methods

• Variety of regression methods
– Ridge Regression & Lasso
– SVM Regression & LPSVM Regression
– Matching Loss Regression & One-norm Matching Loss Regression
– Partial Least Squares Regression
– LPBoost Regression
• Use bagging to improve the prediction stability
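As one concrete instance of the methods listed, ridge regression combined with bagging might look like the sketch below. The slides do not say how bagging was applied; averaging the coefficients of ridge models fitted to bootstrap resamples is an assumption here, and the data set (all combinations of three on/off features with made-up weights) is hypothetical.

```python
import random
from itertools import product


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Coefficients minimising ||Xc - y||^2 + lam * ||c||^2 (no intercept)."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def bagged_ridge(X, y, lam=0.1, n_models=50, seed=0):
    """Average ridge coefficients over bootstrap resamples of the variants."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        coef = ridge_fit([X[i] for i in idx], [y[i] for i in idx], lam)
        total = [t + c for t, c in zip(total, coef)]
    return [t / n_models for t in total]


# Hypothetical data: every combination of 3 substitutions, made-up weights.
TRUE_C = [1.0, -0.5, 2.0]
X = [list(v) for v in product([0, 1], repeat=3)]
y = [sum(c * x for c, x in zip(TRUE_C, row)) for row in X]

full_fit = ridge_fit(X, y, lam=1e-9)
bagged = bagged_ridge(X, y)
print([round(c, 3) for c in full_fit])
print([round(c, 3) for c in bagged])
```

The spread of the per-resample coefficients around their average also gives a rough measure of how stable each weight estimate is.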

Variants Design I

• Main issue: Exploitation vs. Exploration
• Optimum design (Exploitation)
– Take the combination of substitutions predicted to have maximal activity
– Also consider
  • substitution frequency in the dataset
  • variation of weight estimation
– Used in 2nd & 3rd iterations
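For an additive model over on/off substitutions, the combination with maximal predicted activity is simply every substitution whose coefficient is positive. The sketch below adds the two extra considerations from the slide in one possible way: the lower-bound rule (`k` standard deviations) and the minimum observation count (`min_count`) are assumptions, as are the substitution names and all numbers.

```python
def optimum_design(mean_weight, weight_sd, counts, k=1.0, min_count=3):
    """Pick substitutions whose estimated benefit looks reliable."""
    chosen = []
    for sub, w in mean_weight.items():
        reliable = w - k * weight_sd[sub] > 0  # lower bound still positive
        observed = counts[sub] >= min_count    # seen often enough in the data
        if reliable and observed:
            chosen.append(sub)
    return chosen


# Hypothetical weight estimates, their variation, and how often each
# substitution occurs in the tested variants.
mean_weight = {"sub1": 0.8, "sub2": 0.3, "sub3": -0.4, "sub4": 0.5}
weight_sd = {"sub1": 0.2, "sub2": 0.5, "sub3": 0.1, "sub4": 0.1}
counts = {"sub1": 10, "sub2": 8, "sub3": 12, "sub4": 2}

print(optimum_design(mean_weight, weight_sd, counts))  # ['sub1']
```

Here sub2 is dropped because its weight is too uncertain, sub3 because it is predicted to be harmful, and sub4 because it was observed too rarely.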

Variants Design II

• Diversity design (Exploration)
– Calculate the combination of substitutions predicted to have maximal activity that is also
  • no more than 5 changes from a sequence that has already been tested
  • no closer than 3 changes from a sequence that has already been tested or selected for synthesis
– Used in 2nd iteration
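The two distance rules of the diversity design can be read as Hamming-distance constraints on the on/off vectors. A minimal greedy sketch under that reading follows; the exhaustive ranking, the greedy selection order, and the weights are assumptions, and only the thresholds 5 and 3 come from the slide.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two variants differ."""
    return sum(x != y for x, y in zip(a, b))


def diversity_design(weights, tested, n_pick, max_from_tested=5, min_sep=3):
    """Greedily pick the best-scoring variants that satisfy both rules."""
    score = lambda v: sum(w * x for w, x in zip(weights, v))
    # Rank every on/off combination by predicted activity, best first.
    candidates = sorted(product([0, 1], repeat=len(weights)),
                        key=score, reverse=True)
    chosen = []
    for v in candidates:
        if len(chosen) == n_pick:
            break
        # Rule 1: within 5 changes of at least one tested sequence.
        near = min(hamming(v, t) for t in tested) <= max_from_tested
        # Rule 2: at least 3 changes away from everything tested or selected.
        apart = all(hamming(v, u) >= min_sep for u in list(tested) + chosen)
        if near and apart:
            chosen.append(v)
    return chosen


# Hypothetical model weights for 8 substitutions; only wild type tested.
weights = [0.9, 0.7, 0.5, 0.4, 0.3, -0.2, -0.4, 0.1]
tested = [(0,) * 8]
picks = diversity_design(weights, tested, n_pick=3)
for p in picks:
    print(p)
```

Enumerating all combinations is fine for 8 switches and still feasible for the 2^19 combinations of the proteinase K study, though a real implementation would probably search more cleverly.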

Three Iterations of Activity Engineering

[Figure: activity relative to wild type (axis 0 to 50) for variants in the order synthesized (axis 0 to 120), with wild-type marked. 1st set: 34 variants; 2nd set: 24 variants; 3rd set: 38 variants.]

Only 58 variants were tested to allow design of the fourth set, which contained
• 3 variants 20-30× improved over wild type
• 50% of variants more active than the best of previous sets
• 70% of variants more active than wild type
• 3-11 changes found in variants better than WT

Improving Activity

[Figure "Activity Improvement": activity (pmol/s/ml) and half-life at 68°C (s) for variants v501, v502, v503, v505, v513, v515, v518, v526, v544, v545, v551, v556, v557, v558, v560 and NS9]

pos  107 123 132 145 151 167 180 194 199 208 237 265 267 273 293 310 332 337 355
WT     S   S   I   M   Y   V   L   Y   A   K   R   P   V   S   G   I   K   S   P

Substitutions carried by each variant, listed in column order (the column position of each entry is not preserved in this transcript):

501: A A
502: A H A
503: A H A R
505: A H T A R N
513: A I H T A R N
515: V A A
518: V A I I T A
526: A V A I T A
544: V A H T A N
545: V A H T A R N
551: A T A
556: V A I T A
557: A H A R N
558: V A H T A R
560: A V A I H T A N

Variants are Improved in Multiple Properties

Conclusions

• Machine learning
– Making a very small number of variants (58) allows a productive search of a total space with 500,000 possible combinations
• Synthetic biology
– Recent advances in gene synthesis methods were essential for this type of exploration

The Future

• Proteins are the building blocks of life with a wide array of applications (therapeutics, diagnostics, industrial catalysts)
• Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat
• We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don’t have to be compatible with life)
