THE BUILDING BLOCKS OF LIFE. BUILT FOR YOU

Putting Engineering back into Protein Engineering

Jun Liao, UC Santa Cruz

Manfred K. Warmuth, UC Santa Cruz

Jeremy Minshull, DNA 2.0

Protein Engineering Current Paradigms

1. Mechanism-based (rational): detailed structural analysis

2. Empiricism-based (non-rational): library-based

Mechanism-Based Protein Engineering

Based on thermodynamic principles

• Calculations are approximate
– calculation cost
– structures are really not rigid (MDS)

• Calculations are primarily able to predict binding
– catalysis is a special case of binding to a transition state

• Changes in amino acids are designed based on these principles
– very small numbers (<5) of new proteins are synthesized and tested

Empiricism-Based Protein Engineering

• Uses similar principles to evolution
– make many variants
– screen to find those with the best properties

• No mechanistic understanding needed

• Produces large numbers of variants (>1,000), which are very difficult / expensive to screen for practically relevant properties

[Figure labels: proteins related to wild type; simulated cross over; new variants]

The Key Challenge in Protein Engineering

=Reality

What we need is not what we assay for…

Molecular mechanistic models (do not model activity)

High throughput screens (surrogate assays)

What we want in Protein Engineering

Wish List

• No need to develop surrogate assay

• Variants are tested directly under application conditions

• Rapid process

Requirements

• Identification of appropriate amino acid substitutions

• Design and synthesis of information-rich variants

• Interpretation of quantitative functional data using machine learning techniques

Protein Engineering using Machine Learning

Starting point: Select a protein with some correct initial properties.

Initial design: a) Choose substitutions. b) Design an initial variant set (<50) containing those substitutions.

Reality check: Synthesize and test the variant set for function(s) of interest.

Machine learning: Model the effect of sequence changes on function(s) of interest.

New design: Propose a new variant set (<50) based on the model.

Iterate.

Select the best variant(s).

Engineering of Proteinase K

• Long-term goal of engineering proteinase K to degrade polylactic acid

• Member of the serine protease family
– Large amounts of phylogenetic and sequence information available

• Several different measurable activities available for optimization

Expert System for Substitution Selection

Expert system:
– Calculation of 9 independent scores that measure changes that have succeeded in other places in Nature
– Weight and combine scores to pick best changes
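
The "weight and combine" step can be pictured with a minimal sketch. The slides do not say what the nine scores are or how they are weighted, so the candidate names, score values and equal weights below are hypothetical placeholders, and a plain weighted sum is only one possible way to combine them.

```python
# Hypothetical illustration: names, scores and weights are placeholders,
# not values from the talk.

def combined_score(scores, weights):
    """Weighted sum of the individual scores for one substitution."""
    if len(scores) != len(weights):
        raise ValueError("need one weight per score")
    return sum(w * s for w, s in zip(weights, scores))


def pick_best(candidates, weights, n_pick):
    """Rank candidate substitutions by combined score, highest first."""
    ranked = sorted(
        candidates,
        key=lambda name: combined_score(candidates[name], weights),
        reverse=True,
    )
    return ranked[:n_pick]


# Nine placeholder scores per candidate substitution.
candidates = {
    "sub_A": [1, 0, 1, 1, 0, 1, 0, 1, 1],
    "sub_B": [0, 0, 1, 0, 0, 1, 0, 0, 0],
    "sub_C": [1, 1, 1, 1, 1, 0, 1, 1, 1],
}
weights = [1.0] * 9  # equal weighting, purely for illustration

best = pick_best(candidates, weights, n_pick=2)
print(best)
```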

Proteins related to proteinase K

19 switches = search space of 2^19 ≈ 500,000
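
As a quick check of the arithmetic: 19 positions that can each be either wild-type or substituted give 2 to the power 19 combinations. The figure of 58 tested variants comes from later in the talk; the variable names are just for this sketch.

```python
n_switches = 19                    # positions that are either wild-type or substituted
search_space = 2 ** n_switches     # every on/off combination of the switches
print(search_space)                # 524288, i.e. roughly 500,000

tested = 58                        # variants tested before the final design round
fraction = tested / search_space   # share of the space actually sampled
print(f"{fraction:.4%}")
```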


Finding Optima in Complex Landscapes: Design of Experiment

Changing 1 amino acid at a time

Making multiple changes simultaneously

…Now try to envision doing this not with 2, but 200 amino acids / dimensions


Design of Initial Proteinase K Variants

pos   95  97 107 123 132 138 145 151 167 180 194 199 208 236 237 265 267 273 293 299 310 332 337 355
var    C   S   D   A   V   A   F   A   I   I   S   S   H   V   N   S   I   T   A   C   K   R   N   S
wt     N   P   S   S   I   E   M   Y   V   L   Y   A   K   A   R   P   V   S   G   L   I   K   S   P

Substitutions in each variant, in position order (column alignment not recoverable):

1: S N T A R S
2: A A A K R S
3: C F I S N T
4: S A I S V I
5: D V H S C N
6: S D I V N K
7: A A S H S S
8: C S I A C R
9: C V A F I H
10: V N T A R S
11: S A S C K N
12: C D A I S N
13: A A I S H C
14: S F N T A K
15: V V S I R S
16: S A S V C S
17: C D I I A K
18: F N S I R N
19: A V A S H T
20: A H V I A C
21: D V A F N S
22: S I S S S K
23: C A I N T R
24: C
25: S C
26: S S C
27: V
28: D A F
29: F I I K S
30: F A S K R

Back to Proteinase K

First proteinase K dataset

[Figure: first proteinase K dataset. Axis: activity (factor increase relative to wt)]

Sequence-Activity Modeling: How Does it Work?

1. Represent the sequence as a matrix

Seq1 AGRWGIGAYHKLIMA
Seq2 AGRTGVGVYHKLIMA
Seq3 AGRWGIGVYHRLIMA
Seq4 AGRTGVGAYHRLIMA

becomes

       T  W  V  I  V  A  R  K
x      x1 x2 x3 x4 x5 x6 x7 x8
Seq1   0  1  0  1  0  1  0  1
Seq2   1  0  1  0  1  0  0  1
Seq3   0  1  0  1  1  0  1  0
Seq4   1  0  1  0  0  1  1  0

2. Measure the activity or activities of interest under the final application conditions

3. y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi
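
The three steps above can be sketched in a few lines. The sequences and the column order (T W V I V A R K) are taken from the slide; the coefficient values are hypothetical placeholders, since the fitted weights are not given in the talk.

```python
# Sequences and column order come from the slide; the coefficients
# further down are hypothetical placeholders.
seqs = {
    "Seq1": "AGRWGIGAYHKLIMA",
    "Seq2": "AGRTGVGVYHKLIMA",
    "Seq3": "AGRWGIGVYHRLIMA",
    "Seq4": "AGRTGVGAYHRLIMA",
}

# (0-based position, residue) for columns x1..x8: T W V I V A R K
features = [(3, "T"), (3, "W"), (5, "V"), (5, "I"),
            (7, "V"), (7, "A"), (10, "R"), (10, "K")]


def encode(seq):
    """Binary vector: 1 where the sequence carries the residue of that column."""
    return [1 if seq[pos] == aa else 0 for pos, aa in features]


def predict(x, coeffs):
    """Linear model y = c1*x1 + c2*x2 + ... + ci*xi."""
    return sum(c * xi for c, xi in zip(coeffs, x))


X = {name: encode(seq) for name, seq in seqs.items()}
for name, row in X.items():
    print(name, row)

coeffs = [0.0, 0.4, 0.0, 0.2, 0.0, -0.1, 0.3, 0.0]  # hypothetical weights
print(predict(X["Seq1"], coeffs))
```

Note that the two columns for any one site always sum to one, so the columns are linearly dependent; this is one reason a regularised fit is a natural choice for this encoding.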

Assessing the Proteinase K Sequence-Activity Relationship

y = c1x1 + c2x2 + c3x3 + c4x4 + … + cixi

[Figure: predicted activity vs. measured activity]

Learning Methods

• Variety of regression methods
– Ridge Regression & Lasso
– SVM Regression & LPSVM Regression
– Matching Loss Regression & One-norm Matching Loss Regression
– Partial Least Squares Regression
– LPBoost Regression

• Use bagging to improve the prediction stability
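
To make one of these combinations concrete, here is a minimal pure-Python sketch of ridge regression with bagging. It is not the authors' implementation: the toy data, the penalty `lam`, the number of bags and the choice to average coefficients across bootstrap fits are all assumptions made for illustration.

```python
import random
from itertools import product


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Coefficients c minimising ||Xc - y||^2 + lam * ||c||^2."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def bagged_ridge(X, y, lam, n_bags, seed=0):
    """Average ridge coefficients over bootstrap resamples of the variants."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        c = ridge_fit([X[i] for i in idx], [y[i] for i in idx], lam)
        total = [t + ci for t, ci in zip(total, c)]
    return [t / n_bags for t in total]


# Toy data: three on/off substitutions with made-up true weights.
X = [list(bits) for bits in product([0, 1], repeat=3)]
true_c = [0.5, -0.2, 1.0]
y = [sum(c * x for c, x in zip(true_c, row)) for row in X]

c_full = ridge_fit(X, y, lam=0.01)
c_bag = bagged_ridge(X, y, lam=0.01, n_bags=50)
print(c_full)
print(c_bag)
```

For a linear model, averaging the coefficients of the bootstrap fits is equivalent to averaging their predictions, which is why the sketch returns a single coefficient vector.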

Variants Design I

• Main issue: Exploitation vs. Exploration

• Optimum design (Exploitation)
– Take the combination of substitutions predicted to have maximal activity
– Also consider
    • Substitution frequency in the dataset
    • Variation of weight estimation
– Used in 2nd & 3rd iterations
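
One way to read the optimum design rule: under an additive model, the predicted maximum is reached by including every substitution with a positive weight. The slides do not say how substitution frequency and weight variation enter the decision, so the thresholds `k` and `min_count` and all numbers below are assumptions for this sketch.

```python
from statistics import mean, stdev


def optimum_design(weight_samples, counts, k=1.0, min_count=2):
    """Pick substitutions whose weight is confidently positive.

    weight_samples[j] holds the weights estimated for substitution j across
    resampled fits; counts[j] is how often j occurs in the tested variants.
    Both the k * stdev margin and the min_count cut-off are assumptions.
    """
    chosen = []
    for j, samples in enumerate(weight_samples):
        lower = mean(samples) - k * stdev(samples)  # pessimistic weight estimate
        if lower > 0 and counts[j] >= min_count:
            chosen.append(j)
    return chosen


# Hypothetical weights from three resampled fits for four substitutions.
weight_samples = [
    [0.9, 1.1, 1.0],     # clearly beneficial
    [0.2, -0.3, 0.1],    # sign uncertain
    [0.5, 0.6, 0.4],     # beneficial but seen only once
    [-0.4, -0.5, -0.6],  # clearly harmful
]
counts = [10, 8, 1, 9]

selected = optimum_design(weight_samples, counts)
print(selected)
```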

Variants Design II

• Diversity design (Exploration)
– Calculate the combination of substitutions predicted to have maximal activity that is also
    • No more than 5 changes from a sequence that has already been tested
    • No closer than 3 changes from a sequence that has already been tested or selected for synthesis
– Used in 2nd iteration
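
The two distance rules can be sketched as a greedy search over all on/off combinations, which stays feasible for 19 switches. The greedy strategy, the six-switch example and the coefficient values are assumptions for illustration; the slides only state the constraints.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two variants differ."""
    return sum(x != y for x, y in zip(a, b))


def predicted(variant, coeffs):
    """Additive model: sum of the weights of the substitutions present."""
    return sum(c * x for c, x in zip(coeffs, variant))


def diversity_design(coeffs, tested, n_new, max_from_tested=5, min_apart=3):
    """Greedily pick high-scoring variants that satisfy both distance rules."""
    candidates = sorted(
        product([0, 1], repeat=len(coeffs)),
        key=lambda v: predicted(v, coeffs),
        reverse=True,
    )
    selected = []
    for v in candidates:
        if len(selected) == n_new:
            break
        # Rule 1: within max_from_tested changes of some tested sequence.
        near_tested = min(hamming(v, t) for t in tested) <= max_from_tested
        # Rule 2: at least min_apart changes from everything tested or selected.
        far_enough = all(hamming(v, u) >= min_apart for u in tested + selected)
        if near_tested and far_enough:
            selected.append(v)
    return selected


coeffs = [1.0, 0.8, 0.6, -0.5, 0.3, -0.2]  # hypothetical model weights
tested = [(0, 0, 0, 0, 0, 0)]               # only the starting sequence tested
picks = diversity_design(coeffs, tested, n_new=2)
print(picks)
```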

Three Iterations of Activity Engineering

[Figure: variants in order synthesized, with wild-type marked]

1st set: 34 variants

2nd set: 24 variants

3rd set: 38 variants

Only 58 variants were tested to allow design of the fourth set, which contained

• 3 variants 20-30x improved over wild-type

• 50% of variants more active than the best of previous sets

• 70% of variants more active than wild-type

• 3-11 changes found in variants better than WT

Activity Improvement

[Figure: activity (pmol/s/ml) and half life at 68°C (s) for variants v501, v502, v503, v505, v513, v515, v518, v526, v544, v545, v551, v556, v557, v558, v560 and NS9]

pos  107 123 132 145 151 167 180 194 199 208 237 265 267 273 293 310 332 337 355
WT     S   S   I   M   Y   V   L   Y   A   K   R   P   V   S   G   I   K   S   P

Substitutions in each variant, in position order (column alignment not recoverable):

501: A A
502: A H A
503: A H A R
505: A H T A R N
513: A I H T A R N
515: V A A
518: V A I I T A
526: A V A I T A
544: V A H T A N
545: V A H T A R N
551: A T A
556: V A I T A
557: A H A R N
558: V A H T A R
560: A V A I H T A N

Variants are Improved in Multiple Properties

Conclusions

• Machine learning
– Making a very small number of variants (58) allows a productive search of a total space with about 500,000 possible combinations

• Synthetic Biology
– Recent advances in gene synthesis methods were essential for this type of exploration

The Future

• Proteins are the building blocks of life with a wide array of applications (therapeutics, diagnostics, industrial catalysts)

• Finding a reliable mechanism for optimizing proteins for human applications would be an amazing feat

• We steal ideas about how proteins evolve from nature, but optimize proteins outside their in vivo constraints (the proteins don’t have to be compatible with life)
