CHAPTER THREE

Designing Genes for Successful Protein Expression

Mark Welch, Alan Villalobos, Claes Gustafsson, and Jeremy Minshull

DNA2.0, Inc., Suite A, Menlo Park, California, USA

Methods in Enzymology, Volume 498. © 2011 Elsevier Inc. All rights reserved. ISSN 0076-6879, DOI: 10.1016/B978-0-12-385120-8.00003-6

Contents

1. Introduction
2. Gene Design Software
3. General Sequence Parameters Affecting Protein Expression
3.1. Initiation of translation
3.2. Codon bias
3.3. mRNA structure and translational elongation
4. Protein-Specific Factors Providing Additional Complexity
4.1. Protein toxicity
4.2. Transmembrane proteins
4.3. cis-Regulatory regions
5. Conclusions
References

Abstract

DNA sequences are now far more readily available in silico than as physical DNA. De novo gene synthesis is an increasingly cost-effective method for building genetic constructs, and effectively removes the constraint of basing constructs on extant sequences. This allows scientists and engineers to experimentally test their hypotheses relating sequence to function. Molecular biologists, and now synthetic biologists, are characterizing and cataloging genetic elements with specific functions, aiming to combine them to perform complex functions. However, the most common purpose of synthetic genes is for the expression of an encoded protein.

The huge number of different proteins makes it impossible to characterize and catalog each functional gene. Instead, it is necessary to abstract design principles from experimental data: data that can be generated by making predictions followed by synthesizing sequences to test those predictions. Because of the degeneracy of the genetic code, design of gene sequences to encode proteins is a high-dimensional problem, so there is no single simple formula to guarantee success. Nevertheless, there are several straightforward steps that can be taken to greatly increase the probability that a designed sequence will result in expression of the encoded protein.

In this chapter, we discuss gene sequence parameters that are important for protein expression. We also describe algorithms for optimizing these parameters, and troubleshooting procedures that can be helpful when initial attempts fail. Finally, we show how many of these methods can be accomplished using the synthetic biology software tool Gene Designer.

1. Introduction

A major objective of synthetic biology is to characterize biological components with sufficient precision to enable these components to be combined to produce predictable outcomes. Progress has been made in defining functional parameters for some elements. Particularly, those with regulatory functions that act to control transcription (promoters, operators, repressors, and activators) are now reasonably well characterized (see http://www.partsregistry.org; Lisser and Margalit, 1993; Peccoud et al., 2008). However, reaching the ultimate targets of synthetic biology projects will require the balanced control of both transcription and translation in order to achieve controlled protein expression, whether those targets are engineered pathways for producing metabolites, remodeled photosynthesis, or trees that can turn into houses. Proteins are not necessarily the components of regulatory networks; they may also be catalysts that interact with cellular metabolism, structural parts of the cell, or therapeutically active compounds. Unfortunately, understanding transcriptional regulation is not sufficient to provide control of protein production.

The characterization of sequences governing translation has proved challenging. This is largely because translational determinants interact with, or are embedded within, the sequences that encode the polypeptide. Consequently, there is not yet a perfectly robust way to convert a virtual amino acid sequence to a DNA sequence that will, when introduced into a desired host cell, yield sufficient protein for a specific downstream application. Here, we describe recently developed tools and technologies for gene design, and discuss the heuristic basis of our understanding of particularly important design features.

Translation can be controlled at the level of initiation and elongation. Initiation of translation is primarily dependent on the sequence of the ribosome binding site (RBS) and early mRNA secondary structure (Allert et al., 2010; Kudla et al., 2009; Salis et al., 2009). Other determinants of protein expression are less well understood but equally potent. Different proteins expressed from the same promoter with the same RBS or 5′ untranslated region (UTR) may be expressed at wildly different levels. Even different ways of encoding the same protein, under otherwise identical conditions, can result in protein concentrations differing by 100-fold (Allert et al., 2010; Kudla et al., 2009; Welch et al., 2009b). Understanding these determinants would greatly enhance our ability to express proteins at specific desired levels. In the best case, we could hope to use the control they offer. At the least, it would be helpful if we could eliminate them so that we could rely on the controls we do understand. Experimental data on the influence of gene design on heterologous expression are rapidly growing, and design algorithms derived from these experiments provide both an increased probability of success in individual projects and a starting point for further experimentation.

2. Gene Design Software

Backtranslation from a polypeptide sequence to obtain a DNA sequence requires choosing between an enormous number of possibilities (Welch et al., 2009b). We use the backtranslation module of Gene Designer, a free software tool (www.dna20.com/genedesigner2), to select sequences with specific design characteristics. Backtranslation parameters can be altered by selecting backtranslation profiles from the Configure menu in the Project Window (see Fig. 3.1). These parameters will be discussed in more detail in the following sections.
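To give a sense of scale for the number of possible backtranslations, the count of synonymous DNA sequences for a polypeptide is the product of the codon degeneracies of its residues in the standard genetic code. The short Python sketch below computes that product; the function name and the example peptide are illustrative only and are not part of Gene Designer.

```python
from math import log10, prod

# Number of synonymous codons per amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Return how many distinct DNA sequences encode the given protein."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


if __name__ == "__main__":
    peptide = "MKRLS"  # hypothetical five-residue example
    print(peptide, count_encodings(peptide))
    # For longer proteins the count is easier to read on a log scale.
    longer = peptide * 40  # 200 residues, illustration only
    print(len(longer), "residues: about 10^%d encodings" % log10(count_encodings(longer)))
```

Even a protein of a few hundred residues has far more encodings than could ever be enumerated, which is why a search strategy is needed rather than exhaustive evaluation.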

3. General Sequence Parameters Affecting Protein Expression

Evidence that recoding a gene can radically change its expression has been accumulating over the past two decades (Gustafsson et al., 2004; Welch et al., 2009b). However, it is only in the last year or two that experiments have compared the expression of many different individual genes encoding the same protein. These experiments are finally allowing hypotheses about the causes of expression differences to be tested.

3.1. Initiation of translation

A key component affecting initiation of translation in prokaryotes is the RBS that occurs between 5 and 15 bases upstream of the open reading frame (ORF) AUG start codon. Binding of the ribosome to the Shine–Dalgarno (SD) sequence within the RBS localizes the ribosome to the initiation codon. This binding is primarily due to direct base pairing with the anti-SD region of the 16S rRNA of the small ribosome subunit and can be greatly influenced by context (Komarova et al., 2005; Lee et al., 1996; Shultzaberger et al., 2001; Vimberg et al., 2007). Changes in RBS sequences can change expression levels over more than three orders of magnitude. Affinity of the RBS for the ribosome is a critical factor controlling the efficiency with which new polypeptide chains are initiated. This interaction is in competition with possible base-pairing interactions involving the RBS region that may form within the mRNA itself. Thus, SD sequences with weaker base pairing to the ribosome are more susceptible to interference from mRNA structure. However, some experiments suggest that SD sequences with too strong affinity can be deleterious, particularly at lower temperatures, by stalling initial elongation (Komarova et al., 2002; Vimberg et al., 2007). Also critical is the distance between the RBS and the start codon, with 5–7 bases from the consensus SD AGGAGG being optimal (Chen et al., 1994). Models that factor competition between the anti-SD and mRNA structure as well as start codon spacing have been shown to approximate actual translation initiation rates (de Smit and van Duin, 2003; Na et al., 2010; Salis et al., 2009).

Figure 3.1 Project Window, My Backtranslation Profiles, and Backtranslation Profile Editor. From the Configure Menu, choose Backtranslation Profiles to open My Backtranslation Profiles. Then select a profile and click edit (pencil icon) or double click on a profile to open the Backtranslation Editor. In the editor, you can change parameters related to the genetic algorithm, codon usage, sequences to avoid, 5′ structure, repeats, and homologous DNA.
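The spacing rule for the SD sequence and start codon can be checked mechanically. The following Python sketch locates the consensus AGGAGG motif nearest the start codon and reports the number of bases separating them. It makes two assumptions that the text does not settle: spacing is counted as the bases lying between the last base of the motif and the first base of the start codon, and only an exact match to the consensus is searched for. The function names and the example sequence are hypothetical. This is not a model of initiation rate such as those cited above.

```python
def sd_spacing(mrna, start_codon_index, motif="AGGAGG"):
    """Bases between the nearest upstream SD motif and the start codon.

    Returns None when no exact copy of the motif lies upstream.
    Assumption: spacing counts the bases strictly between motif and start codon.
    """
    upstream = mrna[:start_codon_index].upper().replace("U", "T")
    pos = upstream.rfind(motif)  # closest copy of the motif to the start codon
    if pos == -1:
        return None
    return start_codon_index - (pos + len(motif))


def spacing_in_range(spacing, low=5, high=7):
    """True when the spacing falls in the 5-7 base range quoted in the text."""
    return spacing is not None and low <= spacing <= high


if __name__ == "__main__":
    # Hypothetical 5' region: SD motif, six spacer bases, then ATG.
    seq = "TTTAGGAGGACAGCTATGAAA"
    start = seq.find("ATG")
    spacing = sd_spacing(seq, start)
    print(spacing, spacing_in_range(spacing))
```

A check like this only flags spacing; it says nothing about SD affinity or competing mRNA structure, which the cited thermodynamic models treat explicitly.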

Much prior work has demonstrated that mRNA structures that occlude the region of the RBS and/or start codon in genes expressed in prokaryotes can impair expression (de Smit and van Duin, 1990, 1994; Griswold et al., 2003; Kozak, 1986; Kudla et al., 2009; Studer and Joseph, 2006). For this reason, gene design strategies often avoid such structures in choosing coding of the first several amino acids. Salis and coworkers have recently developed a thermodynamic model that captures competition between internal mRNA structures and the binding of the ribosome to the RBS (Salis et al., 2009). An alternative mathematical model of initiation has also been proposed based on similar considerations (Na et al., 2010). The Salis model is the basis of an online tool that can be used to design RBSs with modified rates of initiation of translation (http://www.voigtlab.ucsf.edu/software/). In its current stage of development, this tool is best suited for attenuating expression of an existing gene.

In eukaryotes, translation initiation is significantly different from that in prokaryotes, and multiple mechanisms have been characterized. Most initiation of translation from polymerase II-derived transcripts proceeds via recognition of the m7G cap at the 5′ terminus of the mRNA followed by scanning of the ribosome to the initiation codon, which is identified by proximity to the 5′-end and sequence context (Kozak, 1999, 2005; Pestova et al., 2001; Preiss and Hentze, 1999). Several factors are apparently involved in unwinding structure in the region from the cap to the start codon (Parsyan et al., 2009; Pisareva et al., 2008). Alternatively, initiation for some genes can occur via recognition of internal mRNA elements that recruit ribosomes to the message and direct them to the start codon (Berry et al., 2010; Gazo et al., 2004; Pestova et al., 2001).

Numerous lines of evidence suggest that the initial 15–25 codons of the ORF deserve special consideration in gene optimization (Allert et al., 2010; Chen and Inouye, 1994; Eyre-Walker and Bulmer, 1993; Gonzalez de Valdivia and Isaksson, 2004, 2005; Kudla et al., 2009; Stenstrom and Isaksson, 2002; Stenstrom et al., 2001a,b; Tuller et al., 2010). Studies have shown that the impact of rare codons on translation rate is particularly strong in these first codons, for expression in both Escherichia coli and Saccharomyces cerevisiae (Chen and Inouye, 1990, 1994; Hoekema et al., 1987). In E. coli, peptidyl-tRNA drop-off during translation of the initial codons appears to be accentuated by the presence of rare or NGG codons (Cruz-Vera et al., 2004; Gonzalez de Valdivia and Isaksson, 2004, 2005). These effects appear to be independent of local mRNA secondary structure. The impact of early rare codons may in some cases be suppressed by the overexpression of cognate tRNAs; however, such a strategy does not suppress the effect of NGG codons (Gonzalez de Valdivia and Isaksson, 2005). It is also true that expression may be recovered by 5′ sequence replacement even for sequences that do not show especially strong mRNA structure or contain rare codons or other obvious deleterious elements in this region (Welch et al., 2009a).
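A design can be screened for the early-codon features discussed above before it is synthesized. The Python sketch below flags rare codons and NGG codons within the first codons of an ORF. The rare-codon set is a placeholder that the user would replace with one appropriate to the host, the default cutoff of 15 codons is one choice within the 15–25 range mentioned in the text, and the function name is hypothetical.

```python
# Placeholder set of rare codons; supply a host-specific set in practice.
EXAMPLE_RARE_CODONS = {"AGG", "AGA", "CTA", "ATA"}


def flag_early_codons(orf, rare_codons, n_codons=15):
    """List (codon_number, codon, reasons) for rare or NGG codons near the 5' end.

    Codon numbers are 1-based. Note that TGG (Trp) is an NGG codon with no
    synonymous alternative, so it can be flagged but not recoded.
    """
    orf = orf.upper().replace("U", "T")
    flagged = []
    for i in range(0, min(len(orf), 3 * n_codons) - 2, 3):
        codon = orf[i:i + 3]
        reasons = []
        if codon in rare_codons:
            reasons.append("rare")
        if codon.endswith("GG"):
            reasons.append("NGG")
        if reasons:
            flagged.append((i // 3 + 1, codon, reasons))
    return flagged


if __name__ == "__main__":
    # Hypothetical ORF fragment: ATG AGG AAA CGG CTA
    for hit in flag_early_codons("ATGAGGAAACGGCTA", EXAMPLE_RARE_CODONS):
        print(hit)
```

Flagged positions are candidates for synonymous substitution, subject to the competing goal of keeping 5′ mRNA structure low.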

3.1.1. Avoiding mRNA structure in gene design

Backtranslation in Gene Designer allows special treatment of the 5′-end of the mRNA, with the goal of reducing secondary RNA structure. The user is able to define multiple structure identification strategies. Each strategy is weighted for fitness scoring in the genetic algorithm. To configure each strategy, the user can define the search window (in base pairs), minimum stem size, minimum loop size, maximum loop size, and the scoring weight. During backtranslation, Gene Designer uses a sliding window technique to evaluate all possible single loop structures within the constraints given by the strategy.
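The kind of sliding-window search described above can be illustrated with a small Python sketch. It is not Gene Designer's implementation, whose details are not given here; it is a minimal stand-in that takes the same kinds of parameters (window, minimum stem, minimum and maximum loop, weight) and makes simplifying assumptions: only perfect Watson–Crick stems are counted, G–U pairs and thermodynamics are ignored, and the penalty is simply the weight multiplied by the number of stem-loops found. All function names are hypothetical.

```python
# Complement table for RNA; only perfect Watson-Crick pairs are considered.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    return rna.translate(COMPLEMENT)[::-1]


def find_hairpins(seq, window=30, min_stem=4, min_loop=3, max_loop=8):
    """Return sorted (start, stem_length, loop_length) for single stem-loops.

    Stems of exactly min_stem bases are reported; any longer perfect stem
    contains such a core, so it is still detected (possibly more than once).
    """
    rna = seq.upper().replace("T", "U")
    hits = set()
    last_start = max(0, len(rna) - window)
    for start in range(last_start + 1):          # slide the window along the mRNA
        end = min(len(rna), start + window)
        for i in range(start, end):              # first base of the 5' arm
            for loop in range(min_loop, max_loop + 1):
                j = i + min_stem + loop          # first base of the 3' arm
                if j + min_stem > end:
                    break                        # hairpin would leave the window
                if rna[i:i + min_stem] == reverse_complement(rna[j:j + min_stem]):
                    hits.add((i, min_stem, loop))
    return sorted(hits)


def structure_penalty(seq, weight=1.0, **strategy):
    """Weighted count of stem-loops, usable as one term of a fitness score."""
    return weight * len(find_hairpins(seq, **strategy))


if __name__ == "__main__":
    example = "GGGGAAAACCCC"  # hypothetical: 4-bp stem closing a 4-base loop
    print(find_hairpins(example))
    print(structure_penalty(example, weight=2.0))
```

Several strategies with different windows and stem sizes could be evaluated in turn and their penalties summed, mirroring the weighted multi-strategy scoring the text describes.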

One challenge in trying to both minimize 5′ structure and match a codon bias is that the two often pull designs in opposite directions. To mitigate this conflict, it can be helpful to first minimize the 5′ structure, and then create the remainder of the gene to give an overall match to the desired codon bias. It can sometimes be more difficult to minimize structure without using undesired codons in the important early coding region.

3.1.2. N-terminal tags to improve expression

Making N-terminal fusions can be a way to improve the expression of recalcitrant proteins either by displacing mRNA structure from the initiation region or by improving the physical integrity of the protein (Hammarstrom et al., 2002; Korepanova et al., 2007; Smyth et al., 2003). Some useful fusion tags are loaded into the Gene Designer Library. They can be added to the N-terminus of a protein by dragging them from the Library to a position in front of the coding region of the protein (see Fig. 3.2). Because the sequences can also be edited, the original N-terminal methionine may be removed if desired.

As an example, the mRNA encoding one particular protein ("ProtA") was prone to form a very strong hairpin in the first 15 codons of the ORF. No coding could be found to remove strong predicted structure in this region. A codon-optimized version of the gene showed weak full-length expression from either of two 5′-end codings designed to minimize mRNA structure. One coding gave no detectable expression (not shown), whereas the other gave weak yield of full-length protein along with a more significant level of a truncated product, perhaps due to internal initiation or protein degradation. However, displacement of the initial sequence by an N-terminal fusion to maltose binding protein (MBP; Korepanova et al., 2007; Smyth et al., 2003) greatly improved expression (see Fig. 3.3). Improved full-length expression was also seen when an 18-codon phage gIII secretion leader sequence was added to the 5′-end. Although it is tempting to interpret these results as meaning that a limiting 5′ sequence was replaced with nonlimiting ones, we have observed the effect of fusions to be highly gene dependent. In the case of another gene ("ProtB"), adding the same gIII coding sequence proved to lower expression significantly, well below that of gIII_ProtA (see Fig. 3.3). This discrepancy is not explained by predicted local mRNA structure. Neither the original nor gIII-fused versions of ProtB genes have strong predicted structure in the RBS and initial coding regions of the mRNA and both show significantly less structure than the higher expressing gIII_ProtA. It remains to be determined why such conditional effects are observed. Clearly 5′ replacement can be a useful tool to improve gene expression in some cases, but much is still to be learned about the interdependence of the 5′ region and downstream sequence or other protein characteristics.

Figure 3.2 Library Explorer and Project Window showing Sequence View. To edit the sequence of an element in sequence view, simply select the DNA region of the element in question and click on the Edit link button below the DNA strands.

3.2. Codon bias

Each amino acid is encoded by as few as one (methionine and tryptophan) to as many as six codons (arginine, leucine, and serine) in the canonical genetic code. Different organisms use synonymous codons with different apparent preferences. This is exemplified in the wide range of G + C content found in bacterial coding sequences that use G + C in the third position of codons as low as approximately 10% (e.g., Buchnera sp.) to as high as approximately 90% (e.g., Streptomyces sp.; Sharp et al., 2005). Further, significant bias in codon usage exists between the complete transcriptome and genes that are highly expressed in some organisms (Sharp et al., 1988). The reasons for these differences have been the subject of considerable speculation (Akashi, 2001; Akashi and Gojobori, 2002; Eyre-Walker, 1996; Eyre-Walker and Bulmer, 1993, 1995; Holm, 1986; Knight et al., 2001; Marquez et al., 2005; Rocha, 2004; Suzuki et al., 2008; Yang and Nielsen, 2008). Initial gene designs were guided by host codon bias, a reasonable approach given that the abundance of cognate tRNAs is generally correlated to codon usage frequency (Bulmer, 1987; Dong et al., 1996; Kanaya et al., 2001).

Figure 3.3 Impact of N-terminal fusions on expression of ProtA and ProtB in E. coli. Left panel: PAGE analysis of ProtA and its N-terminal fusions with gIII and MBP. C, control with empty vector. Numbers to left indicate positions of MW standards (kDa). Right panel: PAGE showing expression for uninduced (−) and induced (+) cultures of ProtB and gIII-fused ProtB.
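The third-position G + C content used above to contrast organisms is straightforward to compute for any coding sequence. A minimal Python sketch follows; the function name and example sequence are illustrative.

```python
def gc3(orf):
    """Fraction of codons whose third base is G or C."""
    orf = orf.upper()
    third_bases = [orf[i + 2] for i in range(0, len(orf) - 2, 3)]
    if not third_bases:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)


if __name__ == "__main__":
    # Hypothetical fragment ATG GCC AAA: third bases are G, C, A.
    print(round(gc3("ATGGCCAAA"), 3))
```

Applied across all ORFs of a genome, the same calculation gives the kind of summary statistic quoted for Buchnera and Streptomyces.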

3.2.1. Approximating the host codon bias

There are two intuitively sensible ways in which host codon use frequencies can be adapted for gene design. The first is to select the codon that is used most often for each amino acid, either among the entire transcriptome or that for the most highly expressed genes, and use that exclusively within the design. Genes preferring such codons are often referred to as more "adapted" for expression (Sharp and Li, 1987). The underlying and unproven assumption is that the most common codon corresponds to the highest translational efficiency in heterologous expression. This method has many potential drawbacks. If only one codon is used to encode each amino acid, there is only a single possible DNA sequence with which to encode a specific protein. This eliminates any flexibility in other design criteria such as the elimination or incorporation of restriction sites, repetitive elements within the sequence which can compromise stability, or sequences that could form structures at or around the site of translational initiation. Overuse of particular codons may also result in significant amino acid misincorporation (Kurland and Gallant, 1996), which might compromise the function of the protein. Most importantly, however, such codon usage may not be optimal for expression. Instead, there is ample empirical evidence that the use of common codons is not correlated with high protein expression (Kudla et al., 2009; Welch et al., 2009a) and evidence that in some cases it may be detrimental (Maertens et al., 2010; Welch et al., 2009a; see Fig. 3.4).
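The loss of design freedom under the "one codon per amino acid" rule can be seen in a few lines of Python. In the sketch below, the usage table holds made-up frequencies for four amino acids, chosen purely for illustration and not taken from any host. With this table, a Glu-Phe dipeptide is always encoded as GAATTC, the EcoRI recognition site, and the rule offers no alternative coding to remove it.

```python
# Hypothetical codon frequencies (illustration only, not host data).
USAGE = {
    "M": {"ATG": 1.0},
    "E": {"GAA": 0.70, "GAG": 0.30},
    "F": {"TTC": 0.60, "TTT": 0.40},
    "K": {"AAA": 0.75, "AAG": 0.25},
}


def one_codon_per_aa(protein, usage):
    """Encode each residue with its single most frequent codon."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in protein)


if __name__ == "__main__":
    gene = one_codon_per_aa("MEFK", USAGE)
    print(gene)
    # Only one sequence is possible, so an unwanted site cannot be designed out.
    print("contains EcoRI site:", "GAATTC" in gene)
```

Any strategy that permits more than one codon per amino acid restores the freedom to satisfy such secondary constraints.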

The second way in which host codon frequencies can be used is to match the host codon frequencies in the designed gene. This can be done simply by choosing each codon with a probability that matches the host codon frequency. Although simple to implement, this approach has the limitation that probabilistic selection will sometimes result, by pure chance, in a gene design where the frequencies of some codons are quite far from those in the host. This skewing can be exacerbated by subsequent sequence modification steps, for example, if undesired sequence elements are removed by removing a codon in an undesired element and replacing it probabilistically.

Figure 3.4 Comparison of expression of genes coded using experimentally optimized codon usage ("Exp") or that preferring codons used at highest frequency in naturally highly expressed E. coli genomic genes ("HiCAI"). Panels: scFv1, Bl amylase, Taq Pol, and scFv C6.5; lanes in each panel: C, Exp, HiCAI. In each case shown, genes were expressed from a strong repressible promoter, either T5 or T7, carried on a high copy plasmid. Transformed BL21 cells were cultured in Luria broth at 37 °C until mid-log growth (OD at 600 nm ≈ 0.6). Expression was induced by addition of IPTG to 1 mM, and cultures were incubated at 30 °C for 4 h. PAGE analysis was performed on normalized amounts of total culture protein. Gels were stained using Sypro Ruby and imaged by UV fluorescence.
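The chance deviation that comes with probabilistic codon selection is easy to demonstrate. The Python sketch below samples leucine codons from a target distribution and reports the largest gap between observed and target frequencies for different numbers of leucine residues. The target frequencies are made up for illustration, and the helper names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical target frequencies for the six leucine codons (illustration only).
LEU_TARGET = {
    "CTG": 0.50, "TTA": 0.13, "TTG": 0.13,
    "CTT": 0.10, "CTC": 0.10, "CTA": 0.04,
}


def sample_codons(n, target, rng):
    """Choose n codons independently with probabilities given by target."""
    return rng.choices(list(target), weights=list(target.values()), k=n)


def max_deviation(codons, target):
    """Largest absolute gap between observed and target codon frequencies."""
    counts = Counter(codons)
    return max(abs(counts[c] / len(codons) - f) for c, f in target.items())


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    for n in (10, 100, 10000):
        codons = sample_codons(n, LEU_TARGET, rng)
        print(n, "leucines: max deviation", round(max_deviation(codons, LEU_TARGET), 3))
```

A protein with only a handful of residues of a given amino acid sits at the small-n end of this comparison, which is where a purely probabilistic design is most likely to stray from the intended bias.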

3.2.2. Experimental determination of an optimal E. coli codon bias

The ease with which genes can now be synthesized has allowed researchers to perform experiments that test previous assumptions regarding optimal codon bias for heterologous protein expression. Using sets of genes broadly varied in gene design features, Welch et al. found that variation in synonymous codon usage frequencies had a profound effect on the amount of protein produced in E. coli, independent of local 5′ sequence effects. Variation of at least two orders of magnitude in expression was seen due to substitution beyond the initial 15 codons of the ORF (Welch et al., 2009a). This variation was strongly correlated with the global codon usage frequencies of the genes, although the codon frequencies found in the highest expressed variants did not correspond to those found in the genome or in highly expressed endogenous genes of E. coli. Multivariate analysis showed that the frequencies of specific codons for about six amino acids could predict the observed differences in expression. It is not clear what the biochemical basis is for this correlation. It is possible that it reflects a physiological shock to the host cells as they attempt to synthesize large amounts of a single protein, biasing the consumption of the aminoacyl-tRNA population: most of the best codons for high expression are also those that are predicted to remain more highly charged under starvation conditions (Dittmar et al., 2005; Elf et al., 2003; Welch et al., 2009a).

Regardless of its biochemical basis, the effect of codon frequencies on expression is not limited to bacteria. Similar results have been obtained in yeast, plant, fungal, and mammalian hosts (Welch, unpublished data; see https://www.dna20.com/index.php?pageID=330). In all cases to date, expression is highly correlated with codon usage but does not show a general preference for use of codons used at highest frequency in the genome or in the highly expressed gene subset of the host. Much further research is needed to fully understand the nature of these effects; however, the observed correlations can already serve as the basis for more reliable design algorithms as well as providing direction for gene improvement strategies.

3.2.3. Designing genes using codon tables

Backtranslation in Gene Designer is performed in two steps. First, the design parameters are entered in the backtranslation profile (see Fig. 3.1). Codon bias tables corresponding to the ORFs from the genome of almost any organism can be downloaded easily. To import a table, select Import from the File menu, then Codon Table (see Fig. 3.5). A dialog box will appear where you can enter search criteria. DNA2.0's Web service will try to match your criteria with common and scientific names of species in its database. Once you have found the organism you are looking for, simply select it from the Results list, and click on the Import button. Gene Designer will proceed to download the codon table and show its new location in the Codon Table Library pane of the Library Explorer. One of these can then be loaded into a backtranslation profile by dragging it out of the Library Explorer and into the profile. The program will use the table probabilistically, but it can also be used to search iteratively within additional constraints to provide a solution that precisely matches the selected table.

Additional design criteria frequently include avoiding specific sequences such as restriction sites, internal RBSs, transcriptional terminators, and RNA splice sites. These sequences can be set by selecting Edit under "Sequences to Avoid." A dialog box with two panes will appear (see Fig. 3.6). The top pane contains the list of sequences to avoid for the backtranslation profile. The bottom pane contains lists corresponding to motifs and restriction sites. To add new sequences to the unwanted list, simply drag and drop them from the bottom pane to the top pane. To remove sequences from the unwanted list, you can drag them into the trash can on the right. It is also often desirable to avoid repeated sequences, both to simplify synthesis and to prevent genetic instability. The repeat size to avoid is set by a slider in the backtranslation profile editor. Backtranslation will avoid creating repeats within the ORF; it will also avoid creating a sequence within the ORF that occurs elsewhere within the construct.

Figure 3.5 Library Explorer, Project Window, and Import Codon Table dialog box. From the File menu, choose Import, then Codon Table. In the Import Codon Table dialog box, type in at least three characters as a search criterion. Once the desired table has been found, select it and then click on Import. The Library Explorer will open to reveal your newly imported codon table.

Figure 3.6 Managing Unwanted Sequences to avoid for a given backtranslation profile. Motifs for RBS and Shine–Dalgarno, and restriction sites EcoRI and HindIII are already added. XbaI is being dragged in. Sites and motifs may be added via drag and drop.
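The two screening criteria described above, sequences to avoid and repeats, can be sketched in Python as follows. This is an independent illustration, not the Gene Designer code. The recognition sequences for EcoRI (GAATTC), HindIII (AAGCTT), and XbaI (TCTAGA) are the standard ones; the function names, the example sequences, and the choice to report only exact direct repeats of a fixed length are assumptions made for brevity.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

SITES_TO_AVOID = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "XbaI": "TCTAGA"}


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_sites(dna, sites):
    """Return sorted (position, name) for each unwanted site on either strand."""
    dna = dna.upper()
    hits = []
    for name, site in sites.items():
        # A set avoids double counting palindromic sites.
        for pattern in {site, reverse_complement(site)}:
            pos = dna.find(pattern)
            while pos != -1:
                hits.append((pos, name))
                pos = dna.find(pattern, pos + 1)
    return sorted(hits)


def find_repeats(dna, size):
    """Map each subsequence of the given size that occurs more than once to its positions."""
    positions = {}
    for i in range(len(dna) - size + 1):
        positions.setdefault(dna[i:i + size], []).append(i)
    return {kmer: where for kmer, where in positions.items() if len(where) > 1}


if __name__ == "__main__":
    construct = "ATGGAATTCAAAGCTTTGA"  # hypothetical sequence with two sites
    print(find_sites(construct, SITES_TO_AVOID))
    print(find_repeats("ACGTACGTAA", 4))
```

In a design loop, each hit would contribute to a penalty so that candidate sequences containing unwanted sites or repeats are scored down rather than simply rejected.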

To address the challenge of searching through the large sequence space available for evaluation during backtranslation (Welch et al., 2009b), Gene Designer uses a genetic algorithm that helps to avoid getting trapped in suboptimal local minima. Initially, a population of sequences is generated by random selection of codons weighted on codon bias. Then, each individual (sequence) is evaluated against a set of criteria (occurrence of unwanted sequences, codon bias, occurrence of 5′ secondary structure, repetitiveness, homology with specified DNA), each criterion is weighted (see Fig. 3.1), and a score is summed for each individual. Then, new individuals are created by crossing individuals from the existing population and introducing random mutations. The new individuals are then also evaluated. Finally, the best individuals are kept for the next generation of offspring and the cycle continues. Parameters for the genetic algorithm such as the maximum number of generations, population size, and mutation rate can also be set within the backtranslation profile (see Fig. 3.1).
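The cycle described above (weighted random initialization, scoring, crossover, mutation, and selection) can be written as a toy genetic algorithm. The Python sketch below follows that outline but is not Gene Designer's algorithm: it scores only two of the listed criteria (unwanted sequences and deviation from the target codon bias), uses a made-up codon table and arbitrary weights, and requires a protein of at least two residues. All names and parameter defaults are hypothetical.

```python
import random

# Hypothetical codon frequencies and weights (illustration only).
USAGE = {
    "M": {"ATG": 1.0},
    "E": {"GAA": 0.70, "GAG": 0.30},
    "F": {"TTC": 0.60, "TTT": 0.40},
    "K": {"AAA": 0.75, "AAG": 0.25},
}
AVOID = ["GAATTC"]                      # EcoRI site
WEIGHTS = {"sites": 10.0, "bias": 1.0}  # arbitrary scoring weights


def random_gene(protein, rng):
    """Pick one codon per residue, weighted by the codon table."""
    return [rng.choices(list(USAGE[aa]), weights=list(USAGE[aa].values()))[0]
            for aa in protein]


def penalty(codons, protein):
    """Weighted sum of unwanted-site count and codon-bias deviation (lower is better)."""
    dna = "".join(codons)
    sites = sum(dna.count(site) for site in AVOID)
    bias = 0.0
    for aa in set(protein):
        idx = [i for i, residue in enumerate(protein) if residue == aa]
        for codon, target in USAGE[aa].items():
            observed = sum(codons[i] == codon for i in idx) / len(idx)
            bias += abs(observed - target)
    return WEIGHTS["sites"] * sites + WEIGHTS["bias"] * bias


def evolve(protein, generations=50, pop_size=30, mutation_rate=0.05, seed=0):
    """Return the best-scoring DNA sequence found for the protein."""
    rng = random.Random(seed)
    population = [random_gene(protein, rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            mother, father = rng.sample(population, 2)
            cut = rng.randrange(1, len(protein))       # crossover at a codon boundary
            child = mother[:cut] + father[cut:]
            child = [random_gene(aa, rng)[0] if rng.random() < mutation_rate else codon
                     for aa, codon in zip(protein, child)]  # synonymous mutations
            children.append(child)
        # Keep the best individuals from parents and offspring combined.
        population = sorted(population + children,
                            key=lambda gene: penalty(gene, protein))[:pop_size]
    return "".join(population[0])


if __name__ == "__main__":
    print(evolve("MEFKEFK"))
```

Further criteria such as 5′ structure or repeats would enter as additional weighted terms in the penalty function, and the generation count, population size, and mutation rate play the role of the corresponding profile settings.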

Although empirically optimal codon frequencies may not match the host's bias, tables derived from experimental data can also be loaded into a Gene Designer backtranslation profile, and are used by the program in the same way as one prepared by analysis of host sequences.

Genes for other proteins that are designed using data-driven tablesfrequently show similar improvements in expression even when predicted50 mRNA structure is suppressed. In head to head comparisons of genesusing experimentally optimized bias and those using a bias favoring codonsused most frequently in highly expressed host genes, the experimentallyderived bias showed significantly better average yield and consistency (seeTable 3.1 and Fig. 3.4). Among the genes listed in Table 3.1, both versionsfor Bl Amylase, Fs Cutinase, and scFv used identical sequences for the50-UTR and at least 47 bases into the ORF. Thus, differences observed inexpression are not due to initial coding effects or mRNA secondary struc-ture local to that region and must be due to substitutions outside theinitiation region. Among the others, where coding was varied betweenthe versions, no correlation was seen between predicted local mRNAstructure and expression. Clearly synonymous codon usage outside the 50’

Table 3.1 Comparison of genes coded using experimentally optimized codon usage (Exper. Opt) or that preferring codons used at highest frequency in naturally highly expressed E. coli genomic genes (HiCAI)

                  HiCAI               Exper. Opt
Protein        CAI^a   mg/ml^b     CAI^a   mg/ml^b     Exper/HiCAI
scFv C6.5      0.90    5           0.71    200         40
Bl Amylase     0.89    50          0.71    200         4
Fs Cutinase    0.89    200         0.71    130         0.7
MCherry        0.91    220         0.68    240         1.1
Taq Pol        0.91    50          0.69    200         4
scFv1          0.88    20          0.69    140         7
NR2B           0.97    5           0.64    100         20

a The gene codon adaptation index as defined by Sharp and Li (1987).
b Approximate expression level in one to three E. coli cultures 4 h after induction at 30 °C. See Fig. 3.4 legend for expression method details.
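For readers unfamiliar with the CAI values in Table 3.1, the index of Sharp and Li (1987) is the geometric mean, over the codons of a gene, of each codon's relative adaptiveness: its frequency in a reference set of highly expressed genes divided by the frequency of the most-used synonymous codon. The sketch below illustrates that calculation. The usage counts are hypothetical, not E. coli data; codons missing from the table (for example, those of single-codon amino acids) are simply skipped, and counts are assumed to be positive.

```python
import math


def relative_adaptiveness(usage):
    """usage: {amino_acid: {codon: count}} taken from a reference gene set.

    Returns w[codon] = count / (count of the most-used synonymous codon).
    """
    w = {}
    for codon_counts in usage.values():
        top = max(codon_counts.values())
        for codon, n in codon_counts.items():
            w[codon] = n / top
    return w


def cai(seq, w):
    """Geometric mean of relative adaptiveness over the scored codons of seq."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(w[c]) for c in codons if c in w]
    return math.exp(sum(logs) / len(logs))


# Hypothetical counts for two amino acids, for illustration only.
usage = {
    "K": {"AAA": 80, "AAG": 20},
    "E": {"GAA": 70, "GAG": 30},
}
w = relative_adaptiveness(usage)
print(round(cai("AAAGAA", w), 3))  # both codons are the most-used ones
print(round(cai("AAAAAG", w), 3))  # one preferred and one minor lysine codon
```

A gene built only from the most-used codons scores 1.0, which is why the HiCAI designs in Table 3.1 have values near 0.9 while the experimentally optimized designs score lower.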

region can have a dramatic effect on expression and simple increased use of high “codon adaptation” codons is not a reliable strategy to maximize translation efficiency.

Inherent to an experimental approach to optimization is that solutions are subject to the idiosyncrasies of the training set. Individual proteins may have different sensitivities to codon bias. Expression data from one protein might be less useful for guiding the design of a gene for a different protein sequence, particularly if the amino acid compositions of the two proteins are very different. For example, the expression levels of an alanine-rich protein limited by alanine codon usage but not by serine codon usage will not be helpful in choosing which serine codons to use in a second protein limited by serine codon usage. With more experimentation to determine preferences for a broad range of protein targets, general and protein-specific design rules should emerge.

3.3. mRNA structure and translational elongation

While much evidence suggests that mRNA structure can interfere with translational initiation in both prokaryotes and eukaryotes, the effects of structure on elongation are less well understood. This may in part be due to the intrinsic helicase activity of ribosomes, which allows translation through even very strong hairpins and may preclude many structures from limiting the translation rate in either prokaryotes (Takyar et al., 2005) or eukaryotes (Minshull and Hunt, 1986). Perhaps more importantly, mRNA structure is difficult to predict, particularly for actively translated messages which are in continuous flux between various folded and unfolded states. Some optimization strategies restrict structure analysis to local windows along the mRNA where structure could form between ribosomes, but it is not clear that such treatments accurately reflect structure in the context of the complete mRNA. The uncertainties in both the impact and the prediction of mRNA structure currently obscure a rational approach to mRNA structure optimization. Any practical consideration of mRNA structure in gene design will depend on further systematic experimentation to identify reliable principles.
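To make the idea of local-window analysis concrete, the following sketch slides a window along an mRNA and flags windows that contain a short stretch able to pair with its reverse complement further downstream. It is a deliberately crude stand-in for folding-energy prediction: the window size, step, stem length, and minimum loop length are arbitrary assumptions, only Watson-Crick pairs are considered (no G-U wobble), and, as the text cautions, a flagged window says little about structure in the complete, actively translated message.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return rna.translate(COMPLEMENT)[::-1]


def has_stem(window, stem=4, min_loop=3):
    """True if some `stem`-long stretch can pair with a downstream stretch,
    leaving at least `min_loop` unpaired bases between the two arms."""
    for i in range(len(window) - 2 * stem - min_loop + 1):
        arm = window[i:i + stem]
        if window.find(revcomp(arm), i + stem + min_loop) != -1:
            return True
    return False


def scan(rna, size=30, step=10, stem=4):
    """Return (start, flagged) for each window along the message."""
    last_start = max(len(rna) - size, 0)
    return [(start, has_stem(rna[start:start + size], stem))
            for start in range(0, last_start + 1, step)]


print(has_stem("GGGGAAAACCCC"))  # GGGG can pair with the downstream CCCC
print(scan("A" * 40))            # poly(A) contains no complementary arms
```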

4. Protein-Specific Factors Providing Additional Complexity

Quite often the target protein itself, due to properties of its structure or its activity, is a strong determinant of expression yield. The protein may be particularly unstable in the host, especially if it is poorly folded due to inherent instability, lack of sufficient prosthetic factors, or improper post-translational modification. Expression of the protein may be toxic to the cell, leading to instability of the expression vector or host suppression of protein synthesis. Expression of secreted and membrane proteins may be limited by mechanisms for directing these proteins to the membrane. It is even possible that the protein amino acid sequence may limit translational efficiency. For example, proline is thought to be slowly translated in E. coli, regardless of which codon is used (Pavlov et al., 2009). Proteins containing runs of prolines or high proline content may therefore be intrinsically more difficult to express without either altering their sequence (and thus probably their function) or without some serious tinkering with the process of translation in the host. There exists a growing list of strategies to circumvent protein-specific limitations, some of which are summarized below.
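A quick screen for the proline issue mentioned above might look like the following sketch, which reports the overall proline fraction and the position and length of each proline run in a protein sequence. The minimum run length of two is an arbitrary assumption; the text gives no threshold at which proline content becomes limiting.

```python
import re


def proline_report(protein, min_run=2):
    """Return (proline fraction, [(start index, run length), ...])."""
    pattern = re.compile(r"P{%d,}" % min_run)
    runs = [(m.start(), len(m.group())) for m in pattern.finditer(protein)]
    fraction = protein.count("P") / len(protein) if protein else 0.0
    return fraction, runs


# Made-up sequence for illustration.
fraction, runs = proline_report("MAPPPGKPLP")
print(fraction, runs)
```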

4.1. Protein toxicity

Quite often expression is limited by toxicity of the protein product or side products of attempted expression (Saida, 2007). Toxicity can greatly increase plasmid instability if gene expression is not tightly shut down during cell growth. A strongly repressed promoter and a host genetic background that promotes stability can be critical for very toxic genes. Upon induction, high toxicity may lead to a shutdown of protein expression. Optimal expression may require conditions where toxicity is mitigated. A common strategy to reduce toxicity is to lower expression to tolerable levels. Promoters varied in strength can be valuable tools for finding an optimal expression rate for maximal yield.

As one example, we observed toxicity in trying to express periplasm-directed heavy and light chains of a FAB antibody fragment in E. coli. Use of strong T5 and T7 promoters resulted in only poor yields of product, which was not efficiently directed to the periplasm. Lowered expression by use of a lac promoter reduced toxicity upon induction and increased both final yield and efficiency of secretion to the periplasm. Indeed, most accumulation in the periplasm from the lac-driven constructs appeared to occur prior to induction from this system, which showed measurable amounts of non-induced expression. The lowered expression of the uninduced lac promoter in our constructs perhaps allowed efficient transport to the periplasm without accumulation of toxic levels in the cytoplasm.

One potential way to avoid toxicity of some proteins is to direct expression to the periplasm or media. This may be accomplished by N-terminal fusion of a secretion signal sequence. Many such sequences have been described for secretion from a wide range of prokaryotic and eukaryotic host cells (Baneyx, 1999; Brake et al., 1984; Korepanova et al., 2009; Peroutka et al., 2008), and several are provided in Gene Designer (see Fig. 3.2). In the example shown in Fig. 3.7, fusion of a phage gIII signal sequence to one protein (“ProtC”) proved critical to obtain substantial expression yield. Attempts to mitigate the high toxicity of this protein by tight promoter control, lowered temperature, and MBP fusion were not successful. However, fusion to the gIII signal sequence reduced toxicity substantially and significant yield was obtained.

Figure 3.7 Expression of ProtC and gIII-fused ProtC in E. coli. Induced (+) and uninduced (−) cultures are shown for each variant.

Intentionally directing proteins to insoluble inclusion bodies using “insolubility tags,” such as fusion with ketosteroid isomerase (Park et al., 2008), may also avoid toxicity from the soluble form of proteins, though the general usefulness of this approach is limited to applications where insoluble protein is acceptable, for example, in raising antibodies, or where successful refolding strategies are known.

4.2. Transmembrane proteins

Transmembrane proteins can be particularly difficult to successfully express in heterologous hosts (Freigassner et al., 2009). Quite often such proteins are poorly directed to the membrane and often are toxic to the cell (Luo et al., 2009; Steffensen and Pedersen, 2006; Wagner et al., 2006, 2008). For both reasons, attenuated expression constructs may prove useful (Wagner et al., 2008). Lowered transcription (e.g., by use of a weaker promoter) may help to limit the expression rate to that of the membrane insertion capacity of the cell, avoiding accumulation of unfolded protein and potential indirect or direct toxic effects of overexpression.

4.2.1. Addition of solubility tags

If the transmembrane portion of the protein is not absolutely required for function of the protein (e.g., if the protein is a membrane-bound enzyme rather than a transporter), modifications or elimination of the anchor site can allow expression and function of the protein in a heterologous system. A good example of this was published in 2007 by Michelle Chang and colleagues (Chang et al., 2007; Craft et al., 2003). They showed that active expression was improved by replacing the N-terminal membrane anchor for a plant cytochrome P450 with several different sequences, including P450 sequences from yeast (Craft et al., 2003) or mammals (Barnes et al., 1991), bacterial secretion signals, or synthetic solubilization sequences (Roosild et al., 2005; Schafmeister et al., 1993; Schoch et al., 2003; Sueyoshi et al., 1995). These sequences are preloaded into Gene Designer (see Fig. 3.8).

4.3. cis-Regulatory regions

In some genes, particularly as exemplified in retroviral genes, regulation may be accomplished by sequence elements within the coding region itself (Kotsopoulou et al., 2000; Woltering and Duboule, 2009). One relatively simple way to eliminate such motifs is to perform backtranslation while maximizing the genetic distance from the sequence found in nature. Incorporating codon changes where possible will maximize the likelihood of disrupting any hidden or unknown elements within the mRNA sequence. This can be accomplished in Gene Designer by specifying a homologous DNA sequence for the Amino Acid Element in question. To do this, open the AA Element Properties dialog box by selecting an AA Element and clicking on the edit button (pencil icon) or double clicking on the AA Element. Once in the AA Element Properties dialog box (see Fig. 3.9), you can specify whether to aim for or avoid similarity with any given DNA sequence. To edit the homologous DNA, click on the Edit Homologous DNA button. Gene Designer will then allow you to enter the DNA, will translate said DNA into an amino acid sequence, and show the alignment of the translated sequence with the sequence of the AA Element you are editing. This alignment is used for calculating the Homologous DNA similarity score used during backtranslation.
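The idea of maximizing genetic distance from the natural sequence can be illustrated with a simple greedy recoding: for each codon of the native gene, pick the synonymous codon that differs from it at the most nucleotide positions. This is only a sketch under stated assumptions. The codon table is a small hypothetical subset, codon bias is ignored entirely, and Gene Designer itself treats homology as one weighted term within its genetic algorithm rather than applying a rule like this one.

```python
# Hypothetical, truncated synonymous-codon table for illustration.
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["CTG", "CTC", "TTA", "TTG"],
    "S": ["AGC", "TCT", "TCC", "AGT"],
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def most_distant_recoding(native_dna, codon_table):
    """Recode each native codon with its most dissimilar synonym."""
    codon_to_aa = {c: aa for aa, cs in codon_table.items() for c in cs}
    recoded = []
    for i in range(0, len(native_dna), 3):
        native = native_dna[i:i + 3]
        options = codon_table[codon_to_aa[native]]
        recoded.append(max(options, key=lambda c: hamming(c, native)))
    return "".join(recoded)


native = "ATGAGCCTGAAA"  # encodes M-S-L-K
recoded = most_distant_recoding(native, SYNONYMS)
print(recoded, hamming(native, recoded))
```

Amino acids with a single codon, such as methionine, cannot be changed, so the achievable distance depends on the composition of the protein.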

The genetic algorithm in Gene Designer used for backtranslation is dependent on the weights specified in the backtranslation profile (see Fig. 3.1). These weights are used as a means to sort out conflicting requirements. For example, maximizing sequence similarity with a homologous DNA sequence might conflict with avoidance of a motif that is present in the homologous DNA. During backtranslation, whichever weighted scores (i.e., Unwanted Sequence Avoidance vs. Homologous DNA Matching) contribute more to the overall score will have a stronger effect on the fitness of each individual of the population and therefore on the general search direction in sequence space.

Figure 3.8 Library Explorer with various fusion tag folders open. Elements from the library can be dragged out and into the Project Window to add them to a Design Construct.

At the end of backtranslation, Gene Designer asks if you would like to see a Backtranslation Summary Report. Here, you can verify if the sequences you wanted to avoid were truly avoided. The report is also available under the Reports menu.

Figure 3.9 Amino Acid Element Properties dialog box. Shown on the left, the properties box has just been opened, and the Amino Acid Element EMP-PD1’s properties can be changed from here. To edit this element’s homologous DNA, click on Edit Homologous DNA and another box will open, shown on the right. From here, you can specify DNA which will be translated into an amino acid sequence that will then be aligned with the Amino Acid Element’s sequence.
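Outside Gene Designer, the same verification can be done with a few lines of code. The sketch below lists every occurrence of each unwanted motif in a designed sequence. Checking the reverse strand as well is an assumption made here because restriction sites and similar motifs are usually relevant on either strand; the text does not state how the Backtranslation Summary Report performs its check.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_unwanted(dna, motifs):
    """Return (motif, strand, start) for every occurrence of each motif."""
    dna = dna.upper()
    hits = []
    for motif in motifs:
        forward = motif.upper()
        reverse = forward.translate(DNA_COMPLEMENT)[::-1]
        patterns = [("+", forward)]
        if reverse != forward:  # palindromic motifs need only one search
            patterns.append(("-", reverse))
        for strand, pattern in patterns:
            start = dna.find(pattern)
            while start != -1:
                hits.append((motif, strand, start))
                start = dna.find(pattern, start + 1)
    return hits


print(find_unwanted("AAGAATTCAA", ["GAATTC"]))  # palindromic site, one hit
print(find_unwanted("AAGAGACCAA", ["GGTCTC"]))  # found only on the reverse strand
print(find_unwanted("ATGAAACTG", ["GAATTC", "GGTCTC"]))  # empty list: avoided
```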

5. Conclusions

The promise of synthetic biology depends on understanding the behavior and interactions between genetic parts, and between those parts and the host system. While there has been great progress in identifying fundamental genetic elements and standardizing frameworks for the expression and control of genes, we still lack the ability to reliably rationally design genes for successful expression. This problem is in part one of not fully understanding the nature of the parts. Perhaps more significantly, it is difficult to predict the impact of protein-specific issues (folding, toxicity, etc.) which could greatly affect expression and optimal gene design. Tools such as Gene Designer that facilitate gene engineering will be essential for the development of reliable synthetic systems.

REFERENCES

Akashi, H. (2001). Gene expression and molecular evolution. Curr. Opin. Genet. Dev. 11, 660–666.

Akashi, H., and Gojobori, T. (2002). Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. USA 99, 3695–3700.

Allert, M., Cox, J. C., and Hellinga, H. W. (2010). Multifactorial determinants of protein expression in prokaryotic open reading frames. J. Mol. Biol. 402, 905–918.

Baneyx, F. (1999). Recombinant protein expression in Escherichia coli. Curr. Opin. Biotechnol. 10, 411–421.

Barnes, H. J., Arlotto, M. P., and Waterman, M. R. (1991). Expression and enzymatic activity of recombinant cytochrome P450 17 alpha-hydroxylase in Escherichia coli. Proc. Natl. Acad. Sci. USA 88, 5597–5601.

Berry, K. E., Waghray, S., and Doudna, J. A. (2010). The HCV IRES pseudoknot positions the initiation codon on the 40S ribosomal subunit. RNA 16, 1559–1569.

Brake, A. J., Merryweather, J. P., Coit, D. G., Heberlein, U. A., Masiarz, F. R., Mullenbach, G. T., Urdea, M. S., Valenzuela, P., and Barr, P. J. (1984). Alpha-factor-directed synthesis and secretion of mature foreign proteins in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 81, 4642–4646.

Bulmer, M. (1987). Coevolution of codon usage and transfer RNA abundance. Nature 325, 728–730.

Chang, M. C., Eachus, R. A., Trieu, W., Ro, D. K., and Keasling, J. D. (2007). Engineering Escherichia coli for production of functionalized terpenoids using plant P450s. Nat. Chem. Biol. 3, 274–277.

Chen, G., and Inouye, M. (1990). Suppression of the negative effect of minor arginine codons on gene expression; preferential usage of minor codons within the first 25 codons of the Escherichia coli genes. Nucleic Acids Res. 18, 1465–1473.

Chen, G. T., and Inouye, M. (1994). Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8, 2641–2652.

Chen, H., Bjerknes, M., Kumar, R., and Jay, E. (1994). Determination of the optimal aligned spacing between the Shine-Dalgarno sequence and the translation initiation codon of Escherichia coli mRNAs. Nucleic Acids Res. 22, 4953–4957.

Craft, D. L., Madduri, K. M., Eshoo, M., and Wilson, C. R. (2003). Identification and characterization of the CYP52 family of Candida tropicalis ATCC 20336, important for the conversion of fatty acids and alkanes to alpha, omega-dicarboxylic acids. Appl. Environ. Microbiol. 69, 5983–5991.

Cruz-Vera, L. R., Magos-Castro, M. A., Zamora-Romo, E., and Guarneros, G. (2004). Ribosome stalling and peptidyl-tRNA drop-off during translational delay at AGA codons. Nucleic Acids Res. 32, 4462–4468.

de Smit, M. H., and van Duin, J. (1990). Secondary structure of the ribosome binding site determines translational efficiency: A quantitative analysis. Proc. Natl. Acad. Sci. USA 87, 7668–7672.

de Smit, M. H., and van Duin, J. (1994). Control of translation by mRNA secondary structure in Escherichia coli. A quantitative analysis of literature data. J. Mol. Biol. 244, 144–150.

de Smit, M. H., and van Duin, J. (2003). Translational standby sites: How ribosomes may deal with the rapid folding kinetics of mRNA. J. Mol. Biol. 331, 737–743.

Dittmar, K. A., Sorensen, M. A., Elf, J., Ehrenberg, M., and Pan, T. (2005). Selective charging of tRNA isoacceptors induced by amino-acid starvation. EMBO Rep. 6, 151–157.

Dong, H., Nilsson, L., and Kurland, C. G. (1996). Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663.

Elf, J., Nilsson, D., Tenson, T., and Ehrenberg, M. (2003). Selective charging of tRNA isoacceptors explains patterns of codon usage. Science 300, 1718–1722.

Eyre-Walker, A. (1996). Synonymous codon bias is related to gene length in Escherichia coli: Selection for translational accuracy? Mol. Biol. Evol. 13, 864–872.

Eyre-Walker, A., and Bulmer, M. (1993). Reduced synonymous substitution rate at the start of enterobacterial genes. Nucleic Acids Res. 21, 4599–4603.

Eyre-Walker, A., and Bulmer, M. (1995). Synonymous substitution rates in enterobacteria. Genetics 140, 1407–1412.

Freigassner, M., Pichler, H., and Glieder, A. (2009). Tuning microbial hosts for membrane protein production. Microb. Cell Fact. 8, 69.

Gazo, B. M., Murphy, P., Gatchel, J. R., and Browning, K. S. (2004). A novel interaction of Cap-binding protein complexes eukaryotic initiation factor (eIF) 4F and eIF(iso)4F with a region in the 3′-untranslated region of satellite tobacco necrosis virus. J. Biol. Chem. 279, 13584–13592.

Gonzalez de Valdivia, E. I., and Isaksson, L. A. (2004). A codon window in mRNA downstream of the initiation codon where NGG codons give strongly reduced gene expression in Escherichia coli. Nucleic Acids Res. 32, 5198–5205.

Gonzalez de Valdivia, E., and Isaksson, L. A. (2005). Abortive translation caused by peptidyl-tRNA drop-off at NGG codons in the early coding region of mRNA. FEBS J. 272, 5306–5316.

Griswold, K. E., Mahmood, N. A., Iverson, B. L., and Georgiou, G. (2003). Effects of codon usage versus putative 5′-mRNA structure on the expression of Fusarium solani cutinase in the Escherichia coli cytoplasm. Protein Expr. Purif. 27, 134–142.

Gustafsson, C., Govindarajan, S., and Minshull, J. (2004). Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353.

Hammarstrom, M., Hellgren, N., van Den Berg, S., Berglund, H., and Hard, T. (2002). Rapid screening for improved solubility of small human proteins produced as fusion proteins in Escherichia coli. Protein Sci. 11, 313–321.

Hoekema, A., Kastelein, R. A., Vasser, M., and de Boer, H. A. (1987). Codon replacement in the PGK1 gene of Saccharomyces cerevisiae: Experimental approach to study the role of biased codon usage in gene expression. Mol. Cell. Biol. 7, 2914–2924.

Holm, L. (1986). Codon usage and gene expression. Nucleic Acids Res. 14, 3075–3087.

Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., and Ikemura, T. (2001). Codon usage and tRNA genes in eukaryotes: Correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53, 290–298.

Knight, R. D., Freeland, S. J., and Landweber, L. F. (2001). A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol. 2, RESEARCH0010.

Komarova, A. V., Tchufistova, L. S., Supina, E. V., and Boni, I. V. (2002). Protein S1 counteracts the inhibitory effect of the extended Shine-Dalgarno sequence on translation. RNA 8, 1137–1147.

Komarova, A. V., Tchufistova, L. S., Dreyfus, M., and Boni, I. V. (2005). AU-rich sequences within 5′ untranslated leaders enhance translation and stabilize mRNA in Escherichia coli. J. Bacteriol. 187, 1344–1349.

Korepanova, A., Moore, J. D., Nguyen, H. B., Hua, Y., Cross, T. A., and Gao, F. (2007). Expression of membrane proteins from Mycobacterium tuberculosis in Escherichia coli as fusions with maltose binding protein. Protein Expr. Purif. 53, 24–30.

Korepanova, A., Pereda-Lopez, A., Solomon, L. R., Walter, K. A., Lake, M. R., Bianchi, B. R., McDonald, H. A., Neelands, T. R., Shen, J., Matayoshi, E. D., Moreland, R. B., and Chiu, M. L. (2009). Expression and purification of human TRPV1 in baculovirus-infected insect cells for structural studies. Protein Expr. Purif. 65, 38–50.

Kotsopoulou, E., Kim, V. N., Kingsman, A. J., Kingsman, S. M., and Mitrophanous, K. A. (2000). A Rev-independent human immunodeficiency virus type 1 (HIV-1)-based vector that exploits a codon-optimized HIV-1 gag-pol gene. J. Virol. 74, 4839–4852.

Kozak, M. (1986). Influences of mRNA secondary structure on initiation by eukaryotic ribosomes. Proc. Natl. Acad. Sci. USA 83, 2850–2854.

Kozak, M. (1999). Initiation of translation in prokaryotes and eukaryotes. Gene 234, 187–208.

Kozak, M. (2005). Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene 361, 13–37.

Kudla, G., Murray, A. W., Tollervey, D., and Plotkin, J. B. (2009). Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258.

Kurland, C., and Gallant, J. (1996). Errors of heterologous protein expression. Curr. Opin. Biotechnol. 7, 489–493.

Lee, K., Holland-Staley, C. A., and Cunningham, P. R. (1996). Genetic analysis of the Shine-Dalgarno interaction: Selection of alternative functional mRNA-rRNA combinations. RNA 2, 1270–1285.

Lisser, S., and Margalit, H. (1993). Compilation of E. coli mRNA promoter sequences. Nucleic Acids Res. 21, 1507–1516.

Luo, J., Choulet, J., and Samuelson, J. C. (2009). Rational design of a fusion partner for membrane protein expression in E. coli. Protein Sci. 18, 1735–1744.

Maertens, B., Spriestersbach, A., von Groll, U., Roth, U., Kubicek, J., Gerrits, M., Graf, M., Liss, M., Daubert, D., Wagner, R., and Schafer, F. (2010). Gene optimization mechanisms: A multi-gene study reveals a high success rate of full-length human proteins expressed in Escherichia coli. Protein Sci. 19, 1312–1326.

Marquez, R., Smit, S., and Knight, R. (2005). Do universal codon-usage patterns minimize the effects of mutation and translation error? Genome Biol. 6, R91.

Minshull, J., and Hunt, T. (1986). The use of single-stranded DNA and RNase H to promote quantitative ‘hybrid arrest of translation’ of mRNA/DNA hybrids in reticulocyte lysate cell-free translations. Nucleic Acids Res. 14, 6433–6451.

Na, D., Lee, S., and Lee, D. (2010). Mathematical modeling of translation initiation for the estimation of its efficiency to computationally design mRNA sequences with desired expression levels in prokaryotes. BMC Syst. Biol. 4, 71.

Park, T. J., Choi, S. S., Gang, G. A., and Kim, Y. (2008). High-level expression and purification of the second transmembrane domain of wild-type and mutant human melanocortin-4 receptor for solid-state NMR structural studies. Protein Expr. Purif. 62, 139–145.

Parsyan, A., Shahbazian, D., Martineau, Y., Petroulakis, E., Alain, T., Larsson, O., Mathonnet, G., Tettweiler, G., Hellen, C. U., Pestova, T. V., Svitkin, Y. V., and Sonenberg, N. (2009). The helicase protein DHX29 promotes translation initiation, cell proliferation, and tumorigenesis. Proc. Natl. Acad. Sci. USA 106, 22217–22222.

Pavlov, M. Y., Watts, R. E., Tan, Z., Cornish, V. W., Ehrenberg, M., and Forster, A. C. (2009). Slow peptide bond formation by proline and other N-alkylamino acids in translation. Proc. Natl. Acad. Sci. USA 106, 50–54.

Peccoud, J., Blauvelt, M. F., Cai, Y., Cooper, K. L., Crasta, O., DeLalla, E. C., Evans, C., Folkerts, O., Lyons, B. M., Mane, S. P., Shelton, R., Sweede, M. A., et al. (2008). Targeted development of registries of biological parts. PLoS One 3, e2671.

Peroutka, R. J., Elshourbagy, N., Piech, T., and Butt, T. R. (2008). Enhanced protein expression in mammalian cells using engineered SUMO fusions: Secreted phospholipase A2. Protein Sci. 17, 1586–1595.

Pestova, T. V., Kolupaeva, V. G., Lomakin, I. B., Pilipenko, E. V., Shatsky, I. N., Agol, V. I., and Hellen, C. U. (2001). Molecular mechanisms of translation initiation in eukaryotes. Proc. Natl. Acad. Sci. USA 98, 7029–7036.

Pisareva, V. P., Pisarev, A. V., Komar, A. A., Hellen, C. U., and Pestova, T. V. (2008). Translation initiation on mammalian mRNAs with structured 5′UTRs requires DExH-box protein DHX29. Cell 135, 1237–1250.

Preiss, T., and Hentze, M. W. (1999). From factors to mechanisms: Translation and translational control in eukaryotes. Curr. Opin. Genet. Dev. 9, 515–521.

Rocha, E. P. (2004). Codon usage bias from tRNA’s point of view: Redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 14, 2279–2286.

Roosild, T. P., Greenwald, J., Vega, M., Castronovo, S., Riek, R., and Choe, S. (2005). NMR structure of Mistic, a membrane-integrating protein for membrane protein expression. Science 307, 1317–1321.

Saida, F. (2007). Overview on the expression of toxic gene products in Escherichia coli. Curr. Protoc. Protein Sci. Chapter 5, Unit 5 19.

Salis, H. M., Mirsky, E. A., and Voigt, C. A. (2009). Automated design of synthetic ribosome binding sites to control protein expression. Nat. Biotechnol. 27, 946–950.

Schafmeister, C. E., Miercke, L. J., and Stroud, R. M. (1993). Structure at 2.5 Å of a designed peptide that maintains solubility of membrane proteins. Science 262, 734–738.

Schoch, G. A., Attias, R., Belghazi, M., Dansette, P. M., and Werck-Reichhart, D. (2003). Engineering of a water-soluble plant cytochrome P450, CYP73A1, and NMR-based orientation of natural and alternate substrates in the active site. Plant Physiol. 133, 1198–1208.

Sharp, P. M., and Li, W. H. (1987). The codon Adaptation Index—A measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295.

Sharp, P. M., Cowe, E., Higgins, D. G., Shields, D. C., Wolfe, K. H., and Wright, F. (1988). Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: A review of the considerable within-species diversity. Nucleic Acids Res. 16, 8207–8211.

Sharp, P. M., Bailes, E., Grocock, R. J., Peden, J. F., and Sockett, R. E. (2005). Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 33, 1141–1153.

Shultzaberger, R. K., Bucheimer, R. E., Rudd, K. E., and Schneider, T. D. (2001). Anatomy of Escherichia coli ribosome binding sites. J. Mol. Biol. 313, 215–228.

Smyth, D. R., Mrozkiewicz, M. K., McGrath, W. J., Listwan, P., and Kobe, B. (2003). Crystal structures of fusion proteins with large-affinity tags. Protein Sci. 12, 1313–1322.

Steffensen, L., and Pedersen, P. A. (2006). Heterologous expression of membrane and soluble proteins derepresses GCN4 mRNA translation in the yeast Saccharomyces cerevisiae. Eukaryot. Cell 5, 248–261.

Stenstrom, C. M., and Isaksson, L. A. (2002). Influences on translation initiation and early elongation by the messenger RNA region flanking the initiation codon at the 3′ side. Gene 288, 1–8.

Stenstrom, C. M., Holmgren, E., and Isaksson, L. A. (2001a). Cooperative effects by the initiation codon and its flanking regions on translation initiation. Gene 273, 259–265.

Stenstrom, C. M., Jin, H., Major, L. L., Tate, W. P., and Isaksson, L. A. (2001b). Codon bias at the 3′-side of the initiation codon is correlated with translation initiation efficiency in Escherichia coli. Gene 263, 273–284.

Studer, S. M., and Joseph, S. (2006). Unfolding of mRNA secondary structure by the bacterial translation initiation complex. Mol. Cell 22, 105–115.

Sueyoshi, T., Park, L. J., Moore, R., Juvonen, R. O., and Negishi, M. (1995). Molecular engineering of microsomal P450 2a-4 to a stable, water-soluble enzyme. Arch. Biochem. Biophys. 322, 265–271.

Suzuki, H., Brown, C. J., Forney, L. J., and Top, E. M. (2008). Comparison of correspondence analysis methods for synonymous codon usage in bacteria. DNA Res. 15, 357–365.

Takyar, S., Hickerson, R. P., and Noller, H. F. (2005). mRNA helicase activity of the ribosome. Cell 120, 49–58.

Tuller, T., Waldman, Y. Y., Kupiec, M., and Ruppin, E. (2010). Translation efficiency is determined by both codon bias and folding energy. Proc. Natl. Acad. Sci. USA 107, 3645–3650.

Vimberg, V., Tats, A., Remm, M., and Tenson, T. (2007). Translation initiation region sequence preferences in Escherichia coli. BMC Mol. Biol. 8, 100.

Wagner, S., Bader, M. L., Drew, D., and de Gier, J. W. (2006). Rationalizing membrane protein overexpression. Trends Biotechnol. 24, 364–371.

Wagner, S., Klepsch, M. M., Schlegel, S., Appel, A., Draheim, R., Tarry, M., Hogbom, M., van Wijk, K. J., Slotboom, D. J., Persson, J. O., and de Gier, J. W. (2008). Tuning Escherichia coli for membrane protein overexpression. Proc. Natl. Acad. Sci. USA 105, 14371–14376.

Welch, M., Govindarajan, S., Ness, J. E., Villalobos, A., Gurney, A., Minshull, J., and Gustafsson, C. (2009a). Design parameters to control synthetic gene expression in Escherichia coli. PLoS One 4, e7002.

Welch, M., Villalobos, A., Gustafsson, C., and Minshull, J. (2009b). You’re one in a googol: Optimizing genes for protein expression. J. R. Soc. Interface 6(Suppl 4), S467–S476.

Woltering, J. M., and Duboule, D. (2009). Conserved elements within open reading frames of mammalian Hox genes. J. Biol. 8, 17.

Yang, Z., and Nielsen, R. (2008). Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol. Biol. Evol. 25, 568–579.