
Eindhoven University of Technology

MASTER

Analyzing ribosome processivity using EPT

Blewanus, R.

Award date: 2011


Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.

https://research.tue.nl/en/studentTheses/c50c441d-4c11-41c6-9745-9516ce13ef6d

Eindhoven University of Technology
Department of Mathematics and Computer Science

Master’s thesis

Analyzing ribosome processivity using EPT

by

Remco Blewanus

Supervisor: dr. E.P. de Vink

June 29, 2011

Abstract

Ribosome processivity is part of the fundamental biological process of translation. It pertains to the progress of ribosomes along the mRNA when constructing proteins. This progress is hard to predict as numerous factors can negatively influence the time required by a ribosome to construct a specific protein. It is possible to construct an elaborate model of translation, but a state-space explosion limits the practicality of techniques based on model checking.

Our research applies the Effective Process Time (EPT) approach to the process of translation. This makes it possible to analyze ribosome processivity without constructing an elaborate model. EPT is usually applied to logistic processes in order to obtain estimates for characteristics such as throughput and average delay. As the process of translation does not conform to the assumptions EPT imposes on logistic processes, we needed to make a number of adaptations before being able to use it to analyze ribosome processivity.

We conducted a series of experiments using our adapted EPT approach. These experiments provided insight into how the approach behaves in the context of translation. Furthermore, it was observed that EPT produces unreliable estimates for long genes. However, the error in the estimates produced by EPT is systematic. As a result, it is possible to correct the estimates returned by EPT, reducing their error to an acceptable margin with respect to reference values obtained through simulations.

Acknowledgements

This thesis is written as part of my master's project at Eindhoven University of Technology. The project is the final project of my studies in Computer Science & Engineering and it was performed at the Department of Mathematics and Computer Science. Performing the project gave me the opportunity to apply a wide range of knowledge learned throughout my years as a student. Furthermore, the project also led to insight into myself and contributed to my abilities as an engineer.

First I would like to thank my supervisor Erik de Vink for all his valuable support during the project. He helped me work through difficult problems and gave feedback on both my activities and the project itself. My gratitude also goes to Ivo Adan, who helped me with my questions regarding mathematics and probability theory. Furthermore, I am also grateful for the aid of Dragan Bosnacki during my preceding seminar.

My special thanks go out to my parents, who encouraged me throughout the years and gave me the opportunity to finish my studies. Finally, I would like to thank all my friends and especially Herman for all the pleasant conversations over the past few months.

Eindhoven, June 13, 2011

Remco Blewanus


Contents

1 Introduction

2 Biological background

3 Effective Process Time
3.1 Main concepts
3.2 Properties
3.3 Analyzing the arrival servers
3.4 Analyzing the departure servers

4 Modeling translation with EPT
4.1 Modelchecking
4.2 Adapting the EPT approach
4.3 Adapted algorithm

5 Experiments
5.1 Constructing input
5.2 Generic experiments
5.3 Pathway experiments
5.4 Further research

6 Conclusions

A Codon rates

B Pathway experiment results

C Adapted EPT algorithm pseudocode

D Mathematical operations

E Matlab code of the adapted EPT algorithm


Chapter 1

Introduction

This thesis presents the results of our research into analyzing ribosome processivity using EPT. Ribosome processivity is part of the biological process of translation. Translation occurs in every organism and is vital to the survival of each biological cell. In fact, the cell relies on translation in order to produce proteins that engage in chemical reactions which aid its development and growth. Without translation, a cell cannot utilize the genetic information that encodes the variety of tasks it needs to perform. Therefore, translation is one of the processes that are considered vital to life.

Being able to analyze ribosome processivity requires a thorough understanding of the process of translation. Fortunately, a detailed model of translation is available in [3]. This model details every reaction that is relevant to the process of translation. However, understanding the reactions themselves is not sufficient to analyze ribosome processivity. As translation is essentially a production process yielding proteins, it can suffer from a variety of problems that are similar to those encountered in production plants. These problems make it difficult to predict the time required to create proteins, which is exactly what we aim to do by analyzing ribosome processivity.

We aim to analyze ribosome processivity by making use of EPT. EPT is an approach that allows one to infer estimates for certain characteristics of logistic processes. The approach is usually applied to production lines, but the novelty of our research is that we apply it to the biological process of translation. However, the approach imposes several assumptions on the production lines it operates upon. Therefore, several adaptations need to be made in order to overcome these limitations. Once the approach is adapted to the context of translation, we apply it to several case studies involving real genetic data. The results of these case studies allow us to infer whether EPT can successfully be used to analyze ribosome processivity. Hence, the main research question of this thesis is as follows:

Question 1. Can the EPT approach be used to analyze ribosome processivity?

Furthermore, the case studies also serve to provide insight into the accuracy of the EPT approach. This is interesting in its own right, as we apply EPT in a way that was not originally intended.

Outline

This thesis starts by giving an overview of the biological background in Chapter 2. This overview introduces the process of translation and all aspects that are relevant to our analysis of ribosome processivity. Chapter 3 introduces the EPT approach that is used as a basis for our research. The adaptations to this initial EPT approach are presented in Chapter 4, after which it is applied in the case studies presented in Chapter 5. Finally, Chapter 6 presents the conclusions that are drawn from performing our research.


Chapter 2

Biological background

This chapter introduces the biological background of our research. Our research concerns modeling the biological process of translation of mRNA, which is one of the steps in the creation of proteins. Proteins carry out the basic tasks that are encoded in the genetic material of the biological cell. This genetic material consists of DNA that is contained in the nucleus of the cell. DNA is a long polymer that consists of repeated units, so-called nucleotides. There are four different kinds of nucleotides in total: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). A useful property of these nucleotides is that they can form bonds with each other in pairs: A can bind with T and vice versa, and C can bind with G and vice versa. This enables easy duplication of a strand by constructing a templated copy. The first templated copy of the DNA is constructed in a process called transcription, which can be seen in Figure 2.1. The templated copy is known as mRNA and is constructed by a specialized enzyme called RNA polymerase. Being a copy, the mRNA conveys the same information as the DNA. As a minor detail, it must be mentioned that in (m)RNA the nucleotide Thymine is replaced by Uracil (U).
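The pairing rules above can be illustrated with a minimal Python sketch. The function name `transcribe` and the example template strand are assumptions made for illustration only; the sketch ignores strand directionality and simply pairs each DNA nucleotide with its complement, using Uracil instead of Thymine in the resulting mRNA.

```python
# Complement of each DNA nucleotide in the mRNA copy (U takes the place of T).
PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template: str) -> str:
    """Return the mRNA strand templated on the given DNA strand."""
    return "".join(PAIRING[nucleotide] for nucleotide in template)


# Hypothetical template strand, chosen only for illustration.
mrna = transcribe("ACCTTTCTAAAG")
print(mrna)  # UGGAAAGAUUUC
```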

[Figure: DNA is transcribed into RNA, which is translated into protein.]

Figure 2.1: Transcription and translation inside the cell.

Each strand of mRNA is a blueprint for a specific protein. As proteins consist of chains of amino acids, a strand of mRNA encodes the particular amino acids making up the protein. This encoding is implemented by mapping groups of three nucleotides, called codons, to an amino acid. In total, there are 64 different codons. All but three special codons are mapped to specific amino acids and some amino acids are encoded by multiple codons. Special start and stop codons instruct the ribosomes where to start and stop reading the strand.
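As a small illustration of this encoding, the following Python sketch enumerates all codons and splits a strand into consecutive codons. The helper name `split_into_codons` is hypothetical; the example strand is the one depicted in Figure 2.2, and the three stop codons listed are those of the standard genetic code.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# Every ordered triple of nucleotides is a codon: 4 ** 3 = 64 in total.
ALL_CODONS = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]
# The three codons that are not mapped to an amino acid (standard genetic code).
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_into_codons(mrna: str) -> list[str]:
    """Split an mRNA strand into consecutive, non-overlapping codons."""
    if len(mrna) % 3 != 0:
        raise ValueError("strand length must be a multiple of three")
    return [mrna[i:i + 3] for i in range(0, len(mrna), 3)]


codons = split_into_codons("UGGAAAGAUUUC")
print(len(ALL_CODONS), codons)  # 64 ['UGG', 'AAA', 'GAU', 'UUC']
```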

A strand of mRNA can be thought of as a linear array of codons that encodes a particular protein. In order to construct a protein, ribosomes move along the strand and concatenate amino acids according to the codons in the strand. This process is called translation and is performed by ribosomes, which are special units inside the cytoplasm of the cell. Figure 2.1 shows an overview of this process. During translation, ribosomes always arrive at a specific end of the strand and progress along the strand towards the opposite end. The basic task of a ribosome is to match each codon in the strand with a so-called anticodon. An anticodon contains nucleotides that are complementary to the nucleotides of its respective codon, enabling an easy bond between the two groups of nucleotides. Anticodons are attached to aa-tRNAs, which are pieces of tRNA that reside in the cytoplasm of the cell. Each aa-tRNA has one anticodon, as well as an attached amino acid. This amino acid is specific to the anticodon carried by the aa-tRNA. Ribosomes progress along the strand of mRNA one codon at a time. Whenever a ribosome reads a new codon, it waits until an aa-tRNA arrives that contains a matching anticodon. The ribosome then accepts this new aa-tRNA and adds the attached amino acid to the growing chain of amino acids that is to become the new protein. Figure 2.2 shows this process.

[Figure: a ribosome on the mRNA strand UGG AAA GAU UUC, with tRNAs carrying the amino acids Trp, Lys, Asp and Phe; an arrow indicates the direction of movement.]

Figure 2.2: Ribosome progressing along a strand of mRNA from left to right.

A fine-grained model of the workings of a ribosome is presented in [3] and [5]. For the sake of completeness, the relevant steps in the model are repeated below. The steps dealing with the actual chemical reactions that are irrelevant to our model are left out. Figure 2.3 shows a graphical overview of the relevant steps. In general, one can distinguish two phases the ribosome progresses through when it has arrived at a new codon: peptidyl transfer and translocation. Peptidyl transfer consists of the following steps:

1. An aa-tRNA arrives at the ribosome by means of diffusion.

2. The ribosome establishes contact between the codon and the anticodon of the aa-tRNA.

3. The ribosome exhibits conformational changes based on the type of anticodon. An anticodon (or aa-tRNA) can be either cognate, near-cognate or non-cognate. Cognate anticodons have different conformational rates than near-cognates. Non-cognate anticodons yield no conformational change and dissociate from the ribosome almost immediately.

4. The ribosome experiences a number of chemical reactions to accommodate the newly recognized aa-tRNA. This happens rapidly for cognate aa-tRNA and more slowly for near-cognate aa-tRNA.

5. The ribosome attaches the protein that is being constructed to the amino acid that originates from the newly recognized aa-tRNA.

During the peptidyl transfer phase, the distinction between cognate, near-cognate and non-cognate aa-tRNA becomes apparent. Cognate aa-tRNA (or cognate anticodons, the term applies to both species) contains an anticodon that corresponds to the codon the ribosome currently resides on, enabling a strong binding between the two. This is in contrast with near-cognate aa-tRNA, which yields a weaker binding. Furthermore, near-cognate aa-tRNA does not carry the amino acid that is encoded by the codon. Since the binding of near-cognate aa-tRNA is strong enough to sometimes result in incorporation of the corresponding amino acid into the protein, the result of binding near-cognate aa-tRNA is erroneous elongation of the new protein. This is different for non-cognate aa-tRNA, as this type of aa-tRNA binds very poorly with the codon. Hence, binding non-cognate aa-tRNA does not result in elongation of the new protein. Erroneous elongation of a protein can result in different chemical properties. Depending on the severity of the error, the protein can either still be used in a normal way or can no longer perform its intended function. Note that erroneous elongation is not taken into account in the biological model used as a basis for our research.

The final phase a ribosome progresses through when it has arrived at a new location is the translocation phase. During this phase the ribosome moves forward one codon. As the new protein is now attached to the newly recognized aa-tRNA, the old aa-tRNA it was previously attached to dissociates from the ribosome.

[Figure: stages labelled initial binding, recognition and conformational changes, peptidyl transfer, and translocation.]

Figure 2.3: Peptidyl transfer and translocation phases.

The process of translation is one of the central steps in the creation of proteins. As our research is focused on analyzing ribosome processivity, it is important to take into account the biological model that has been introduced in this chapter. We are interested in predicting the time required to create specific proteins. Hence, our approach must account for the fact that ribosomes move along a strand of mRNA at a rate that depends on the codons in the strand. Furthermore, it must also be taken into account that the ribosome progresses through a peptidyl transfer phase and a translocation phase for each of the codons encountered.


Chapter 3

Effective Process Time

This chapter introduces the Effective Process Time (EPT) approach. The EPT approach can be applied to logistic processes in order to obtain estimates for their throughput and delay. It is a generic method that can be used to analyze a wide range of production lines. For example, in [6] the EPT approach is used to analyze production lines of up to 12 machines in a car manufacturing plant. Our research into ribosome processivity makes use of the EPT approach in order to predict the time required to create proteins. However, before applying the EPT approach to the process of translation, this chapter presents an overview of the general EPT approach.

Section 3.1 introduces the main concepts of EPT. It describes the setting in which the approach can be applied, as well as how it models the production line on which it operates. Section 3.2 introduces basic properties of the EPT approach that are used in later sections. Finally, Sections 3.3 and 3.4 present an analysis of the dependencies between the stochastic variables used by the EPT approach. This analysis is used by the eventual algorithm to iteratively determine new estimates for the throughput and delay of the production line.

3.1 Main concepts

This section introduces the main concepts of the EPT approach. As mentioned in the introduction of this chapter, the EPT approach can be used to analyze the throughput and delay of a logistic process. A logistic process is a process in which products flow through a production line. The products receive an operation at each machine in the line and the duration of this operation is dependent on the characteristics of the particular machine. A product is considered finished when it has received an operation from all machines in the production line. As the duration of each operation is dependent on the characteristics of a machine, it is not trivial to estimate the total time it takes to create a finished product. Certain effects, such as blocking and starvation, can have an impact on the throughput and delay of the production line. The EPT approach aims to estimate these characteristics while taking into account blocking and starvation. Previous work already demonstrated that these estimates are accurate under a number of assumptions [6].

The EPT approach that is used as a basis for our research originates from the work of P.W. Frenken [6]. In his master's thesis, he extends the previous work of Van Vuuren [8]. Similar to Van Vuuren, Frenken uses a decomposition algorithm in his EPT approach. However, Frenken extends the algorithm with a regenerative method that makes it possible to account for offsets in the service times of the machines in the line. Furthermore, his regenerative method yields better estimates and thus more accurate results. Before discussing Frenken's approach, it is necessary to define the concept of EPT.

The EPT is defined as the total time that a product claims capacity of a machine from a logistical point of view [6]. This includes all events that could cause a product to claim capacity of a machine. Therefore, it is not necessary to distinguish whether a product is waiting at a machine (e.g. because the next machine in the production line is not ready to receive new input) or is actually being processed by it. Thus, the EPT is an abstraction of all the capacity claimed by a product. By using this abstraction, the EPT approach can efficiently calculate estimates for the throughput and delay of the entire logistic process. These estimates can be obtained considerably faster than by discrete event simulations, which explicitly simulate every event occurring in the logistic process.

The input of the EPT approach consists of a description of a production line. For our purposes, only linear tandem lines need to be considered. Such a linear tandem line consists of a number of machines that operate on a product in a sequential manner. Furthermore, buffering products in between the machines is not allowed. Although it is possible to use more complicated production lines (assembly lines) that incorporate buffering [6], these features are not required for analyzing the process of translation. Figure 3.1 shows an example of a tandem line that consists of $K$ machines $M_i$, for $1 \le i \le K$. For each machine $M_i$, the EPT approach requires a description of the service time, which is denoted by $S_i$. This description is provided by stating the mean and variance of the service time for each $M_i$, denoted by $E[S_i]$ and $\mathrm{Var}(S_i)$ respectively. These characteristics can typically be collected via measurements. Given $E[S_i]$ and $\mathrm{Var}(S_i)$, the approach fits a phase-type distribution on the service time $S_i$. This enables the approach to infer the rate of $M_i$, as well as the distribution associated with its service time.
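The text does not state at this point which phase-type fit is used. Purely as an illustration, the sketch below applies a common two-moment recipe: a mixture of two Erlang distributions when the squared coefficient of variation is at most 1, and a two-phase hyperexponential distribution with balanced means otherwise. The function name `fit_phase_type` and its return format are hypothetical and are not taken from the thesis.

```python
import math


def fit_phase_type(mean: float, variance: float) -> dict:
    """Two-moment phase-type fit (illustrative; not necessarily the thesis's fit)."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    scv = variance / mean ** 2  # squared coefficient of variation
    if scv <= 1.0:
        # Mixture: Erlang(k-1) with probability p, Erlang(k) with 1 - p,
        # both with phase rate mu, where 1/k <= scv <= 1/(k-1).
        k = math.ceil(1.0 / scv)
        root = math.sqrt(max(0.0, k * (1.0 + scv) - k * k * scv))
        p = (k * scv - root) / (1.0 + scv)
        mu = (k - p) / mean
        return {"type": "mixed-erlang", "k": k, "p": p, "rate": mu}
    # Hyperexponential with two phases and balanced means.
    p1 = 0.5 * (1.0 + math.sqrt((scv - 1.0) / (scv + 1.0)))
    p2 = 1.0 - p1
    return {"type": "hyperexponential", "p": (p1, p2),
            "rates": (2.0 * p1 / mean, 2.0 * p2 / mean)}


# Hypothetical service-time characteristics of a single machine.
print(fit_phase_type(mean=1.0, variance=0.75))
```

Both branches reproduce the given mean and variance exactly, which is the only property the sketch relies on.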

[Figure: machines $M_1$, $M_2$, $M_3$, $M_4$, ..., $M_K$ placed in sequence.]

Figure 3.1: A tandem line consisting of $K$ machines.

Now that the initial characteristics of the machines in the tandem line are known, it is possible to use the EPT approach to estimate the throughput and delay of the production line. The first step consists of performing a decomposition of the line into subsystems. Without this decomposition, it would be hard to estimate the total throughput and delay for the entire production line. This is due to the stochastic behaviour of the machines in the line. However, decomposing the production line into smaller subsystems makes this problem manageable. Once the line is decomposed into subsystems, the approach associates stochastic variables with each subsystem. These variables are used by the approach to infer the throughput and delay of each subsystem. Table 3.1 gives an overview of these stochastic variables.

Variable              Description
$M_i$                 Machine $i$ in a tandem line.
$S_i$                 The service time of $M_i$.
$L_i$                 Subsystem $L_i$, consisting of machines $M_i$ and $M_{i+1}$.
$A_i$ / $A_{i,j}$     Arrival server of $L_i$. (1)
$D_i$ / $D_{i,j}$     Departure server of $L_i$.
$W_i$ / $W_{i,j}$     Additional waiting time of $A_i$.
$B_i$ / $B_{i,j}$     Additional blocking time of $D_i$.
$C_{i,j}$             The length of the $j$th cycle of $L_i$.
$T_{i,j}$             The throughput of $L_i$ in cycle $j$.

(1) Subscript $j$ denotes the value during the $j$th cycle.

Table 3.1: Stochastic variables used by the EPT approach.

Assume that the tandem line on which the EPT approach operates consists of $K$ machines. This means it is decomposable into $K - 1$ subsystems when each subsystem consists of two consecutive machines. Figure 3.2 shows the result of applying the decomposition. It can be seen that each subsystem has an overlap with both its preceding subsystem and its subsequent subsystem. The only exceptions are the first and last subsystems. Subsystems are denoted by $L_i$, for $1 \le i \le K - 1$. In general, it holds that $L_i$ consists of machines $M_i$ and $M_{i+1}$.
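The decomposition can be sketched in a few lines of Python. The function name `decompose` is an assumption introduced for illustration; machines are represented only by their index.

```python
def decompose(num_machines: int) -> list[tuple[int, int]]:
    """Return subsystems L_1..L_{K-1} as pairs (i, i + 1) of machine indices."""
    if num_machines < 2:
        raise ValueError("a tandem line needs at least two machines")
    return [(i, i + 1) for i in range(1, num_machines)]


subsystems = decompose(5)
print(subsystems)  # [(1, 2), (2, 3), (3, 4), (4, 5)]
```

Every machine except the first and the last appears in two subsystems, which is the overlap described above.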

[Figure: the tandem line $M_1, \ldots, M_K$ and, below it, the overlapping subsystems $L_1 = (M_1, M_2)$, $L_2 = (M_2, M_3)$, $L_3 = (M_3, M_4)$, ..., $L_i = (M_i, M_{i+1})$, ..., $L_{K-1}$.]

Figure 3.2: Decomposition of a tandem line.

The tandem line is decomposed into subsystems in order to limit the analysis of the throughput and delay to two servers at a time. However, it is still required to capture the stochastic behaviour of the entire line inside each subsystem. In order to do so, the two machines in each subsystem serve a different purpose. The first machine in $L_i$ is the so-called arrival server of $L_i$. Similarly, the last machine in $L_i$ is the departure server of $L_i$. The arrival server of $L_i$ captures the service time of machine $M_i$, including possible waiting time it needs in order to receive new input from $M_{i-1}$. This enables the approach to capture the behaviour of all preceding machines in the representation of $L_i$. Similarly, the departure server of $L_i$ captures the service time of machine $M_{i+1}$, including possible blocking time it incurs when $M_{i+2}$ is not ready to receive new input. Thus, the representation of $L_i$ also captures the behaviour of all subsequent machines. Figure 3.3 shows the servers associated with each subsystem. Furthermore, the approach associates stochastic variable $A_i$ with the arrival server of $L_i$. Similarly, stochastic variable $D_i$ is associated with the departure server of $L_i$. Both $A_i$ and $D_i$ are used by the approach to infer the duration of the total service time of the corresponding servers. Recall that the service time of $M_i$ is given by $S_i$. When we denote the possible waiting time of $M_i$ by $W_i$ and the possible blocking time of $M_{i+1}$ by $B_i$, we obtain the following representation for subsystem $L_i$:

$$A_i = W_i + S_i \tag{3.1}$$

$$D_i = S_{i+1} + B_i \tag{3.2}$$

The EPT approach uses the values of $A_i$ and $D_i$ for all $1 \le i \le K - 1$ to estimate the throughput and delay of the entire production line. Initially, these values are based on the service times as specified by each $S_i$. Once the EPT algorithm is running, it updates these values in an iterative way, yielding better approximations for $A_i$ and $D_i$ in each iteration. An analysis of how the values for $A_i$ and $D_i$ are determined is presented in Sections 3.3 and 3.4. Note that, from this point onward, the arrival server of $L_i$ is denoted by $A_i$ and the departure server of $L_i$ by $D_i$.

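[Figure: subsystem $L_i$ with machines $M_i$ and $M_{i+1}$, labelled as arrival server $A_i$ and departure server $D_i$ respectively.]

Figure 3.3: Servers in subsystem $L_i$.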

Before it is possible to state a number of basic properties of the EPT approach, it is necessary to introduce the concept of a cycle. A cycle of a subsystem is defined as the time it takes that subsystem to transfer a product from its arrival server to its departure server. More formally, a cycle of $L_i$ starts when $A_i$ delivers a product to $D_i$, after which $A_i$ starts working on a new product. Once the newly processed product has been transferred to $D_i$, the cycle is complete. Since $A_i$ cannot deliver a product to $D_i$ when $D_i$ is still busy processing a preceding product, the length of a cycle is defined as the maximum of $A_i$ and $D_i$. This also implies that there is no buffering between servers. When $C_{i,j}$ denotes cycle $j$ of $L_i$, we have the following representation for $C_{i,j}$:

$$C_{i,j} = \max\{A_{i,j}, D_{i,j}\} \tag{3.3}$$

In this last equation, $A_{i,j}$ and $D_{i,j}$ respectively denote the values of $A_i$ and $D_i$ in cycle $j$. An example of how $A_{i,j}$ and $D_{i,j}$ relate to $C_{i,j}$ can be seen in Figure 3.4. When all values for $A_{i,j}$ and $D_{i,j}$ are known for cycle $j$, the approach can infer the throughput of each subsystem. The throughput of $L_i$ in cycle $j$ is denoted by $T_{i,j}$. It holds that $T_{i,j} = \frac{1}{E[C_{i,j}]}$. Since $C_{i,j}$ converges when applying the EPT algorithm for an increasing number of cycles, $T_{i,j}$ converges as well. The algorithm uses this convergence to determine when the estimates for the throughput and delay are accurate enough and the algorithm can be stopped.
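The relation between cycle length and throughput can be illustrated by sampling. The sketch below is not the EPT algorithm itself; it merely estimates the reciprocal of the expected cycle length from equation (3.3) by Monte Carlo simulation. The function name `estimate_throughput` and the choice of exponential server times with mean 1 are assumptions made for illustration.

```python
import random


def estimate_throughput(sample_a, sample_d, cycles: int = 100_000) -> float:
    """Estimate T = 1 / E[C] with C = max(A, D) by sampling."""
    total = 0.0
    for _ in range(cycles):
        total += max(sample_a(), sample_d())  # cycle length, equation (3.3)
    return cycles / total


rng = random.Random(42)  # fixed seed so the run is reproducible
# Assumed for illustration: both server times are exponential with mean 1.
throughput = estimate_throughput(lambda: rng.expovariate(1.0),
                                 lambda: rng.expovariate(1.0))
print(round(throughput, 2))
```

For two independent exponential times with mean 1 the expected maximum is 1.5, so the estimate should lie close to 2/3.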

[Figure: timeline showing four consecutive cycles $C_{i,1}, \ldots, C_{i,4}$ of $L_i$, each with its arrival time $A_{i,j}$ and departure time $D_{i,j}$.]

Figure 3.4: Example showing four cycles of $L_i$.

3.2 Properties

Now that the main concepts of EPT have been introduced, it remains to elaborate on some properties that are used by the approach. These properties are used throughout the analysis of the stochastic variables as presented in Sections 3.3 and 3.4. Although the properties are direct corollaries of the definitions presented in Section 3.1, it is important to understand them before moving on to the analysis of the stochastic variables. The two properties are as follows:

Property 3.1. Both $A_{i,j}$ and $D_{i,j}$ start at the end of $C_{i,j-1}$.

This property is a direct consequence of the definition of $C_{i,j}$. Recall that equation (3.3) states that the length of $C_{i,j-1}$ is equal to the maximum of $A_{i,j-1}$ and $D_{i,j-1}$. This means that two cases can be distinguished: $A_{i,j-1} > D_{i,j-1}$ and $A_{i,j-1} \le D_{i,j-1}$. Figure 3.5 shows that in both cases the next cycle starts when the last of the two servers has finished servicing its current product. Only at this moment can a product be transferred between $M_i$ and $M_{i+1}$. This is not possible at an earlier time, as is illustrated by Figures 3.5(a) and 3.5(b). In Figure 3.5(a) it can be seen that $M_i$ is still servicing its product when $M_{i+1}$ has finished, and in Figure 3.5(b) it can be seen that $M_{i+1}$ is still busy servicing its product when $M_i$ has finished. Thus, by the definition of $C_{i,j}$, it holds that $A_{i,j}$ and $D_{i,j}$ start at the next moment a product is transferred between $M_i$ and $M_{i+1}$.
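Property 3.1 implies that the start time of each cycle is the sum of the lengths of all preceding cycles. A minimal sketch, assuming that the first cycle starts at time 0 and using hypothetical values for the arrival and departure times:

```python
from itertools import accumulate


def cycle_starts(arrival_times, departure_times):
    """Start time of each cycle, with the first cycle starting at time 0."""
    lengths = [max(a, d) for a, d in zip(arrival_times, departure_times)]
    # Cycle j starts when cycle j - 1 ends (Property 3.1).
    return [0.0] + list(accumulate(lengths))[:-1]


# Hypothetical values of A_{i,j} and D_{i,j} for four cycles.
starts = cycle_starts([3.0, 1.0, 2.0, 4.0], [2.0, 2.5, 2.0, 1.0])
print(starts)  # [0.0, 3.0, 5.5, 7.5]
```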

Property 3.2. The end of $D_{i,j-1}$ coincides with the start of $D_{i+1,j}$.


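[Figure: two timelines for machines $M_i$ and $M_{i+1}$ showing cycles $C_{i,j-1}$ and $C_{i,j}$ with $A_{i,j-1}$, $D_{i,j-1}$, $A_{i,j}$ and $D_{i,j}$, where $A_{i,j-1}$ is composed of $W_{i,j-1}$ and $S_i$. Panel (a): $A_{i,j-1} > D_{i,j-1}$. Panel (b): $A_{i,j-1} \le D_{i,j-1}$.]

Figure 3.5: Possible cases during cycle $j - 1$ of $L_i$.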


Again, this property is a direct consequence of the definitions. Recalling equation (3.2), it holds that $D_{i,j}$ consists of the service time of $M_{i+1}$ and an optional blocking time. As the tandem line is linear, $M_{i+1}$ can only be blocked when $M_{i+2}$ is not ready to accept a new product, i.e. $M_{i+2}$ is still processing a product or it is blocked by $M_{i+3}$. Furthermore, because the decomposition of the line is such that there is an overlap between subsystems, it holds that $M_{i+1}$ corresponds to $D_i$ and $A_{i+1}$. As a consequence, it holds that whenever $D_{i,j-1}$ ends, $M_{i+1}$ transfers its $(j-1)$th product to $M_{i+2}$ and simultaneously ends cycle $j - 1$ of $L_{i+1}$. Hence, $L_{i+1}$ starts cycle $j$ and by Property 3.1 it holds that $D_{i+1,j}$ starts. Figure 3.6 illustrates this relation.

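[Figure: timelines for machines $M_i$, $M_{i+1}$ and $M_{i+2}$ showing $A_{i,j-1}$ and $D_{i,j-1}$ in cycle $C_{i,j-1}$, followed by $A_{i+1,j}$ and $D_{i+1,j}$ in cycle $C_{i+1,j}$.]

Figure 3.6: Relation between $D_{i,j-1}$ and $D_{i+1,j}$.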

3.3 Analyzing the arrival servers

This section presents an overview of how the EPT approach determines the value of $A_i$. As mentioned in the previous sections, the algorithm determines this value iteratively by inferring the value of $A_{i,j}$ during each cycle. Furthermore, recall that the tandem line is decomposed into subsystems consisting of two servers. The use of this decomposition becomes apparent in the analysis presented in this section. We start by examining the subsystems involved in the determination of $A_{i,j}$: $L_i$ and $L_{i-1}$. These subsystems can be seen in Figure 3.7.

The approach needs to determine the value of $A_{i,j}$. Equation (3.1) has already shown that $A_i$ depends on $S_i$ and $W_i$. Extending this equation to incorporate the current cycle $j$, we obtain the following representation for $A_{i,j}$:

$$A_{i,j} = W_{i,j} + S_i \tag{3.4}$$

The value of $S_i$ is not cycle-dependent as the service times of the machines in the line do not change between cycles. This only leaves the value of $W_{i,j}$ to be inferred. Because $W_{i,j}$ represents the time that $M_i$ is waiting before it receives new input from $M_{i-1}$, the value of this variable can change between cycles. Note that the input of $M_i$ in cycle $j$ has to be produced in cycle $j - 1$ by $M_{i-1}$. This can be seen in Figure 3.8.


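[Figure: subsystem $L_{i-1}$ with machines $M_{i-1}$ and $M_i$ (servers $A_{i-1}$ and $D_{i-1}$) and subsystem $L_i$ with machines $M_i$ and $M_{i+1}$ (servers $A_i$ and $D_i$).]

Figure 3.7: Subsystems relevant for determining $A_{i,j}$.

[Figure: timeline of cycle $C_{i,j}$ showing $A_{i,j}$, composed of $W_{i,j}$ followed by $S_i$, alongside $D_{i,j}$, with events $E_1$, $E_2$ and $E_3$ marked.]

Figure 3.8: Events relevant for inferring $A_{i,j}$.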

Figure 3.8 shows cycle $j$ of subsystem $L_i$. The goal of the approach is to determine the value of $A_{i,j}$. While determining this value, the length of $D_{i,j}$ is irrelevant. Only the events that are denoted in the figure can influence the length of $A_{i,j}$:

• $E_1$: cycle $j$ starts and $M_i$ starts waiting for new input.

• $E_2$: $M_i$ receives input and starts processing.

• $E_3$: $M_i$ produces output.

Event $E_3$ indicates that $M_i$ is ready to send its output to $M_{i+1}$. Recalling equation (3.3), the length of cycle $j$ depends on whether or not $M_{i+1}$ has finished when this event occurs. Looking at Figure 3.8, it becomes apparent that the value of $W_{i,j}$ directly relates to $E_1$ and $E_2$. Whenever there is input ready for $M_i$ upon the start of cycle $j$, it holds that $E_1 = E_2$. This causes $W_{i,j}$ to reduce to 0. In the other case, when there is no input ready for $M_i$ at the start of cycle $j$, the value of $W_{i,j}$ is relevant for determining $A_{i,j}$. Therefore, it is necessary to distinguish the two cases when determining $A_{i,j}$. Figure 3.9 shows how the approach captures this distinction.

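$$A_{i,j} = \begin{cases} S_i & \text{with probability } f_{i-1,j-1} \\ S_i + (A_{i-1,j-1} - D_{i-1,j-1}) & \text{with probability } 1 - f_{i-1,j-1} \end{cases}$$

Figure 3.9: Dependencies of $A_{i,j}$.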

In Figure 3.9 the probability that $E_1 = E_2$ is denoted by $f_{i-1,j-1}$. It captures the fact that $M_i$ has input available at the start of cycle $j$, meaning that it has been produced by $M_{i-1}$ in cycle $j - 1$. Therefore, it holds that $f_{i-1,j-1} = P[A_{i-1,j-1} \le D_{i-1,j-1}]$. Whenever the event corresponding to this probability occurs, $A_{i,j}$ simply reduces to $S_i$. When it does not occur, $A_{i,j} = S_i + W_{i,j}$ and $W_{i,j}$ reduces to the difference between the lengths of $A_{i-1,j-1}$ and $D_{i-1,j-1}$: $W_{i,j} = A_{i-1,j-1} - D_{i-1,j-1}$. Note that $W_{i,j}$ is at least 0, as the value is conditioned on the complementary event $A_{i-1,j-1} > D_{i-1,j-1}$, which has probability $1 - f_{i-1,j-1}$.
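The case distinction for the arrival server can be sketched for concrete values. Both cases collapse into a single expression, because the waiting time is the positive part of the difference between the upstream arrival and departure times. The function names below are hypothetical, and the second function estimates the probability that input is ready from observed pairs rather than from fitted distributions, which is an assumption made for illustration.

```python
def arrival_time(service: float, prev_arrival: float, prev_departure: float) -> float:
    """Value of A_{i,j} given S_i, A_{i-1,j-1} and D_{i-1,j-1}."""
    # W_{i,j} is zero when M_{i-1} finished in time, i.e. A <= D upstream.
    waiting = max(0.0, prev_arrival - prev_departure)
    return service + waiting  # equation (3.4)


def input_ready_fraction(pairs) -> float:
    """Empirical estimate of f_{i-1,j-1} = P[A_{i-1,j-1} <= D_{i-1,j-1}]."""
    pairs = list(pairs)
    return sum(a <= d for a, d in pairs) / len(pairs)


print(arrival_time(2.0, 5.0, 3.0))  # 4.0: input arrives 2.0 time units late
print(arrival_time(2.0, 1.0, 3.0))  # 2.0: input is ready, no waiting
```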

3.4 Analyzing the departure servers

Now that it is known how to determine $A_i$, the next stochastic variable that needs to be determined by the approach is $D_i$. Similar to $A_i$, the algorithm determines this value iteratively by inferring the value of $D_{i,j}$ during each cycle. However, determining the value of $D_{i,j}$ is more involved and requires a more elaborate analysis. Figure 3.10 shows the subsystems involved in the determination of $D_{i,j}$: $L_i$ and $L_{i+1}$.

Figure 3.10: Subsystems relevant for determining Di,j. (Subsystem Li consists of Mi and Mi+1 with servers Ai and Di; subsystem Li+1 consists of Mi+1 and Mi+2 with servers Ai+1 and Di+1.)

The goal of the analysis is to determine the dependencies of Di,j. Once these dependencies are known, the approach can use these to infer the value of Di,j. Similar to the determination of Ai,j, equation (3.2) can be extended to incorporate the current cycle j:

Di,j = Si+1 +Bi,j (3.5)

As before, Si+1 is not cycle-dependent. This only leaves the value of Bi,j to be determined. Recall that Bi,j denotes the time that Di is blocked from sending its output to Mi+1 during cycle j. Figure 3.11 shows how the value of Bi,j relates to the value of Di,j.

Figure 3.11: Events relevant for inferring Di,j. (Timeline of cycle Ci,j showing the events E1, E2 and E3 and the intervals Ai,j, Di,j and Bi,j.)

Figure 3.11 shows cycle j of subsystem Li. Furthermore, the events that can influence the length of Di,j are indicated:

• E1: the start of cycle j.

• E2: Mi+1 produces output and becomes blocked by Mi+2.

• E3: Mi+2 is ready to accept input and unblocks Mi+1.

Note that both Mi and Mi+1 start processing a new product at E1. Machine Mi+1 becomes blocked between the occurrence of E2 and E3, i.e. it has to wait until E3 occurs before it can begin servicing a new product. Figure 3.11 also shows that the length of Bi,j directly relates to the period between E2 and E3. Only when E3 occurs before E2, Bi,j reduces to 0 and Mi+1 does not become blocked.

It is not straightforward to determine exactly when E2 and E3 occur. Therefore, it requires a more elaborate analysis to be able to determine the length of Di,j. In order to do so, a case distinction has to be made. This case distinction pertains to cycle j − 1 of Li. The following two cases can be distinguished:

1. Ai,j−1 > Di,j−1: Di must wait for new input from Ai before the next cycle starts.

2. Ai,j−1 ≤ Di,j−1: Ai becomes blocked by Di and must wait before the next cycle starts.

Each case has different implications for the length of Bi,j, and thus also for the length of Di,j. In the first case, the probability that E3 occurs before E2 is increased. This is different for the second case. Therefore, both cases need a separate analysis in order to be able to determine the value of Di,j. These analyses are presented separately in the next two sections.

Server Di must wait

This section discusses the first case as mentioned above. In this case it holds that Ai,j−1 > Di,j−1. Thus, during cycle j − 1 of subsystem Li, Di finishes first and must wait until Ai is finished before the next cycle can start. This has implications for the value of Bi,j, which will become clear when looking at Figure 3.12.

Figure 3.12: Cycle j − 1 and j where Ai,j−1 > Di,j−1. (Timeline of cycles Ci,j−1 and Ci,j showing the events E1 to E5, the intervals Ai,j−1, Di,j−1, Ai,j, Di,j and Bi,j, and the difference Ai,j−1 − Di,j−1.)

Figure 3.12 shows cycles j − 1 and j of subsystem Li. There are five important events that are indicated in this figure:

• E1: the start of cycle j − 1.

• E2: Mi+1 produces output that is transferred to Mi+2.

• E3: Mi produces output and cycle j starts.

• E4: Mi+1 produces new output.

• E5: Mi+2 is ready to accept new input.



A crucial observation is that event E2 indicates that machine Mi+2 has accepted the output produced by machine Mi+1. Recalling Property 3.2, it also indicates that Li+1 has already started cycle j and Di+1,j is (possibly) waiting for new input. This has consequences for determining the values of Di,j and Bi,j. As Bi,j is defined to be the length of the period between E4 and E5, it is necessary to know when E5 occurs whenever E4 occurs (i.e. when Mi+2 is ready to accept new input whenever Mi+1 has produced new output).

At the start of cycle j there are two options: Mi+2 is still busy processing the product it received at E2 or it has already finished doing so. The latter implies that E5 occurs before E3 and E4. As a consequence, Bi,j reduces to 0 as machine Mi+1 never becomes blocked during cycle j. The former is more intricate: when event E3 occurs, the remaining service time of Mi+2 is equal to Di+1,j − (Ai,j−1 − Di,j−1). This is because Mi+2 is already in cycle j and the additional time Li required to complete cycle j − 1 can be subtracted. Both cases are illustrated in Figure 3.13.

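Figure 3.13: Timeline showing Li and Li+1 when Ai,j−1 > Di,j−1. (Two timelines for Mi, Mi+1 and Mi+2 with the events E1 to E5; in the topmost case E5 occurs after E3 and E4, in the lower case E5 occurs before E3 and E4.)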

The topmost case in Figure 3.13 occurs when Di+1,j > Ai,j−1 − Di,j−1. In this case, the value of Di,j becomes the maximum of Si+1 and the remaining service time of Mi+2: Di,j = max{Si+1, Di+1,j − (Ai,j−1 − Di,j−1)}. Similarly, the lower case in Figure 3.13 occurs when Di+1,j ≤ Ai,j−1 − Di,j−1, causing the value of Di,j to reduce to Si+1. Note that only in this case it can be determined that Bi,j = 0. In the other case, the value of Bi,j cannot be determined explicitly. This does not matter for our purposes, as the goal is to only determine the value of Di,j. Figure 3.14 shows how the value of Di,j is conditioned with probability pi,j = P[Di+1,j > Ai,j−1 − Di,j−1].

Figure 3.14: Dependencies of Di,j when Ai,j−1 > Di,j−1. With probability pi,j: Di,j = max{Si+1, Di+1,j − (Ai,j−1 − Di,j−1)}. With probability 1 − pi,j: Di,j = Si+1.

Server Ai becomes blocked

This section discusses the case in which it holds that Ai,j−1 ≤ Di,j−1. A consequence of this condition is that the length of Di,j−1 determines the length of cycle j − 1 of Li. As opposed to the preceding case, machine Mi+2 does not start processing cycle j of Li+1 until cycle j − 1 of Li has ended.

Figure 3.15 shows cycles j − 1 and j of subsystem Li. Again, five important events can be distinguished during these two cycles:

• E1: the start of cycle j − 1.

• E2: Mi produces output and becomes blocked by Mi+1.

• E3: Mi+1 produces output and cycle j starts.


Figure 3.15: Cycle j − 1 and j where Ai,j−1 ≤ Di,j−1. (Timeline of cycles Ci,j−1 and Ci,j showing the events E1 to E5 and the intervals Ai,j−1, Di,j−1, Ai,j, Di,j and Bi,j.)


Figure 3.16: Timeline showing Li and Li+1 when Ai,j−1 ≤ Di,j−1. (Timeline for Mi, Mi+1 and Mi+2 with the events E1 to E5.)

Due to Property 3.2, in this case it holds that Di+1,j starts at the same time as Di,j. This can clearly be seen in Figure 3.16 and causes the value of Di,j to become the maximum of Si+1 and Di+1,j. Although it is again not possible to explicitly determine the value of Bi,j, the following representation characterizes Di,j in this particular case:

Di,j = max{Si+1, Di+1,j} (3.6)

Implications for Di,j

Now that both cases for cycle j − 1 of Li have been discussed, it is possible to construct a general representation for Di,j. This representation is presented in Figure 3.17. It can be seen that Di,j is first conditioned on which server in Li finishes first during cycle j − 1. Note that the probability that is used to condition Di,j, fi,j−1, is equal to the probability that is used to condition Ai,j. After conditioning, Di,j reduces to equation (3.6) in case Ai,j−1 ≤ Di,j−1. In the other case, the value of Di,j needs to be conditioned on the remaining service time of Mi+2. This is captured by probability pi,j, as defined in the first case above.

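Figure 3.17: Dependencies of Di,j. With probability fi,j−1: Di,j = max{Si+1, Di+1,j}. With probability 1 − fi,j−1, Di,j is conditioned further: with probability pi,j: Di,j = max{Si+1, Di+1,j − (Ai,j−1 − Di,j−1)}; with probability 1 − pi,j: Di,j = Si+1.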
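The three branches for Di,j can be written out for individual sample values. This is only a sketch under the assumption that concrete values of Si+1, Di+1,j, Ai,j−1 and Di,j−1 are given; the function name is hypothetical and the approach itself operates on fitted distributions rather than samples.

```python
def departure_time(s_next: float, d_next: float,
                   a_prev: float, d_prev: float) -> float:
    """Sample value of D_{i,j}.

    s_next is S_{i+1}, d_next is D_{i+1,j},
    a_prev is A_{i,j-1} and d_prev is D_{i,j-1}.
    """
    if a_prev <= d_prev:
        # Server A_i was blocked in cycle j-1: equation (3.6).
        return max(s_next, d_next)
    # Server D_i had to wait: L_{i+1} is ahead by A_{i,j-1} - D_{i,j-1}.
    lag = a_prev - d_prev
    if d_next > lag:
        # M_{i+2} is still busy; only its remaining service time counts.
        return max(s_next, d_next - lag)
    # M_{i+2} has already finished, so M_{i+1} is never blocked.
    return s_next
```

For example, with Si+1 = 1, Di+1,j = 5, Ai,j−1 = 4 and Di,j−1 = 2 the lag is 2, the remaining service time of Mi+2 is 3, and the sketch returns 3.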

All the dependencies of Ai and Di have now been considered. These dependencies allow the EPT approach to be converted into an algorithm that operates on a generic production line. However, applying the EPT approach in the context of translation requires several adaptations to the initial EPT algorithm. Therefore, the next chapter first discusses the adaptations before presenting the accompanying algorithm.

Chapter 4

Modeling translation with EPT

The goal of our research is to analyze ribosome processivity. This requires having a model of translation that corresponds with the translation process as introduced in Chapter 2. Therefore, this chapter explores multiple approaches to constructing such a model. One of these approaches consists of using Markov chains to construct a model and applying probabilistic model checking to obtain results. This model checking approach is presented in Section 4.1. Another approach consists of modeling the mRNA as a production line and applying the EPT approach introduced in Chapter 3. However, the EPT approach as presented cannot be applied directly to the process of translation. Some aspects of translation require adaptations to the approach. Without these adaptations, the production line that is used to model the mRNA does not accurately reflect the biological model. The adapted EPT approach is presented in Section 4.2.

The two approaches that are considered in this chapter extend the work of Bosnacki et al. In [2] they present a formal analysis of ribosome kinetics by making use of probabilistic model checking. Using the model checker Prism, they construct a model of translation that is based on the detailed model of ribosome kinetics presented in [5]. This model is different from the model we aim to construct as it concentrates on the translation steps occurring at the ribosome binding site. Our model is constructed at a higher level of abstraction; it considers individual genes and we are interested in predicting the time required to construct specific proteins. However, the model of Bosnacki et al. is able to infer the average insertion times and insertion errors for each particular codon. We make use of these average insertion times because our model focusses primarily on predicting the construction time of proteins.

4.1 Modelchecking

One of the approaches to constructing a model of translation consists of using Markov chains. However, defining the model as an explicit Markov chain is cumbersome as each state needs to be enumerated explicitly. Fortunately, the model checker Prism offers an intermediate specification language that can ease this process. The specification language of Prism is a simple state-based language, allowing models to be specified as a collection of modules. These modules each have an internal state and operate in parallel. Furthermore, Prism offers a synchronization mechanism that enables interaction between the modules. Prism supports three types of probabilistic models in its specification language: discrete-time Markov chains (DTMCs), continuous-time Markov chains (CTMCs) and Markov decision processes (MDPs).

The objective is to construct a model that covers the translation of mRNA and is able to predict the time required to construct a particular protein. This asks for a level of abstraction that incorporates a strand of mRNA as a whole and involves multiple ribosomes. Furthermore, the model is specified as a CTMC in order to respect the exponentially-distributed reactions occurring in the biological process. Once the model is constructed, the probabilistic model checking facilities of Prism can be used to infer the total translation time. The following aspects have to be considered in constructing the model:

1. The model is based on a single strand of mRNA. As a consequence, the model is unique to a particular gene.

2. Ribosomes move along the mRNA. That is, ribosomes arrive at the mRNA with a certain rate and move along the mRNA with a rate that is dependent on the codon the ribosome resides on.

3. Multiple ribosomes move along the mRNA simultaneously. As ribosomes cannot move through each other, they can form queue-like structures. These queue-like structures are called polysomes (or polyribosomes). The phenomenon of polysome formation is termed ribosome crowding [9].

4. Each ribosome physically occupies 12 codons [9].

The first aspect states that the model is unique to a particular gene. As a consequence, a new model has to be constructed for each gene. This is necessary as the mRNA that is considered in the model is different for each gene. The second aspect states that the rate at which the ribosome moves along the mRNA depends on the codon the ribosome resides on. This means that the model must implement 64 different rates, one for each codon. These rates are based on those that are derived in [2]. However, there is a minor difference with respect to the rates as mentioned in [2]. The rates that are inferred by the model of Bosnacki et al. pertain to an entire translation step of the ribosome. In our model the peptidyl transfer phase and the translocation phase are considered as two separate steps. Together, these two steps comprise a single translation step of the ribosome. In order to use the rates as presented in [2], we re-evaluated the model of Bosnacki et al. and used the model checker to only obtain the rates of the peptidyl transfer phase. This has been done for each of the 64 codons. The rate of the translocation phase is taken directly from [2]. The third aspect states that ribosomes can form polysomes. This phenomenon of ribosome crowding can have an impact on the translation time. Therefore, it is crucial that the model incorporates this phenomenon accurately. The fourth aspect states that a ribosome occupies 12 codons. This width has to be taken into account by the model.

Now that it is known which aspects have to be incorporated into the model, it is possible to specify a model in Prism’s specification language. The model consists of only a single module in which the mRNA is encoded explicitly. To accurately reflect the mRNA, the module contains state variables for each codon. The encoding of these state variables allows for three different states per codon:

• 0: No ribosome is present on the codon.

• 1: A ribosome is present on the codon and it is ready to perform a peptidyl transfer step.

• 2: A ribosome is present on the codon and it is ready to perform a translocation step.

A crucial observation is that a ribosome can only become blocked by preceding ribosomes when it is ready to perform a translocation step. This is due to the fact that the ribosome physically moves to a new codon during the translocation step. Furthermore, the fourth aspect stated above implies that the ribosome is only allowed to perform this step when the next 12 codons are not occupied by preceding ribosomes. Therefore, it is necessary to express this condition in the transitions of the module. These transitions are specified for each codon. There is one transition for each codon that specifies that the ribosome can perform a peptidyl transfer step. The rate of this transition depends on the type of codon and has been inferred from [2] as explained above. Furthermore, there is also a transition for each codon that specifies that the ribosome can perform a translocation step. This transition is only enabled when the next 12 codons are not occupied. Again, the rate of this transition is obtained from [2] as explained above. Note that the first and last codons in the strand are exceptions and require special transitions. It is assumed that new ribosomes always arrive at the mRNA whenever the first 12 codons are not occupied. Therefore, the model incorporates a special transition expressing this condition. Furthermore, the last codon has a special transition expressing that ribosomes leave the strand during the last translocation step. Figure 4.1 shows an example of this modeling approach. It can be seen that the first ribosome on the mRNA is prevented from performing a translocation step as the next 12 codons are still occupied. The second ribosome is ready to perform a peptidyl transfer step and the last ribosome is ready to leave the strand with a translocation step.

Figure 4.1: Example of a Prism model. (Three ribosomes with codon states 2, 1 and 2; the corresponding next steps are translocation, peptidyl transfer and translocation.)
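The guards on the translocation and arrival transitions can be sketched outside of Prism. The snippet below is an illustration in Python rather than the Prism specification itself; the list representation, the function names and the constant FOOTPRINT are assumptions, and the codon states 0, 1 and 2 follow the encoding given above.

```python
FOOTPRINT = 12  # number of codons that must be free ahead of a ribosome


def can_translocate(codons: list[int], k: int, footprint: int = FOOTPRINT) -> bool:
    """True if the ribosome on codon k (0-based) may perform a translocation step."""
    if codons[k] != 2:
        # Only a ribosome that has finished peptidyl transfer can translocate.
        return False
    # Slicing stops at the end of the strand, so a ribosome on the last
    # codon sees no codons ahead and is free to leave the strand.
    ahead = codons[k + 1:k + 1 + footprint]
    return all(state == 0 for state in ahead)


def can_initiate(codons: list[int], footprint: int = FOOTPRINT) -> bool:
    """True if a new ribosome may arrive, i.e. the first codons are unoccupied."""
    return all(state == 0 for state in codons[:footprint])
```

In the situation of Figure 4.1, the guard fails for the first ribosome because a codon within the next 12 positions is occupied, while it holds for the ribosome on the last codon.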

Using the model presented above, the model checker can be used to infer the time required to construct a specific protein. Unfortunately, it turns out that this type of model exhibits a state space explosion. Even for short genes the approach is infeasible and the model checker cannot obtain results. However, it is possible to limit the number of simultaneous ribosomes on an mRNA to two. This greatly reduces the number of states in the model and prevents the state space explosion.

The idea of the second modelchecking approach is to let the first ribosome on the mRNA, the so-called leading ribosome, move along the strand at a fixed rate. The second ribosome on the mRNA is the actual ribosome that is used by the model checker to infer the construction time of the protein. The modelchecker is invoked iteratively and after each iteration the rate of the leading ribosome is updated. The new rate of the leading ribosome is based on the protein construction time inferred during the preceding iteration. A graphical overview of this model can be seen in Figure 4.2. Note that an additional state has been added to the state variables to account for the presence of the leading ribosome.

Figure 4.2: Example of a Prism model limited to two simultaneous ribosomes. (Codon states shown: 2 and 3; labels: leading ribosome, translocation.)

The second modelchecking approach requires invoking the modelchecker iteratively in order to obtain results. The number of times the modelchecker has to be invoked is bounded as only a finite number of ribosomes can be present on a real mRNA. When the maximum number of ribosomes allowed for a specific mRNA is x, the modelchecker has to be invoked exactly x times. This way, the approach estimates protein construction time under the condition that the mRNA is saturated with ribosomes. However, results indicate that the approach is not accurate and that it suffers from an accumulation of errors. These errors grow with each iteration and result in inaccurate estimates returned by the modelchecker. This behaviour is not unexpected as the approach only takes into account the first moment of the rates associated with the reactions. Using higher order moments could reduce the error, which is exactly what is done by the approach presented in the next section.

4.2 Adapting the EPT approach

Chapter 3 introduced the EPT approach. It was explained that the approach can be used to obtain estimates for the throughput and delay of logistic processes. As was mentioned in the beginning of this chapter, the approach itself can also be used to model the process of translation. In order to apply the EPT approach to translation, it is required to consider translation as a logistic process. This view of translation is different from the view employed in the modelchecking approach of the last section. Therefore, this section explains how the EPT approach can be used in the context of translation and what adaptations are necessary to the initial EPT approach presented in Chapter 3. The adaptations result in an algorithm that can obtain estimates for the time required to construct particular proteins. The algorithm itself is presented in the next section.

First, it is explained how translation can be viewed as a logistic process. Translation can be considered a logistic process where the finished products correspond to proteins and the machines in the production line to codons in the mRNA. Thus, the operations performed by the machines correspond to adding amino acids to the growing peptide chain. Note that the ribosome itself is not explicitly modeled in this view. In order to incorporate ribosomes into the model, two adaptations need to be made to the initial EPT approach. The first adaptation consists of adding a deterministic offset to the service time of each machine. This deterministic offset corresponds to the length of the translocation phase. As a result, only the peptidyl transfer phase accounts for the rate differences between codons. The second adaptation is to make use of periodic service times for each machine in the line. By incorporating these periodic service times, the approach enforces a certain distance between products in the line. This distance corresponds with the width of a ribosome. The adaptation allows the approach to account for the phenomenon of ribosomal crowding which was explained in Section 4.1. Furthermore, all aspects that are mentioned in Section 4.1 still hold for the modeling approach presented in this section.

Offsets

The first adaptation consists of adding deterministic offsets to the service times of the machines. This deterministic offset is equal for all machines in the line. Recall from Chapter 3 that the service time of each machine is denoted by Si. The input of the initial EPT approach specifies the characteristics of these Si by stating E[Si] and Var(Si). However, this representation of Si changes when incorporating the deterministic offsets into the approach.

Fortunately, Frenken already added offsets in his thesis by providing an improved EPT approach [6]. In this improved EPT approach, the service time of Mi is still denoted by Si, but now each Si consists of a deterministic part and a stochastic part. More specifically, it holds that Si = Y + Zi, where 1 ≤ i ≤ K and 0 ≤ Y < E[Si] for a line consisting of K machines. In this representation, Y corresponds to the deterministic offset and Zi corresponds to the stochastic part. In our model of translation it holds that Zi corresponds to the peptidyl transfer phase and Y to the translocation phase. Figure 4.3 shows how the service time is characterized in this representation. Note that the order of the deterministic part and the stochastic part does not matter in the context of translation. This is because only the total duration of all translation steps, each one consisting of a peptidyl transfer phase and a translocation phase, is of relevance when estimating the time required to create a protein.

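Figure 4.3: Adding offsets to Si. (In the original EPT approach the service time Si is a single stochastic part; in the adapted EPT approach Si is split into a deterministic part Y and a stochastic part Zi.)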
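The effect of the split Si = Y + Zi on the first two moments follows from Y being a constant: the mean shifts by Y and the variance is unchanged. The sketch below makes this explicit; the function name and the returned squared coefficient of variation are illustrative additions and not part of the approach as described in [6].

```python
def split_service_time(mean_s: float, var_s: float, offset: float):
    """Return (E[Z_i], Var(Z_i), SCV of Z_i) for S_i = Y + Z_i with constant Y."""
    if not 0 <= offset < mean_s:
        raise ValueError("the offset must satisfy 0 <= Y < E[S_i]")
    mean_z = mean_s - offset      # E[Z_i] = E[S_i] - Y
    var_z = var_s                 # a constant shift does not change the variance
    scv_z = var_z / mean_z ** 2   # squared coefficient of variation of Z_i
    return mean_z, var_z, scv_z
```

With E[Si] = 5, Var(Si) = 4 and Y = 1 this yields E[Zi] = 4, Var(Zi) = 4 and a squared coefficient of variation of 0.25, compared with 0.16 for Si itself.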

Adding offsets to each Si has no major consequences for the dependencies between the stochastic variables as introduced in Chapter 3. The only consequences pertain to the derivation of the probabilities fi,j and pi,j, as well as the representation of the variables Ai,j and Di,j. For a complete derivation of these changes the reader is referred to [6]. The most important consequence of incorporating offsets is that Ai,j and Di,j now consist of a deterministic offset and a stochastic part. The new representation of these variables becomes as follows:

Ai,j = Y +A′i,j (4.1)

Di,j = Y +D′i,j (4.2)

An overview of the dependencies of Ai,j and Di,j that reflects these adaptations is given at the end of this section.

Periodic service times

The second adaptation to the initial EPT approach consists of incorporating periodic service times. As mentioned in Section 4.1, ribosomes physically occupy 12 codons. Our model needs to reflect this aspect in order to accurately model the biological process. We have done this by enforcing a distance of 11 machines between the products in the line. This is done by having each machine process a product only once every 12 cycles. During the 11 cycles that a machine is not processing a product, it has a service time that is characterized by E[Si] = 0. This effectively reduces Mi to a buffering place during those cycles.

The service time of each machine behaves with a period of 12 cycles. These periods are implemented such that when a machine is processing a product, the 11 preceding and 11 subsequent machines operate as buffering places. Figure 4.4 shows how the periods are implemented.

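Figure 4.4: Periodic service times. (Table with one row per cycle, shown for cycles 1, 2, 3, ..., 12, 13, 14, ..., and one column per machine, shown for M1, M2, M3, ..., M12, M13, M14. In each row a machine Mi either has its real service time Si or a service time of 0; the entries Si recur with a period of 12.)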

Incorporating the periodic service times in the initial EPT approach only requires one adaptation: instead of having a cycle-independent Si, Si now depends on the current cycle j. Therefore, the new representation for the service time of Mi becomes Si,j. This change is reflected by the dependencies that are given at the end of this section.

Dependencies

Now that the EPT approach has been adapted to the context of translation, the dependencies between the stochastic variables Ai,j and Di,j need to be updated in order to reflect the changes. Figure 4.5 shows the new dependencies.

4.3 Adapted algorithm

The previous section introduced two adaptations to the initial EPT approach. These adaptations made it possible to apply the approach in the context of translation. However, it remains to state the algorithm that results from incorporating the adaptations. This algorithm enables the approach to be applied to generic production lines. In Chapter 5 it can be seen how the algorithm is applied to a number of case studies involving real genes. This section introduces the algorithm by presenting an overview that is accompanied by pseudocode.

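Figure 4.5: Dependencies of Ai,j and Di,j.

Di,j: with probability fi,j−1: Di,j = max{Si+1,j, Di+1,j}. With probability 1 − fi,j−1, Di,j is conditioned further: with probability pi,j: Di,j = max{Si+1,j, Di+1,j − (A′i,j−1 − D′i,j−1)}; with probability 1 − pi,j: Di,j = Si+1,j.

Ai,j: with probability fi−1,j−1: Ai,j = Si,j. With probability 1 − fi−1,j−1: Ai,j = Si,j + (A′i−1,j−1 − D′i−1,j−1).

fi−1,j−1 = P[A′i−1,j−1 ≤ D′i−1,j−1]

fi,j−1 = P[A′i,j−1 ≤ D′i,j−1]

pi,j = P[D′i+1,j > A′i,j−1 − D′i,j−1]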

The input of the algorithm consists of a description of a production line. If the production line consists of K machines, then the input of the algorithm consists of E[Si], Var(Si) and E[Y] for all 1 ≤ i ≤ K. Note that Y denotes the deterministic offset of Si and E[Y] is assumed to be equal for all Si. Furthermore, the input of the algorithm also consists of the maximum number of cycles it has to iterate, as well as the period of the service times of the machines. As is explained in Section 4.1, in the context of translation this period is equal to 12.

The outline of the algorithm is as follows:

1. Initialization: initialize the periodic service time table.

2. For each cycle j:

(a) Forward determination of all Ai,j: starting from the first subsystem, go forward through the line and determine the new values of all Ai,j.

(b) Backward determination of all Di,j: starting from the last subsystem, go backward through the line and determine the new values of all Di,j.

(c) Determine estimates: determine new estimates for the throughput and delay.
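The order of the forward and backward determination can be illustrated with a simplified sweep. The sketch below is not Algorithm 1: it assumes deterministic sample values instead of fitted distributions, takes the service times as a table S[j][m] (cycle j, machine m, both 0-based), and handles the first arrival server and last departure server in a simplified way. The name sweep is hypothetical.

```python
def sweep(S: list[list[float]]):
    """Forward/backward determination of A and D for deterministic service times."""
    cycles = len(S)
    K = len(S[0])          # number of machines
    N = K - 1              # number of subsystems
    A = [[0.0] * N for _ in range(cycles)]
    D = [[0.0] * N for _ in range(cycles)]
    for i in range(N):     # initialization for cycle 0
        A[0][i] = S[0][i]
        D[0][i] = S[0][i + 1]
    for j in range(1, cycles):
        # Forward determination: A depends on the upstream subsystem in cycle j-1.
        A[j][0] = S[j][0]  # assumption: input is always available
        for i in range(1, N):
            A[j][i] = S[j][i] + max(0.0, A[j - 1][i - 1] - D[j - 1][i - 1])
        # Backward determination: D depends on the downstream D of the same cycle.
        D[j][N - 1] = S[j][N]  # assumption: products can always leave
        for i in range(N - 2, -1, -1):
            lag = max(0.0, A[j - 1][i] - D[j - 1][i])
            D[j][i] = max(S[j][i + 1], D[j][i + 1] - lag)
    return A, D
```

The backward loop must run after the forward loop has finished cycle j − 1 and in decreasing order of i, because each D value needs the downstream D value of the same cycle. For a balanced line with all service times equal to 1, every A and D value remains 1.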

The pseudocode of the algorithm can be found in Algorithm 1. The algorithm starts by performing an initialization step. During this step, the algorithm creates a table that contains the service times for each cycle. This table is indexed during each cycle of the algorithm in order to enforce a distance between products in the line. Furthermore, during the initialization step the algorithm also fits a distribution on each A′i,0 and D′i,0. The initialization corresponds to lines 10-17 of Algorithm 1.

The function InitPeriodicSTable returns a table that contains the service times for each cycle of the algorithm. Each row in this table corresponds to a particular cycle. For a particular machine Mi, the service time during cycle j is given by Si,j. Only the service times Sx,j, with x = j mod 12 + 12n + 1 and n ≥ 0, specify real service times that correspond to those initially specified by S′. All other Si,j are 0, reducing the corresponding machines to buffering places. Note that it has to hold that x ≤ K, which is enforced by InitPeriodicSTable.
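A possible reading of this indexing rule is sketched below. It is an illustration, not the function from Appendix C: the Python name init_periodic_s_table is hypothetical, each table entry is a single number standing in for the service time characteristics, and cycles are assumed to be numbered from 0 to maxcycles.

```python
def init_periodic_s_table(s_prime: list[float], period: int, maxcycles: int):
    """Return table[j][x-1] = S_{x,j} for cycles j = 0..maxcycles."""
    K = len(s_prime)
    table = []
    for j in range(maxcycles + 1):
        row = [0.0] * K                     # default: machine acts as a buffer
        x = j % period + 1                  # 1-based index for n = 0
        while x <= K:                       # enforce x <= K
            row[x - 1] = s_prime[x - 1]     # real service time of M_x
            x += period                     # next n
        table.append(row)
    return table
```

For a line of 14 machines and a period of 12, row 0 contains real service times only for M1 and M13, row 1 only for M2 and M14, and row 12 is identical to row 0.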

Once the initialization is complete, the algorithm proceeds by iteratively determining new estimates for the stochastic variables associated with the arrival and departure servers. In doing so, the algorithm uses the dependencies as presented in Figure 4.5. These dependencies are used to determine the first and second moments of each Ai,j and Di,j. In line 21 it can be seen that each cycle starts by updating the variables associated with the first arrival server and last departure server. Because it is assumed that there is always new input available and finished products can always leave the production line, these servers do not rely on the state of other machines. Hence, the value of these variables only depends on the current service time of the corresponding machine.

For each cycle, the algorithm proceeds by fitting distributions on each Ai,j−1 and Di,j−1. The first and second moments of these variables have been inferred during the previous cycle. Lines 28-31 show that the algorithm infers new values for the first two moments of each Ai,j for 2 ≤ i ≤ N − 1. Once finished, the algorithm does the same for the first two moments of Di,j for 1 ≤ i ≤ N − 2. Note that because Di,j depends on Di+1,j, the algorithm also needs to fit a distribution on Di+1,j prior to inferring Di,j. Furthermore, if the service time that was used to update the value of Di,j is not 0, then Di processed a real product during cycle j and its value is stored in order to later calculate the delay. The array Dr is used to store these values. The entire process of inferring Di,j can be seen in lines 33-41.

At the end of each cycle, the algorithm needs to determine new estimates for the throughput and delay of the entire production line. This can be seen in lines 43-49. In Section 3.1 the throughput of subsystem i in cycle j was determined to be Ti,j = 1/E[Ci,j]. However, this no longer holds since the adapted EPT algorithm incorporates periodic service times. Because each server now only processes a product once every period, as opposed to once every cycle, the following equation characterizes the throughput for the adapted EPT algorithm:

\[ T_{i,j} = \frac{1}{\sum_{m=0}^{p-1} \mathrm{E}[C_{i,j-m}]} \tag{4.3} \]

In this equation p denotes the period of the service times. The throughput of the production line is obtained by taking the average of all Ti,j. Similarly, the delay is defined as the total time each product spends in the production line. Therefore, it is equal to the service time of each machine in the line, plus the average blocking time of those machines. Recall from equation (3.5) that Di,j exactly represents an estimate for this value. Because in the adapted EPT algorithm not each Di,j corresponds to processing a real product, only those Di,j that correspond to processing real products are taken into account while inferring a new estimate for the delay. Furthermore, as M1 does not correspond to any Di,j, the average blocking time of products at M1 still needs to be inferred. Since it is assumed that there is always new input available for this server, the sum of its service time and average blocking time is equal to C1,j.
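Equation (4.3) and the averaging over subsystems can be sketched as follows. The function names are hypothetical, and the expected cycle times E[Ci,j] are assumed to be available as a list with the most recent cycle last.

```python
def periodic_throughput(expected_cycle_times: list[float], period: int) -> float:
    """T_{i,j} for one subsystem from E[C_{i,j-m}], m = 0 .. period-1."""
    if len(expected_cycle_times) < period:
        raise ValueError("need at least one full period of cycle times")
    window = expected_cycle_times[-period:]   # the last `period` cycles
    return 1.0 / sum(window)


def line_throughput(per_subsystem: list[float]) -> float:
    """Throughput of the line as the average of all T_{i,j}."""
    return sum(per_subsystem) / len(per_subsystem)
```

If every cycle in a period of 12 has an expected length of 0.5, the sum over the period is 6 and the throughput is one product per 6 time units, instead of the value 2 that the per-cycle formula 1/E[Ci,j] would give.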

After inferring estimates for the throughput and delay for a specified number of cycles, the algorithm ends by returning the estimates obtained during the last cycle. This concludes the description of Algorithm 1. The pseudocode of the functions used throughout the algorithm can be found in Appendix C. Furthermore, the mathematical operations that are frequently used by the algorithm are explained in Appendix D and the actual Matlab code of the algorithm can be found in Appendix E.

Algorithm 1 EvaluateLine(S′, maxcycles, period)

 1: K ← length(S′)
 2: N ← K − 1
 3: A ← array[1 . . . N, 0 . . . maxcycles]
 4: D ← array[1 . . . N, 0 . . . maxcycles]
 5: Dr ← array[1 . . . N] {To keep track of last product for each D.}
 6: S ← array[1 . . . K, 0 . . . maxcycles]
 7: W ← array[1 . . . N]
 8: f ← array[1 . . . N]
 9:
10: {Step 1: Initialization.}
11: S ← InitPeriodicSTable(S′, period, maxcycles)
12: for all Si,j do
13:     Si,j ← FitDistribution(Si,j)
14: end for
15: for i = 1 to N do
16:     Ai,0 ← Si,0; Di,0 ← Si+1,0
17: end for
18:
19: for j = 1 to maxcycles do
20:     {Update first arrival server and last departure server.}
21:     A1,j−1 ← S1,j−1; DN,j−1 ← SN+1,j−1
22:     {Fit distributions on A and D.}
23:     for i = 1 to N do
24:         Ai,j−1 ← FitDistribution(Ai,j−1)
25:         Di,j−1 ← FitDistribution(Di,j−1)
26:     end for
27:
28:     {Step 2a: Infer new estimates for each A.}
29:     for i = 1 to N − 2 do
30:         Ai+1,j ← DetermineA(Si+1,j, Ai,j−1, Di,j−1, Wi+1, fi)
31:     end for
32:
33:     {Step 2b: Infer new estimates for each D.}
34:     for i = N − 2 downto 1 do
35:         Di+1,j ← FitDistribution(Di+1,j)
36:         Wi ← FitDistribution(Wi)
37:         Di,j ← DetermineD(Si+1,j, Di+1,j, Ai,j−1, Di,j−1, Wi, fi)
38:         if Si+1,j ≠ 0 then
39:             Dri ← Di,j
40:         end if
41:     end for
42:
43:     {Step 2c: Calculate new estimates.}
44:     for i = 1 to N do
45:         throughputi ← 1 / Σ_{m=0}^{period−1} max(Ai,j−m, Di,j−m)
46:     end for
47:     throughput ← Σ_{i=1}^{N} throughputi / N
48:     delay ← Σ_{i=1}^{N} Dri + (1 / max(A1,j, D1,j))
49: end for
50:
51: return (throughput, delay)


Chapter 5

Experiments

The previous chapter introduced an EPT approach that was adapted to the context of translation. It was explained that the approach can be used to infer estimates for the throughput and delay of real and synthetic genes. Now all that remains is to evaluate the approach by applying it in actual experiments. Therefore, this chapter introduces a set of generic experiments and a set of pathway experiments. The generic experiments serve to provide insight into the behaviour of the adapted EPT algorithm and the pathway experiments apply the algorithm to a series of real genes. The generic experiments are discussed in Section 5.2 and the pathway experiments in Section 5.3. However, before the experiments are discussed in those sections, Section 5.1 explains how genetic data is used to construct input for the algorithm. Finally, Section 5.4 presents suggestions for further research.

The main research question of this thesis was stated in Chapter 1. Recall that Question 1 asked whether the EPT approach can be used to analyze ribosome processivity. As we now have an EPT approach that is adapted to the context of translation, we can state the following more specific question:

Question 2. Does the adapted EPT approach accurately predict the time required to construct specific proteins?

This question leads to the main research question and is answered by performing the pathway experiments in Section 5.3. Furthermore, we are also interested in gaining insight into how the EPT approach behaves in the context of translation. This leads to the following additional questions:

Question 3. How does the adapted EPT approach behave when varying the periods of the service times?

Question 4. How does the adapted EPT approach behave when varying the length of the genes?

Question 5. How does the adapted EPT approach behave when varying the rates of the codons?

Each of these questions corresponds to a generic experiment that is discussed in Section 5.2. The generic experiments are presented first in order to provide insight into the behaviour of the algorithm in the context of translation.

5.1 Constructing input

Now we discuss how actual genetic data is used to create input for the algorithm. All real genes that are used throughout this chapter originate from the Escherichia coli K-12 MG1655 organism and are taken from the KEGG database. In order to convert a particular gene into a production line, a separate machine has to be defined for each of its codons. The service time associated with each of the machines is characterized by the rate of the corresponding codon. The rates can be found in Appendix A. Note that the rates listed in Appendix A comprise both the peptidyl transfer phase and the translocation phase for each particular codon. As a consequence, the service time has to be corrected in order to account for the constant time associated with the translocation phase. Recall that the service time of machine $i$ is characterized by the equation $S_i = Y + Z_i$, in which $Y$ denotes the stochastic offset that corresponds to the translocation phase. Consider that codon $X_i$ is associated with rate $\lambda_i$ from Appendix A. Using this information, it is possible to represent the service time for codon $X_i$ as follows:

$$E[S_i] = \frac{1}{\lambda_i} \tag{5.1}$$

$$\mathrm{Var}(S_i) = \frac{1}{\lambda_i^2} \tag{5.2}$$

$$E[Y] = 0.01326 \tag{5.3}$$

Equations 5.1 to 5.3 enable the adapted EPT algorithm to fit a distribution on each $S_i$. Once this distribution is fitted, the algorithm proceeds as discussed in Section 4.3. The genes that are used as input for the algorithm were extracted from the KEGG database by making use of a Python script. The algorithm itself is implemented in Matlab, as well as the supporting functions that are used to convert the genetic data into a production line. The Matlab code can be found in Appendix E.
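As an illustration of Equations 5.1 to 5.3, the following Python sketch turns an mRNA string into one machine per codon and computes the first two moments of each service time. It is not the Matlab code of Appendix E. The rate table `HYPOTHETICAL_RATES` contains made-up placeholder values (the real rates are listed in Appendix A), the function names are hypothetical, and the mean of $Z_i$ is obtained from $S_i = Y + Z_i$ by linearity of expectation.

```python
from typing import Dict, List, Tuple

E_Y = 0.01326  # mean of the translocation offset Y (Equation 5.3)

# Placeholder rates, NOT the values from Appendix A. They only respect
# the ordering used in Section 5.2: CUU slow, CGA medium, ACU fast.
HYPOTHETICAL_RATES: Dict[str, float] = {"CUU": 5.0, "CGA": 10.0, "ACU": 20.0}


def split_codons(mrna: str) -> List[str]:
    """Split an mRNA string into consecutive codons of three nucleotides."""
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [mrna[k:k + 3] for k in range(0, len(mrna), 3)]


def service_time_moments(mrna: str,
                         rates: Dict[str, float]) -> List[Tuple[str, float, float, float]]:
    """Return (codon, E[S_i], Var(S_i), E[Z_i]) for every machine in the line."""
    line = []
    for codon in split_codons(mrna):
        lam = rates[codon]
        mean_s = 1.0 / lam        # Equation 5.1
        var_s = 1.0 / lam ** 2    # Equation 5.2
        mean_z = mean_s - E_Y     # from S_i = Y + Z_i
        line.append((codon, mean_s, var_s, mean_z))
    return line
```

For example, `service_time_moments("CUU" * 200, HYPOTHETICAL_RATES)` describes a homogeneous production line of 200 machines of the kind used in Section 5.2.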

5.2 Generic experiments

This section presents a set of generic experiments that provide insight into the behaviour of the adapted EPT algorithm. First it is investigated how the algorithm behaves when varying the periods. The same is done for varying the length of a gene, as well as the rates of the codons comprising the gene. All the experiments in this section have been evaluated for 50 cycles. This ensures the estimates have converged upon termination of the algorithm.

Varying periods

The first generic experiment serves to investigate the behaviour of the adapted EPT algorithm when varying the periods of the service times. Note that the period is always 12 in the context of translation, but the algorithm allows for any arbitrary period greater than 0. Furthermore, a period of 1 results in an implementation that discards periodic service times. In this particular case it holds that the service time of each machine is equal for each cycle. In order to investigate the behaviour when varying periods, we applied the algorithm to three genes consisting of 200 codons. The first gene consists of 200 copies of the codon CUU, which is associated with a slow translation rate. The second gene consists of 200 copies of the codon CGA, which has a medium translation rate, and the last gene consists of 200 copies of the codon ACU, which has a fast translation rate. The genes have been chosen to be homogeneous (i.e. consisting of only one type of codon) in order to negate any influence arising from the variance in the service times of the machines comprising the production line. Figures 5.1 and 5.2 show the inferred throughput and delay for each gene.

In Figure 5.1 it can be seen that for each gene the throughput decreases when the period increases. Furthermore, it can also be seen that a faster translation rate results in a higher throughput. Both results are not unexpected. One of the consequences of increasing the period is that it takes each machine more cycles to process a product. Recall that a period of $x$ means that each machine processes a product once every $x$ cycles. As the throughput is defined to be the number of products processed per unit of time, increasing the period results in decreasing the throughput. Furthermore, having a faster processing rate for each machine means that less time is required to produce a completed product. As a result, the throughput of the entire production line increases.
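This argument can be made concrete with a rough check against Step 2c of Algorithm 1. Suppose, as an idealisation that the thesis itself does not make, that every cycle of subsystem $i$ takes the same time $c$, so that $\max(A_{i,j-m}, D_{i,j-m}) = c$ for all $m$. The throughput estimate for period $p$ then reduces to:

```latex
\mathrm{throughput}_i
  = \frac{1}{\sum_{m=0}^{p-1} \max(A_{i,j-m}, D_{i,j-m})}
  = \frac{1}{\sum_{m=0}^{p-1} c}
  = \frac{1}{p\,c}
```

Under this assumption the estimate is inversely proportional to the period, so doubling the period halves it, while a smaller cycle time $c$ (a faster codon) raises it.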

In Figure 5.2 it can be seen that for each gene the delay decreases when the period increases. Again, a faster translation rate results in a lower delay. This is in accordance with its definition, as the delay is defined to be the time required to produce a completed product. Regarding both the throughput and delay, it is now possible to answer Question 3. The experiments showed that the inferred throughput and delay both decrease when increasing the period of the service times. Furthermore, the decrease of the delay is not as pronounced as the decrease of the throughput.

Figure 5.1: Throughput for various periods. [Plot of throughput (proteins/second) against period (1 to 12) for the genes CUU, CGA and ACU.]

Figure 5.2: Delay for various periods. [Plot of delay (seconds) against period (1 to 12) for the genes CUU, CGA and ACU.]

Figures 5.3 and 5.4 show the simulation results for the throughput and delay of each gene. Looking at Figure 5.3, it can clearly be seen that increasing the period results in a decrease in throughput for each gene. However, this decrease in throughput exhibits a different shape than the results inferred by the adapted EPT algorithm. The difference between the throughput obtained through simulations and the throughput obtained by the adapted EPT algorithm is greater for slowly-translating genes. A similar observation holds for the delay. Figure 5.4 also shows that the decrease in delay exhibits a different shape than the results obtained by the adapted EPT algorithm. Furthermore, the difference between both results is greater for slowly-translating genes.

Figure 5.3: Simulated throughput for various periods. [Plot of throughput (proteins/second) against period (1 to 12) for the genes CUU, CGA and ACU.]

Figure 5.4: Simulated delay for various periods. [Plot of delay (seconds) against period (1 to 12) for the genes CUU, CGA and ACU.]

Varying length

The second generic experiment serves to investigate the behaviour of the adapted EPT algorithm when varying the length of a gene. The algorithm has been evaluated for a homogeneous gene that only contains the codon GCA. The length of the gene is varied from 50 codons to 1000 codons by using increments of 50. The periods of the service times have been set to 12. Figures 5.5 and 5.6 show the inferred throughput and delay.

Figure 5.5: Throughput for various lengths. [Plot of throughput (proteins/second) against gene length (codons).]

Figure 5.6: Delay for various lengths. [Plot of delay (seconds) against gene length (codons).]

In Figure 5.5 it can be seen that the throughput decreases when the length of the gene increases. In fact, the throughput of this particular homogeneous gene converges towards 12.5. The trend visible in the figure resembles that of exponential decay. Figure 5.6 shows that the delay increases when the length of the gene increases. This increase in delay behaves linearly with the increase in length. At this point it is possible to answer Question 4. The experiments showed that an increase in the length of a gene results in a decrease of the inferred throughput. This decrease converges slowly in a way that resembles exponential decay. The opposite is true for the inferred delay: it exhibits a linear increase with the length of the gene.

Figures 5.7 and 5.8 show the simulation results for the throughput and delay of each gene. Figure 5.7 shows that the simulated throughput does not decrease as fast as the throughput obtained by the adapted EPT algorithm. Note that for genes longer than 700 codons an anomaly can be seen. This anomaly may be caused by the fact that the simulations only evaluate a fixed number of events. As the gene gets longer, more events need to be evaluated in order to obtain reliable simulation results. Hence, the results indicate the simulations should have been evaluated for a larger number of events. Figure 5.8 shows that the simulated delay increases more slowly than the delay obtained by the adapted EPT algorithm. Again, an anomaly is observed for genes longer than 650 codons. This anomaly could possibly have been prevented by evaluating the simulations for a larger number of events.

Figure 5.7: Simulated throughput for various lengths. [Plot of throughput (proteins/second) against gene length (codons).]

Figure 5.8: Simulated delay for various lengths. [Plot of delay (seconds) against gene length (codons).]


Varying rates

The last generic experiment serves to investigate the behaviour of the adapted EPT approach when varying the rates of the codons. To this end, the algorithm has been evaluated with different rates defined for each codon. The gene that was used as input for this experiment is the eco:b1073 gene, which is part of the flagellar assembly pathway that is introduced in Section 5.3. The periods of the service times have been set to 12. The experiment was repeated multiple times, each time with a different proportion of the fastest and slowest translating codons replaced by the average of all 64 codons. Note that a proportion of 50% indicates that all codons in the gene have the same rate. Figures 5.9 and 5.10 show the inferred throughput and delay.

Figure 5.9: Throughput for various rates. [Plot of throughput (proteins/second) against the proportion of highest/lowest rates replaced (%).]

Figure 5.10: Delay for various rates. [Plot of delay (seconds) against the proportion of highest/lowest rates replaced (%).]

By increasing the proportion of the fastest and slowest translating codons replaced by the average rate, we are effectively decreasing the variance of the service times in the production line. Thus the gene becomes more homogeneous when increasing the proportion of the codons replaced. Figure 5.9 shows that an increase in the proportion of codons replaced results in an increase in the inferred throughput. Conversely, Figure 5.10 shows that an increase in the proportion of codons replaced results in a decrease of the inferred delay. This answers Question 5.
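The replacement procedure can be sketched in Python as follows. This is one possible reading of the experiment, not the thesis implementation: it assumes that the proportion applies to the codon types in the rate table (the slowest and the fastest fraction of types each get the average rate), and the names `flatten_extreme_rates` and `variance` are hypothetical.

```python
from typing import Dict, Iterable


def variance(values: Iterable[float]) -> float:
    """Population variance of a collection of numbers."""
    data = list(values)
    mean = sum(data) / len(data)
    return sum((v - mean) ** 2 for v in data) / len(data)


def flatten_extreme_rates(rates: Dict[str, float],
                          proportion: float) -> Dict[str, float]:
    """Replace the slowest and the fastest `proportion` of codon rates
    by the average of all rates (proportion between 0 and 0.5)."""
    if not 0.0 <= proportion <= 0.5:
        raise ValueError("proportion must lie between 0 and 0.5")
    mean_rate = sum(rates.values()) / len(rates)
    ordered = sorted(rates, key=rates.get)        # slowest first
    n = int(proportion * len(ordered))            # codon types per tail
    replaced = set(ordered[:n]) | set(ordered[len(ordered) - n:])
    return {codon: (mean_rate if codon in replaced else rate)
            for codon, rate in rates.items()}
```

With a proportion of 0.5 both tails together cover the whole table, so every codon ends up with the average rate, which matches the remark that 50% makes all rates equal.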

Figure 5.11: Simulated throughput for various rates. [Plot of throughput (proteins/second) against the proportion of highest/lowest rates replaced (%).]

Figure 5.12: Simulated delay for various rates. [Plot of delay (seconds) against the proportion of highest/lowest rates replaced (%).]

Figures 5.11 and 5.12 show the simulation results for the throughput and delay of each gene. Both figures show the same pattern as the results obtained by the adapted EPT algorithm. However, the simulated delay is slightly lower than the delay inferred by the algorithm.


5.3 Pathway experiments

This section presents three genetic pathways that are used as case studies to test the adapted EPT algorithm. Each pathway is introduced by means of a short description and the results are presented at the end of this section.

Flagellar assembly

The flagellar assembly pathway pertains to the formation of a prokaryotic flagellum. A flagellum is a tail-like structure that protrudes from the prokaryotic cell and enables the organism to move through its surroundings. The Escherichia coli bacterium has multiple flagella that allow it to perform a biased random walk. The bacterium can influence the random walk through its metabolic system, which consists of genetic pathways that sense the food sources available in its surroundings. By switching between different types of food sources and a state of starvation, the metabolic system influences the speed of the rotary engines that drive the flagella.

A flagellum itself can be divided into three parts: the filament, the basal body and the hook. The filament is constructed out of the protein flagellin, which assembles into a hollow tube extending from the membrane of the organism. Tens of thousands of proteins are required to construct the filament. Once constructed, the filament assumes a supercoiled form that allows it to be used as a helical propeller. The filament is attached to the hook, which is a short highly-curved tube that consists of around 120 copies of the protein FlgE. This hook is attached to the basal body that is embedded into the membrane of the organism. The flexibility of the hook allows it to transmit torque from the basal body to the filament. The basal body is the actual rotary engine that drives the flagellum. It is a complex structure that consists of 20 different proteins. As the basal body is embedded into the membrane and connected to the hook, the organism can influence the rotation speed of the flagellum from within the cell [4].

We used the flagellar assembly pathway as indicated in the KEGG database as a case study for our adapted EPT algorithm. The pathway involves 36 genes in total.¹ Note that this pathway primarily regulates the construction of an organelle. Therefore, it does not constitute an ongoing cellular process such as the glycolysis pathway or the SOS response pathway.

Glycolysis

The glycolysis pathway is a metabolic pathway that converts glucose into pyruvate. It is a central pathway in nearly all aerobic and anaerobic organisms. During the process of glycolysis, the sugar molecule glucose is converted into pyruvate and small amounts of energy in the form of ATP and NADH are released. It consists of 10 definite reactions that involve 10 intermediate compounds. The reactions can be divided into two phases. The first phase consists of five reactions that consume energy in order to convert glucose into two three-carbon sugar phosphates. The second phase consists of the other five reactions. As opposed to the first phase, these reactions result in a net gain of energy in the form of ATP and NADH. All reactions in the glycolysis pathway are regulated by activating or inhibiting the enzymes involved, which is done in response to conditions inside and outside the cell.

Again, we used the glycolysis pathway as indicated in the KEGG database as a case study for our adapted EPT algorithm. The pathway involves 39 genes and constitutes an ongoing cellular process.²

SOS response

The SOS response pathway comprises a response to DNA damage inside the cell. This damage can be caused by both normal metabolic activities and environmental factors. Whenever DNA damage occurs, it is followed by an accumulation of single stranded DNA. This single stranded DNA results in a high expression of the RecA protein, which triggers the subsequent SOS response. One of the first steps in this response is activation of the Nucleotide Excision Repair (NER) pathway. The NER pathway consists of two subpathways: Global Genomic Repair (GGR) and Transcription Coupled Repair (TCR). The difference between GGR and TCR consists of their abilities to recognize different kinds of damage to the structure of the DNA.

¹ http://www.genome.jp/dbget-bin/www_bget?pathway:eco02040
² http://www.genome.jp/dbget-bin/www_bget?pathway:eco00010

The recognition of lesions in GGR is a dynamic two-stage process in Escherichia coli bacteria. The process occurs throughout the entire genome and scans both transcribed and untranscribed genes. The initial recognition step of GGR consists of sensing a distortion in the helical shape of the DNA. This step is performed by the UvrA protein. The second step is a verification step where the protein UvrB opens the strand and determines whether a nucleotide has been altered and which strand is affected. Once this verification step has taken place, GGR proceeds by generating incisions in the damaged strand. The strand is incised such that the altered nucleotide can be removed, after which GGR utilizes the complementary strand as a template in order to insert the proper nucleotide. After the repair has taken place, the strands are re-ligated and the process is complete. TCR differs from GGR in the sense that it only occurs during transcription. It is not controlled through the SOS response, but initiated whenever transcription is arrested due to DNA damage. Once initiated, the protein Mfd removes the blocked RNA polymerase from the strand of DNA and recruits UvrA in order to proceed with the same steps as in GGR [7].

We used the GGR subpathway as indicated in the KEGG database as a case study for our adapted EPT algorithm. This pathway involves 8 genes.³

Results

This section presents the results of applying the adapted EPT algorithm to the case studies mentioned above. As the case studies consist of actual genetic data, the results obtained provide an indication of the effectiveness of the adapted EPT algorithm in the context of translation. In order to obtain accurate estimates for the throughput and delay of each gene, the algorithm has been run for exactly 50 cycles. This ensures that even for large genes the estimates have converged upon termination of the algorithm. As our goal is to construct a model of translation, we are primarily interested in the total delay inferred by the algorithm. This delay corresponds to the time required to create a single protein. As we are also interested in investigating the effectiveness of the algorithm, we used discrete event simulations in order to establish reference values for both the throughput and delay of each gene. These reference values have been averaged over 100 simulation runs consisting of 10 000 events each. Note that this approach only yields point estimates for the reference values. However, for our purposes these estimates are considered sufficient, as it is infeasible to obtain reliable confidence intervals. Inferring a 95% CI for the throughput of a gene exceeding 800 codons requires millions of simulation runs if the width of the interval is to be smaller than 1% of the particular estimate.
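The order of magnitude of such a requirement can be illustrated with the standard normal-approximation confidence interval. The thesis does not state how its figure was derived, so the following is only a sketch under assumptions: $n$ independent runs with sample mean $\bar{x}$ and standard deviation $\sigma$, and "width" read as the full width $w$ of the interval.

```latex
\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
w = \frac{2 \cdot 1.96\,\sigma}{\sqrt{n}} \le 0.01\,\bar{x}
\quad\Longleftrightarrow\quad
n \ge \left(\frac{392\,\sigma}{\bar{x}}\right)^{2}
```

The required number of runs therefore grows with the square of the coefficient of variation $\sigma/\bar{x}$. For instance, a hypothetical value of $\sigma/\bar{x} = 3$ would already give $n \ge 1176^2 = 1\,382\,976$ runs.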

The results obtained for each pathway can be found in Appendix B. The remainder of this section is mainly concerned with the effectiveness of the algorithm. In order to investigate this effectiveness, Figures 5.13 and 5.14 show how the inferred throughput and delay relate to the reference values obtained through the discrete event simulations. The horizontal axis shows the length of each gene in codons and the vertical axis shows the ratio of the value obtained by the algorithm over the reference value obtained through the simulations. A value of 1 means that the value inferred by the algorithm corresponds exactly to the value obtained by the simulations.

In Figure 5.13 it can be seen that the algorithm greatly overestimates the throughput of each gene. The inferred throughput stays below 1.5 times the reference value for genes smaller than 300 codons. For genes over 300 codons, the overestimation increases greatly and even surpasses 12 times the reference value for genes longer than 700 codons. However, for genes longer than 800 codons the upward trend does not continue and the throughput does not seem to relate to the length of the gene any longer. Figure 5.14 shows the algorithm also overestimates the delay of each gene. The overestimation of the delay increases linearly with the gene size. However, the difference does never exceed 2.5 times the reference value. Again, it can also be seen that the linear trend does not continue for genes longer than 800 codons.

³ http://www.genome.jp/dbget-bin/www_bget?pathway:eco03420

Figure 5.13: Scatter plot showing the difference in throughput for the pathways. [Ratio of inferred throughput to simulated throughput against gene length (codons).]

Figure 5.14: Scatter plot showing the difference in delay for the pathways. [Ratio of inferred delay to simulated delay against gene length (codons).]

Looking at Figures 5.13 and 5.14, we can conclude that the adapted EPT algorithm is only successful in predicting the throughput and delay for genes smaller than 300 codons. This is unfortunate, as the majority of actual genes is longer than 300 codons. Regarding Question 2 this could lead to the conclusion that the adapted EPT algorithm is unable to accurately predict the time required to create specific proteins. However, the figures also show that there is a clear trend in the overestimation exhibited by the algorithm. The overestimation of the delay is linear and the overestimation of the throughput seems exponential. In fact, the overestimation of the throughput can be closely described by a fourth order polynomial. This polynomial can be found when the outliers are discarded and a trendline is fitted to the overestimation in throughput. Discarding genes longer than 800 codons, as these do not adhere to the visible trends, the following equations describe the overestimation of the throughput and delay:

$$O_t(x) = 9 \times 10^{-11} x^4 - 5 \times 10^{-8} x^3 + 9 \times 10^{-6} x^2 + 0.0017x + 0.7562 \tag{5.4}$$

$$O_d(x) = 0.0015x + 1.249 \tag{5.5}$$


In both equations the variable $x$ denotes the length of the gene in codons. Now that the amount of overestimation can be predicted, it is possible to divide the throughput inferred for a gene of length $x$ by $O_t(x)$ and similarly, the delay for a gene of length $x$ by $O_d(x)$. Doing so yields a new value that closely resembles the reference value. Figures 5.15 and 5.16 show these corrected values. Note that genes longer than 800 codons have been discarded.
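The correction step can be written down directly from Equations 5.4 and 5.5. The Python sketch below is an illustration only; the function names are hypothetical, and the check on the gene length simply reflects that the trendlines were fitted after discarding genes longer than 800 codons.

```python
from typing import Tuple


def overestimation_throughput(x: float) -> float:
    """O_t(x) from Equation 5.4; x is the gene length in codons."""
    return 9e-11 * x**4 - 5e-8 * x**3 + 9e-6 * x**2 + 0.0017 * x + 0.7562


def overestimation_delay(x: float) -> float:
    """O_d(x) from Equation 5.5; x is the gene length in codons."""
    return 0.0015 * x + 1.249


def correct_estimates(throughput: float, delay: float,
                      length: int) -> Tuple[float, float]:
    """Divide the inferred throughput and delay by the predicted
    overestimation for a gene of the given length."""
    if not 0 < length <= 800:
        raise ValueError("trendlines were only fitted for genes up to 800 codons")
    return (throughput / overestimation_throughput(length),
            delay / overestimation_delay(length))
```

For a gene of 400 codons, for example, the inferred delay is divided by $O_d(400) = 1.849$ and the inferred throughput by $O_t(400) \approx 1.98$.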

Figure 5.15: Scatter plot showing the corrected difference in throughput for the pathways. [Ratio of corrected throughput to simulated throughput against gene length (codons).]

Figure 5.16: Scatter plot showing the corrected difference in delay for the pathways. [Ratio of corrected delay to simulated delay against gene length (codons).]

Looking at Figures 5.15 and 5.16, it can be seen that the corrected values closely resemble the reference values. In fact, the throughput stays between 91% and 123% of the reference value. For the delay this is between 89% and 127%. Regarding Question 2, this leads to the conclusion that the adapted EPT algorithm can accurately predict the time required to construct a specific protein when the error is corrected by Equation 5.5 and the gene is shorter than 800 codons. Note that since Equations 5.4 and 5.5 are derived from the results of our case studies, care should be taken when extending this conclusion to other genes. However, the fact that all three pathways exhibit the same trend in the error of estimates inferred by the algorithm contributes to the confidence in this conclusion.


5.4 Further research

The previous section discussed the results of applying the adapted EPT approach to genetic data. It was seen that the algorithm produced good estimates for the throughput and delay of short genes, but for long genes the error increased progressively. Therefore, this section introduces several adaptations that could improve the estimates returned by the algorithm.

Grouping codons

Grouping codons might improve the estimates returned by the adapted EPT approach. It is a relatively minor adaptation and only requires changes to the input of the algorithm. The current approach considers each codon in the mRNA as a separate machine in the production line. By grouping codons and representing them by only one machine, the length of the production line can be considerably reduced. As Section 5.3 demonstrated that the adapted EPT algorithm yields a lower error for short production lines, it can be expected that this adaptation yields better estimates.

Ribosomes physically occupy 12 codons. Hence, it is straightforward to use groups of 12 codons for a single machine. The service time of this machine can then simply be assumed to be exponentially distributed. This way, the sum of the average insertion times of the codons can be used to determine the service time of the machine. This is similar to what has been done for each codon in our adapted EPT approach. Recall that the average insertion times can be found in Appendix A. However, it remains to be investigated what impact grouping any number of codons has on the estimates returned by the EPT approach. Furthermore, the estimates could also be improved by only grouping neighboring codons with a rate difference below a certain threshold.
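A minimal Python sketch of this grouping is given below. It assumes that the average insertion time of every codon is already available as a list of numbers, and the name `group_codons` is hypothetical. Because the grouped service time is assumed to be exponentially distributed, its variance is taken to be the square of its mean.

```python
from typing import List, Sequence, Tuple


def group_codons(mean_times: Sequence[float],
                 group_size: int = 12) -> List[Tuple[float, float]]:
    """Merge consecutive codons into machines of `group_size` codons.

    Returns one (mean, variance) pair per machine. The mean is the sum
    of the average insertion times in the group; the variance follows
    from the exponential assumption (variance = mean squared). The last
    group may contain fewer codons than `group_size`.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    machines = []
    for start in range(0, len(mean_times), group_size):
        mean = sum(mean_times[start:start + group_size])
        machines.append((mean, mean ** 2))
    return machines
```

A gene of 300 codons would be reduced from 300 machines to 25 machines in this way.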

Instead of considering multiple codons as a single machine, the mRNA could also be divided into smaller sub-mRNAs. By dividing the mRNA into smaller sub-mRNAs, it becomes possible to run the algorithm for each sub-mRNA separately. The resulting estimates can then be aggregated in order to obtain an estimate for the entire mRNA. The throughput can simply be averaged and the delay can be summed. Again, it remains to be investigated what error this approach induces on the estimates.
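The aggregation described here amounts to a few lines of Python. The sketch assumes that the algorithm has already been run on every sub-mRNA and has returned a (throughput, delay) pair for each; the function name `aggregate_sub_mrnas` is hypothetical.

```python
from typing import Sequence, Tuple


def aggregate_sub_mrnas(estimates: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine per-sub-mRNA (throughput, delay) estimates.

    The throughput of the whole mRNA is the average of the sub-mRNA
    throughputs; the delay is the sum of the sub-mRNA delays.
    """
    if not estimates:
        raise ValueError("at least one sub-mRNA estimate is required")
    throughputs = [t for t, _ in estimates]
    delays = [d for _, d in estimates]
    return sum(throughputs) / len(throughputs), sum(delays)
```

Whether a plain average is appropriate when the sub-mRNAs differ in length is one of the questions that would have to be investigated.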

Adding symmetry

Section 3.4 introduced the dependencies of $D_{i,j}$. It was observed that the algorithm could condition twice while inferring a new estimate for $D_{i,j}$: once on $f_{i,j-1}$ and once on $p_{i,j}$. However, the algorithm only conditions once on $f_{i,j}$ while inferring the value of $A_{i,j}$. This makes the EPT approach asymmetric. If it is possible to add more symmetry to the approach, i.e. also conditioning twice while inferring a new value for $A_{i,j}$, it might be possible to improve the inferred estimates. Therefore, this section considers the dependencies of $A_{i,j}$ as given in Section 3.3 and investigates one possible approach to add symmetry to the adapted EPT approach.

Equation 3.4 states that $A_{i,j}$ consists of a cycle-dependent waiting time and the service time of $M_i$. However, the waiting time $W_{i,j}$ reduces to 0 whenever it holds that $A_{i-1,j-1} \le D_{i-1,j-1}$. When it holds that $A_{i-1,j-1} > D_{i-1,j-1}$, the result is that $W_{i,j} = A_{i-1,j-1} - D_{i-1,j-1}$. This can also be seen in Figure 5.17. Furthermore, note that the case distinction on the values of $A_{i-1,j-1}$ and $D_{i-1,j-1}$ gave rise to the conditioning on $f_{i-1,j-1}$ in Figure 3.9.

Figure 5.17 conveys more information than was used to infer the dependencies of $A_{i,j}$. Due to properties 3.1 and 3.2, $A_{i,j}$ and $D_{i,j}$ start upon completion of $D_{i-1,j-1}$. However, it can also be seen that $A_{i-1,j}$ and $D_{i-1,j}$ start at the end of $W_{i,j}$. This has consequences for determining the value of $A_{i,j+1}$, as the length of $W_{i,j+1}$ depends on whether $A_{i-1,j}$ has finished whenever $A_{i,j+1}$ starts. This leads to the following case distinction:

1. $A_{i,j} - (A_{i-1,j-1} - D_{i-1,j-1}) > A_{i-1,j} \Rightarrow A_{i,j+1} = S_i$

2. $A_{i,j} - (A_{i-1,j-1} - D_{i-1,j-1}) \le A_{i-1,j} \Rightarrow A_{i,j+1} = S_i + W_{i,j+1}$

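Figure 5.17: Correspondence between $A_{i,j}$ and $A_{i-1,j-1}$ when $A_{i-1,j-1} > D_{i-1,j-1}$. [Timeline diagram showing $A_{i-1,j-1}$, $D_{i-1,j-1}$, $A_{i,j}$ (split into $W_{i,j}$ and $S_i$), $A_{i-1,j}$ and $D_{i-1,j}$ along a time axis.]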

In the first case it holds that $A_{i-1,j}$ has already finished upon the start of $A_{i,j+1}$. As a consequence, $W_{i,j+1}$ reduces to 0. In the second case this does not hold and the representation of $W_{i,j+1}$ stays as before. Note that $A_{i,j} - (A_{i-1,j-1} - D_{i-1,j-1}) = S_i$. Therefore, when the equations above are rewritten and we substitute $j - 1$ for $j$, the following dependencies can be observed:

1. $A_{i-1,j-1} < S_i \Rightarrow A_{i,j} = S_i$

2. $A_{i-1,j-1} \ge S_i \Rightarrow A_{i,j} = S_i + W_{i,j}$

Figure 5.18 shows the dependencies of $A_{i,j}$, updated with the additional dependencies identified above. It can be expected that the EPT approach returns better estimates when these additional dependencies are added to the approach. However, further experiments need to be conducted in order to investigate this claim.

Figure 5.18: Symmetrical dependencies of $A_{i,j}$. [Dependency tree for $A_{i,j}$ with outcomes $S_i$ and $S_i + (A_{i-1,j-1} - D_{i-1,j-1})$, conditioned on the probabilities $f_{i-1,j-1} = P[A_{i-1,j-1} \le D_{i-1,j-1}]$ and $k_{i,j} = P[A_{i-1,j-1} < S_i]$.]
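The two case distinctions can be combined in a small Python sketch that works on single realised values. This is only an illustration of the logic: the EPT approach itself operates on fitted distributions and the probabilities $f_{i-1,j-1}$ and $k_{i,j}$, and the function name `next_arrival` is hypothetical.

```python
def next_arrival(s_i: float, a_prev: float, d_prev: float) -> float:
    """Realised value of A_{i,j} given S_i, A_{i-1,j-1} and D_{i-1,j-1}.

    Additional dependency: if A_{i-1,j-1} < S_i there is no waiting time.
    Otherwise the waiting time W_{i,j} is 0 when A_{i-1,j-1} <= D_{i-1,j-1}
    and A_{i-1,j-1} - D_{i-1,j-1} when A_{i-1,j-1} > D_{i-1,j-1}.
    """
    if a_prev < s_i:
        return s_i
    waiting = a_prev - d_prev if a_prev > d_prev else 0.0
    return s_i + waiting
```

Only the last branch, in which both $A_{i-1,j-1} \ge S_i$ and $A_{i-1,j-1} > D_{i-1,j-1}$ hold, produces a value larger than $S_i$.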

Incorporating conservation of flow

In his thesis, Frenken used the concept of conservation of flow in order to force the throughput of all subsystems to converge to the same value. This conservation of flow accounts for the errors induced by the EPT approach. These errors arise from considering only the first two moments of the approach and taking into account a limited number of dependencies between the machines. In [6] he shows that the equation that corresponds to the conservation of flow principle can be used to infer the first moment of $A_{i,j}$. However, the dependencies as presented in Figure 3.17 are still used to infer the second moment.

The conservation of flow principle states that the throughput of all subsystems converges to the same value. This is inherently true for logistic processes as each subsystem encounters the same number of products during a certain period. More specifically, for the initial EPT method the conservation of flow principle states that:

$$E[C_{i-1,j}] = E[C_{i,j}] \tag{5.6}$$

Before we can use this equality to infer the first moment of $A_{i,j}$, it is necessary to introduce $\pi_{i,j}$: the fraction of time that $A_i$ is processing during cycle $j$. This fraction is equal to:

$$\pi_{i,j} = \frac{E[A_{i,j}]}{E[C_{i,j}]} \tag{5.7}$$

Now $E[C_{i,j}]$ can be expressed in terms of $A_{i,j}$ and $\pi_{i,j}$. Rewriting Equation 5.6 using Equation 5.7 yields the following:

$$E[A_{i,j}] = \pi_{i,j}\, E[C_{i-1,j}] \tag{5.8}$$

Note that in Equation 5.8 the fraction $\pi_{i,j}$ appears in the right-hand side. This is unfortunate, as according to Equation 5.7 it also depends on the value of $A_{i,j}$. In [6] this problem is solved by using the estimates inferred during the previous cycle. Thus, the equation that is used to infer the first moment of $A_{i,j}$ becomes:

$$E[A_{i,j}] = \frac{E[A_{i,j-1}]}{E[C_{i,j-1}]}\, E[C_{i-1,j}] \tag{5.9}$$
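Equation 5.9 translates into a one-line update. The Python sketch below is an illustration with a hypothetical function name; it takes the three expectations as plain numbers and does not show how they are obtained from the fitted distributions.

```python
def flow_conserving_mean_arrival(mean_a_prev: float,
                                 mean_c_prev: float,
                                 mean_c_upstream: float) -> float:
    """First moment of A_{i,j} according to Equation 5.9.

    mean_a_prev     -- E[A_{i,j-1}]
    mean_c_prev     -- E[C_{i,j-1}]
    mean_c_upstream -- E[C_{i-1,j}]
    """
    if mean_c_prev <= 0:
        raise ValueError("E[C_{i,j-1}] must be positive")
    pi_estimate = mean_a_prev / mean_c_prev   # pi_{i,j} from the previous cycle
    return pi_estimate * mean_c_upstream
```

Using the previous cycle for the fraction removes the circular dependency between $\pi_{i,j}$ and $A_{i,j}$ noted above.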

The conservation of flow principle can also be extended to the adapted EPT approach. Inorder to do so, the principle has to account for the use of periodic service times. In the adaptedEPT approach it still holds that the throughput of each subsystem must converge to the samevalue. However, as each subsystem only processes a product once every p cycles, the result is thatEquation 5.6 does no longer hold. Correcting this equation to account for period p yields:

$$\sum_{k=0}^{p-1} E[C_{i-1,j-k}] = \sum_{k=0}^{p-1} E[C_{i,j-k}] \tag{5.10}$$

Again, this equation can be rewritten to a similar representation as Equation 5.9. This yields:

$$E[A_{i,j}] = \pi_{i,j} \sum_{k=0}^{p-1} E[C_{i-1,j-k}] \tag{5.11}$$

Equation 5.11 suffers from the problem that the fraction $\pi_{i,j}$ can no longer be estimated by the values inferred during the previous cycle. As the service times are periodic, the entire period has to be taken into account in estimating $\pi_{i,j}$. Therefore, Equation 5.12 can be used to estimate $\pi_{i,j}$ in Equation 5.11.

$$\pi_{i,j} = \frac{E[A_{i,j-p}]}{\sum_{k=0}^{p-1} E[C_{i,j-k}]} \tag{5.12}$$
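As a minimal sketch, Equations 5.11 and 5.12 can be combined into one helper. The function name and the list-based interface are assumptions made for illustration; the sketch simply evaluates the two formulas as written, over one period of $p$ cycles.

```python
def periodic_first_moment(mean_a_lagged, cycle_means_own, cycle_means_upstream):
    """Sketch of Equations 5.11 and 5.12 (hypothetical helper).

    mean_a_lagged        -- E[A_{i,j-p}], the estimate from one period ago
    cycle_means_own      -- [E[C_{i,j-k}] for k = 0..p-1]
    cycle_means_upstream -- [E[C_{i-1,j-k}] for k = 0..p-1]
    """
    if len(cycle_means_own) != len(cycle_means_upstream):
        raise ValueError("both lists must cover the same period p")
    # Equation 5.12: fraction of the period that A_i spends processing.
    pi = mean_a_lagged / sum(cycle_means_own)
    # Equation 5.11: apply that fraction to the upstream cycle times.
    return pi * sum(cycle_means_upstream)


estimate = periodic_first_moment(3.0, [2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```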


Chapter 6

Conclusions

In this thesis we have used the EPT approach to analyze ribosome processivity. Ribosome processivity is part of the biological process of translation, as was introduced in Chapter 2. It was explained that ribosomes construct proteins based on a strand of mRNA. Furthermore, it was seen that there are numerous aspects, such as ribosome crowding, that can influence the time required to create proteins. Therefore, predicting the time required to create a specific protein is non-trivial and requires an elaborate model of the biological process. In Chapter 4 it was seen that the model checker Prism could not operate on such a detailed model due to a state space explosion. Reducing the number of states in the model no longer resulted in accurate predictions. This was caused by an accumulation of errors that resulted from a loss of detail in the underlying Prism model. The fact that the Prism model only accounts for the first moment of the time required to process a single codon contributes to this accumulation of errors.

A better approach was needed in order to overcome the aforementioned problems. Therefore, our research focussed primarily on the EPT approach. The EPT approach can be used to analyze logistic processes and obtain estimates for their throughput and delay. It was introduced in Chapter 3. However, the EPT approach could not be used directly in the context of translation and several adaptations were required. These adaptations were introduced in Chapter 4 and consisted of adding offsets to the service times and incorporating periodic service times. Adding offsets to the approach was already done in previous work [6]. Incorporating the adaptations into the approach yielded a new algorithm, which was applied to a series of experiments in Chapter 5. The first set of experiments provided insight into the behaviour of the adapted EPT approach. This yielded the following conclusions:

• The estimates of the throughput and delay produced by the algorithm both decrease when the periods of the service times are increased. This decrease is greater for the throughput than for the delay.

• The estimate of the throughput produced by the algorithm decreases when the length of the gene is increased. The decrease resembles exponential decay and converges for long genes.

• The estimate of the delay produced by the algorithm increases when the length of the gene is increased. This increase is linear with respect to gene size.

• Decreasing the variance of the translation rates per codon in a gene results in an increase in estimated throughput and a decrease in estimated delay.

The second set of experiments presented in Chapter 5 made use of genes originating from three different pathways. The estimates returned by the algorithm were compared with reference values obtained through discrete event simulations. This yielded the following conclusions:

• For genes shorter than 300 codons, the estimated throughput stayed below 1.5 times the reference value.

• For genes longer than 300 codons and shorter than 800 codons, the error in the estimated throughput increases systematically with increasing gene size.

• For genes longer than 800 codons, the error in the estimated throughput does not seem to relate to gene length.

• For genes shorter than 800 codons, the error in the estimated delay increases linearly with gene size. However, the estimated delay stays below 2.5 times the reference value.

• For genes longer than 800 codons, the error in the estimated delay does not seem to relate to gene length.

The conclusions presented above do not indicate that EPT can successfully be used to analyze ribosome processivity. However, for genes shorter than 800 codons the error with respect to the reference value is systematic. Therefore, Equations 5.4 and 5.5 can be used to correct for the errors in the estimated throughput and delay. This yielded the following conclusions:

• For genes shorter than 800 codons, the corrected estimated throughput stays between 91% and 123% of the reference value.

• For genes shorter than 800 codons, the corrected estimated delay stays between 89% and 127% of the reference value.

The corrected estimates are certainly more accurate than the estimates returned directly by the algorithm. The range as indicated above is acceptable, especially since the estimates are obtained much faster than reference values obtained by simulation. However, as the error is no longer systematic for genes longer than 800 codons, the approach is unsuccessful in producing estimates for genes exceeding 800 codons. Furthermore, the data was obtained by evaluating genes from three different pathways. Care needs to be taken when extrapolating these results to other pathways. At this point it is possible to answer the main research question of this thesis, which was stated in Chapter 1. For the sake of completeness, it is repeated below.

Question 1. Can the EPT approach be used to analyze ribosome processivity?

This question can be answered positively, under the condition that the adaptations that are provided in Chapter 4 are incorporated into the approach. However, only genes shorter than 800 codons can be analyzed successfully when the estimates are corrected according to Equations 5.4 and 5.5. These equations do not hold for genes longer than 800 codons. As a result, the EPT approach does not yield accurate results for long genes.


Bibliography

[1] D. Bosnacki, H.M.M. ten Eikelder, M.N. Steijaert, and E.P. de Vink. Stochastic analysis of amino acid substitution in protein synthesis. In M. Heiner and A.M. Uhrmacher, editors, Proc. CMSB 2008, pages 367–386. LNCS 5307, 2008.

[2] D. Bosnacki, T.E. Pronk, and E.P. de Vink. In silico modelling and analysis of ribosome kinetics and aa-tRNA competition. Transactions on Computational Systems Biology, IX:69–89, 2009. CompMod 2008 special issue, R.-J. Back and I. Petre, guest editors.

[3] B. Alberts et al. Molecular Biology of the Cell. Garland Science, fifth edition, 2007.

[4] F.A. Samatey et al. Structure of the bacterial flagellar hook and implication for the molecular universal joint mechanism. Nature, 431:1062–1068, 2004.

[5] A. Fluitt, E. Pienaar, and H. Viljoen. Ribosome kinetics and aa-tRNA competition determine rate and fidelity of peptide synthesis. Computational Biology and Chemistry, 31:335–346, 2007.

[6] P.W. Frenken. Decomposition algorithms with EPT based input: A case study in the automobile industry. Master’s thesis, TU/e, 2007.

[7] P.C. Hanawalt. Subpathways of nucleotide excision repair and their regulation. Oncogene, 21:8949–8956, 2002.

[8] M. van Vuuren and I.J.B.F. Adan. Performance analysis of tandem queues with small buffers. IIE Transactions, 41(10):882–892, 2009.

[9] H. Zouridis and V. Hatzimanikatis. A model for protein translation: polysome self-organization leads to maximum protein synthesis rates. Biophysical Journal, 92:717–730, 2007.


Appendix A

Codon rates

Table A.1: Average insertion times and rates for each codon.

Codon Avg. Insertion Time Rate

UUU  0.3327  3.005711
UUC  0.8404  1.18991
UUG  0.1245  8.032129
UUA  0.4436  2.254283
UCU  0.0893  11.19821
UCC  0.7409  1.34971
UCG  0.3035  3.294893
UCA  0.2313  4.32339
UGU  0.1432  6.98324
UGC  0.3296  3.033981
UGG  0.436   2.293578
UGA  0.1098  9.107468
UAU  0.0758  13.19261
UAC  0.2008  4.98008
UAG  0.4319  2.315351
UAA  0.0963  10.38422
CUU  0.8901  1.123469
CUC  0.6286  1.590837
CUG  0.1028  9.727626
CUA  0.9217  1.084952
CCU  0.4202  2.379819
CCC  0.1992  5.02008
CCG  0.4257  2.349072
CCA  0.5535  1.806685
CGU  0.0645  15.50388
CGC  0.101   9.90099
CGG  1.3993  0.714643
CGA  0.0962  10.39501
CAU  0.8811  1.134945
CAC  0.5341  1.872309
CAG  0.7425  1.346801
CAA  0.4058  2.464268
GUU  0.0527  18.97533
GUC  0.767   1.303781


GUG  0.1041  9.606148
GUA  0.2604  3.840246
GCU  0.0756  13.22751
GCC  1.5622  0.640123
GCG  0.101   9.90099
GCA  0.3002  3.331113
GGU  0.0924  10.82251
GGC  0.1673  5.977286
GGG  0.2308  4.332756
GGA  1.2989  0.769882
GAU  0.218   4.587156
GAC  0.4144  2.413127
GAG  0.1106  9.041591
GAA  0.2243  4.458315
AUU  0.2733  3.658983
AUC  0.4373  2.28676
AUG  0.8115  1.232286
AUA  0.4321  2.314279
ACU  0.0943  10.60445
ACC  0.4658  2.146844
ACG  0.4073  2.455193
ACA  0.5025  1.99005
AGU  0.1636  6.112469
AGC  0.3905  2.560819
AGG  1.4924  0.670062
AGA  0.5517  1.812579
AAU  0.2242  4.460303
AAC  0.4959  2.016536
AAG  0.3339  2.994909
AAA  0.1945  5.141388

The rates in Table A.1 have been obtained from [1].
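The listed rates appear to be the reciprocals of the average insertion times. The short Python check below illustrates this for the first three rows of Table A.1; the dictionary names and the helper are hypothetical, and the relation itself is an observation about the tabulated numbers rather than a statement taken from [1].

```python
# First three rows of Table A.1: codon -> (average insertion time, listed rate).
TABLE_A1_SAMPLE = {
    "UUU": (0.3327, 3.005711),
    "UUC": (0.8404, 1.18991),
    "UUG": (0.1245, 8.032129),
}


def rate_from_time(insertion_time):
    """Rate as the reciprocal of the average insertion time (assumed relation)."""
    return 1.0 / insertion_time


# Largest deviation between the recomputed and the listed rates.
max_deviation = max(
    abs(rate_from_time(time) - rate) for time, rate in TABLE_A1_SAMPLE.values()
)
```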


Appendix B

Pathway experiment results

Columns: KEGG gene identifier, nr. of codons, EPT throughput, EPT delay, simulated throughput, simulated delay, throughput difference, delay difference.

b1070  139  0.10  112.12  0.10   70.94   1.02  1.58
b1071   98  0.11   69.92  0.11   50.51   1.02  1.38
b1072  220  0.09  201.58  0.07  145.56   1.19  1.38
b1073  139  0.09  124.46  0.09   79.92   1.05  1.56
b1074  135  0.10  112.25  0.09   72.34   1.04  1.55
b1075  232  0.10  195.50  0.08  124.02   1.23  1.58
b1076  403  0.09  355.79  0.05  187.07   2.03  1.90
b1077  252  0.10  207.66  0.07  121.83   1.35  1.70
b1078  261  0.10  216.48  0.07  122.54   1.42  1.77
b1079  233  0.10  195.36  0.08  128.38   1.33  1.52
b1080  366  0.09  335.92  0.05  194.73   1.77  1.73
b1082  548  0.09  480.69  0.02  232.11   4.23  2.07
b1083  318  0.09  270.91  0.06  158.92   1.54  1.70
b1879  693  0.09  638.59  0.01  277.86  11.09  2.30
b1880  383  0.09  340.89  0.05  194.20   1.95  1.76
b1889  309  0.09  291.47  0.06  176.39   1.56  1.65
b1890  296  0.10  253.95  0.07  153.90   1.49  1.65
b1891  193  0.09  173.25  0.08  123.56   1.08  1.40
b1892  117  0.11   88.52  0.11   63.15   0.98  1.40
b1923  499  0.10  384.57  0.03  191.23   3.13  2.01
b1924  469  0.09  436.68  0.03  233.17   2.72  1.87
b1925  137  0.09  127.60  0.08   80.78   1.07  1.58
b1926  122  0.10   98.88  0.10   61.38   1.10  1.61
b1937  105  0.11   81.07  0.10   58.44   1.04  1.39
b1938  553  0.10  468.08  0.02  229.09   4.40  2.04
b1939  332  0.09  302.36  0.05  186.71   1.67  1.62
b1940  229  0.09  208.03  0.07  137.62   1.18  1.51
b1941  458  0.10  392.60  0.03  203.35   2.76  1.93
b1942  148  0.08  140.77  0.08  121.90   1.07  1.15
b1943  376  0.09  339.50  0.05  194.04   1.94  1.75
b1945  335  0.09  288.19  0.06  171.56   1.61  1.68
b1946  138  0.11  105.07  0.10   70.20   1.04  1.50
b1947  122  0.10   99.49  0.09   67.07   1.11  1.48
b1948  246  0.10  206.00  0.08  129.76   1.30  1.59
b1949   90  0.09   83.23  0.09   61.70   1.00  1.35
b1950  262  0.09  230.83  0.07  137.66   1.37  1.68

Table B.1: Flagellar assembly pathway results.
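The "difference" columns appear to be ratios of the EPT estimate to the simulated reference value rather than arithmetic differences. The Python sketch below recomputes the delay ratio for a few rows of Table B.1 as an illustration; the variable and function names are hypothetical. The throughput ratio cannot be reproduced in the same way from the table, since the throughput columns are rounded to two decimals.

```python
# (gene, EPT delay, simulated delay, listed delay difference) from Table B.1.
SAMPLE_ROWS = [
    ("b1070", 112.12, 70.94, 1.58),
    ("b1071", 69.92, 50.51, 1.38),
    ("b1879", 638.59, 277.86, 2.30),
    ("b1942", 140.77, 121.90, 1.15),
]


def delay_ratio(ept_delay, simulated_delay):
    """Ratio of the EPT delay estimate to the simulated reference delay."""
    return ept_delay / simulated_delay


# Deviation of each recomputed ratio from the value listed in the table.
deviations = {
    gene: abs(delay_ratio(ept, sim) - listed)
    for gene, ept, sim, listed in SAMPLE_ROWS
}
```

All four recomputed ratios agree with the listed values to within the rounding of the table.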



Columns: KEGG gene identifier, nr. of codons, EPT throughput, EPT delay, simulated throughput, simulated delay, throughput difference, delay difference.

b0114  888  0.11  658.24  0.01  307.57  7.32  2.14
b0115  631  0.11  452.77  0.02  209.46  7.46  2.16
b0116  475  0.11  342.49  0.04  176.92  3.04  1.94
b0356  370  0.10  289.25  0.06  163.94  1.84  1.76
b0688  547  0.10  435.77  0.03  211.34  3.91  2.06
b0755  251  0.15  142.05  0.11   90.15  1.34  1.58
b0756  347  0.09  309.05  0.05  183.03  1.74  1.69
b1002  414  0.10  342.88  0.04  186.45  2.20  1.84
b1101  478  0.11  373.22  0.04  199.35  2.75  1.87
b1241  892  0.11  669.04  0.02  301.94  6.97  2.22
b1478  337  0.09  287.31  0.06  177.55  1.68  1.62
b1621  531  0.09  465.26  0.02  231.96  3.95  2.01
b1676  471  0.11  347.23  0.04  179.57  2.96  1.93
b1723  310  0.09  276.40  0.06  157.18  1.53  1.76
b1734  451  0.10  359.77  0.04  190.02  2.64  1.89
b1779  332  0.12  220.32  0.08  129.29  1.57  1.70
b1780  295  0.11  232.56  0.07  129.12  1.53  1.80
b1854  481  0.10  380.77  0.03  191.49  3.04  1.99
b2097  351  0.10  286.79  0.06  152.91  1.70  1.88
b2388  322  0.10  261.26  0.06  157.57  1.59  1.66
b2417  170  0.12  114.26  0.12   73.96  1.06  1.54
b2453  396  0.10  342.04  0.05  191.80  2.05  1.78
b2715  486  0.09  434.60  0.03  224.91  3.15  1.93
b2716  475  0.09  414.06  0.03  215.21  2.91  1.92
b2779  433  0.12  291.58  0.05  151.96  2.34  1.92
b2901  480  0.10  390.92  0.04  203.63  2.85  1.92
b2925  360  0.11  257.19  0.06  143.62  1.75  1.79
b2926  388  0.13  241.24  0.06  133.98  2.04  1.80
b3403  541  0.10  447.23  0.03  220.12  3.92  2.03
b3589  384  0.09  341.48  0.05  194.37  1.98  1.76
b3612  515  0.11  392.80  0.03  196.45  3.38  2.00
b3721  471  0.09  414.48  0.03  213.86  2.71  1.94
b3916  321  0.11  224.40  0.08  132.49  1.53  1.69
b3919  256  0.13  156.08  0.10   94.72  1.30  1.65
b3925  337  0.11  256.70  0.07  138.10  1.63  1.86
b4025  550  0.10  451.24  0.02  217.41  4.22  2.08
b4069  653  0.09  555.45  0.01  250.32  8.31  2.22
b4232  333  0.10  262.23  0.06  154.39  1.60  1.70
b4395  216  0.10  178.66  0.08  109.19  1.28  1.64

Table B.2: Glycolysis pathway results.


Columns: KEGG gene identifier, nr. of codons, EPT throughput, EPT delay, simulated throughput, simulated delay, throughput difference, delay difference.

b0779  674  0.10  552.63  0.01  247.86   9.18  2.23
b1741  296  0.09  263.23  0.06  159.63   1.50  1.65
b1913  611  0.10  515.66  0.01  235.81   7.73  2.19
b2411  672  0.11  505.17  0.01  224.26   9.10  2.25
b3647  561  0.09  517.85  0.02  245.04   5.08  2.11
b3813  721  0.09  663.79  0.01  279.96  12.35  2.37
b3863  929  0.10  763.30  0.01  336.17   9.38  2.27
b4058  941  0.10  777.23  0.01  350.36   8.23  2.22

Table B.3: SOS response pathway results.


Table B.4: Corrected results for all pathways.


b0779  674  0.10  552.63  0.01  247.86  0.99  1.01
b1741  296  0.09  263.23  0.06  159.63  1.04  1.03
b1913  611  0.10  515.66  0.01  235.81  1.23  0.99
b2411  672  0.11  505.17  0.01  224.26  1.00  1.00
b3647  561  0.09  517.85  0.02  245.04  1.10  0.99
b3813  721  0.09  663.79  0.01  279.96  1.01  0.98
b0115  631  0.11  452.77  0.02  209.46  1.05  1.02
b0116  475  0.11  342.49  0.04  176.92  1.08  1.01
b0356  370  0.10  289.25  0.06  163.94  1.04  1.02
b0688  547  0.10  435.77  0.03  211.34  0.92  1.00
b0755  251  0.15  142.05  0.11   90.15  1.02  1.03
b0756  347  0.09  309.05  0.05  183.03  1.06  1.05
b1002  414  0.10  342.88  0.04  186.45  1.05  1.02
b1101  478  0.11  373.22  0.04  199.35  0.96  1.05
b1478  337  0.09  287.31  0.06  177.55  1.05  1.08
b1621  531  0.09  465.26  0.02  231.96  1.02  1.02
b1676  471  0.11  347.23  0.04  179.57  1.07  1.01
b1723  310  0.09  276.40  0.06  157.18  1.03  0.97
b1734  451  0.10  359.77  0.04  190.02  1.06  1.02
b1779  332  0.12  220.32  0.08  129.29  1.00  1.03
b1780  295  0.11  232.56  0.07  129.12  1.06  0.94
b1854  481  0.10  380.77  0.03  191.49  1.04  0.99
b2097  351  0.10  286.79  0.06  152.91  1.02  0.95
b2388  322  0.10  261.26  0.06  157.57  1.03  1.04
b2417  170  0.12  114.26  0.12   73.96  0.93  0.97
b2453  396  0.10  342.04  0.05  191.80  1.05  1.03
b2715  486  0.09  434.60  0.03  224.91  1.05  1.02
b2716  475  0.09  414.06  0.03  215.21  1.03  1.02
b2779  433  0.12  291.58  0.05  151.96  1.02  0.99
b2901  480  0.10  390.92  0.04  203.63  0.98  1.03
b2925  360  0.11  257.19  0.06  143.62  1.02  1.00
b2926  388  0.13  241.24  0.06  133.98  1.08  1.02
b3403  541  0.10  447.23  0.03  220.12  0.95  1.01
b3589  384  0.09  341.48  0.05  194.37  1.06  1.04
b3612  515  0.11  392.80  0.03  196.45  0.96  1.01
b3721  471  0.09  414.48  0.03  213.86  0.98  1.01
b3916  321  0.11  224.40  0.08  132.49  1.00  1.02
b3919  256  0.13  156.08  0.10   94.72  0.98  0.99
b3925  337  0.11  256.70  0.07  138.10  1.02  0.94
b4025  550  0.10  451.24  0.02  217.41  0.98  1.00
b4069  653  0.09  555.45  0.01  250.32  1.02  1.00
b4232  333  0.10  262.23  0.06  154.39  1.01  1.03
b4395  216  0.10  178.66  0.08  109.19  1.04  0.96
b1070  139  0.10  112.12  0.10   70.94  0.96  0.92
b1071   98  0.11   69.92  0.11   50.51  1.05  1.01
b1072  220  0.09  201.58  0.07  145.56  0.95  1.14
b1073  139  0.09  124.46  0.09   79.92  0.98  0.94
b1074  135  0.10  112.25  0.09   72.34  0.99  0.94
b1075  232  0.10  195.50  0.08  124.02  0.97  1.01
b1076  403  0.09  355.79  0.05  187.07  1.01  0.97
b1077  252  0.10  207.66  0.07  121.83  1.02  0.95
b1078  261  0.10  216.48  0.07  122.54  1.06  0.93
b1079  233  0.10  195.36  0.08  128.38  1.05  1.05
b1080  366  0.09  335.92  0.05  194.73  1.01  1.04
b1082  548  0.09  480.69  0.02  232.11  0.99  1.00
b1083  318  0.09  270.91  0.06  158.92  1.02  1.01
b1879  693  0.09  638.59  0.01  277.86  1.07  1.00
b1880  383  0.09  340.89  0.05  194.20  1.05  1.04
b1889  309  0.09  291.47  0.06  176.39  1.05  1.04
b1890  296  0.10  253.95  0.07  153.90  1.03  1.03
b1891  193  0.09  173.25  0.08  123.56  0.91  1.10
b1892  117  0.11   88.52  0.11   63.15  0.96  1.02
b1923  499  0.10  384.57  0.03  191.23  0.97  0.99
b1924  469  0.09  436.68  0.03  233.17  1.00  1.04
b1925  137  0.09  127.60  0.08   80.78  1.01  0.92
b1926  122  0.10   98.88  0.10   61.38  1.07  0.89
b1937  105  0.11   81.07  0.10   58.44  1.05  1.01
b1938  553  0.10  468.08  0.02  229.09  1.00  1.02
b1939  332  0.09  302.36  0.05  186.71  1.06  1.08
b1940  229  0.09  208.03  0.07  137.62  0.94  1.05
b1941  458  0.10  392.60  0.03  203.35  1.07  1.00
b1942  148  0.08  140.77  0.08  121.90  0.99  1.27
b1943  376  0.09  339.50  0.05  194.04  1.07  1.04
b1945  335  0.09  288.19  0.06  171.56  1.01  1.04



b1946  138  0.11  105.07  0.10   70.20  0.97  0.97
b1947  122  0.10   99.49  0.09   67.07  1.08  0.97
b1948  246  0.10  206.00  0.08  129.76  1.00  1.02
b1949   90  0.09   83.23  0.09   61.70  1.05  1.03
b1950  262  0.09  230.83  0.07  137.66  1.02  0.98

[Scatter plot not reproduced. Horizontal axis ticks: 0 to 700; vertical axis: throughput w.r.t. simulations, ticks 0 to 12.]

Figure B.1: Scatter plot showing the difference in throughput for the flagellar assembly pathway.


[Scatter plot not reproduced. Horizontal axis ticks: 0 to 700; vertical axis: delay w.r.t. simulations, ticks 1 to 2.5.]

Figure B.2: Scatter plot showing the difference in delay for the flagellar assembly pathway.

[Scatter plot not reproduced. Horizontal axis ticks: 100 to 900; vertical axis: throughput w.r.t. simulations, ticks 1 to 9.]

Figure B.3: Scatter plot showing the difference in throughput for the glycolysis pathway.


[Scatter plot not reproduced. Horizontal axis ticks: 100 to 900; vertical axis: delay w.r.t. simulations, ticks 1.5 to 2.3.]

Figure B.4: Scatter plot showing the difference in delay for the glycolysis pathway.

[Scatter plot not reproduced. Horizontal axis ticks: 200 to 1000; vertical axis: throughput w.r.t. simulations, ticks 0 to 14.]

Figure B.5: Scatter plot showing the difference in throughput for the SOS response pathway.


[Scatter plot not reproduced. Horizontal axis ticks: 200 to 1000; vertical axis: delay w.r.t. simulations, ticks 1.6 to 2.5.]

Figure B.6: Scatter plot showing the difference in delay for the SOS response pathway.


Appendix C

Adapted EPT algorithm pseudocode

Algorithm 2 FitDistribution(X)

1: {Account for the offset of X}
2: E[X] ← E[X] − offset
3: E[X²] ← Var(X) + E[X]²
4: Var(X) ← E[X²] + offset² − 2 ∗ offset ∗ E[X]
5: {Fit a distribution on X}
6: scv ← Var(X) / E[X]²
7: if scv ≤ 1 then
8:     Type(X) ← Erlang_{k−1,k}
9:     k(X) ← ⌈1/scv⌉
10:    p(X) ← (k ∗ scv − √(k(1 + scv) − k² ∗ scv)) / (1 + scv)
11:    µ(X) ← (k(X) − p(X)) / E[X]
12: else
13:    Type(X) ← Hyper-exponential
14:    a ← 1 / (2 ∗ scv)
15:    µ1(X) ← 2 / E[X]
16:    µ2(X) ← a ∗ µ1(X)
17:    p(X) ← a / (1 − µ2(X)/µ1(X))
18: end if
19: return X
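The fitting step of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration, not the thesis implementation (the Matlab routine `fitScenario2` is not shown in this document): the function name and the dictionary layout are hypothetical, the offset handling of steps 1 to 4 is left out, and the Erlang branch uses the standard two-moment fit p = (k·scv − √(k(1 + scv) − k²·scv)) / (1 + scv). In the Hyper-exponential branch, p is taken to be the weight of the phase with rate µ2, which is the reading under which the fit reproduces the requested mean and scv.

```python
import math


def fit_two_moments(mean, scv):
    """Fit a phase-type distribution to a mean and a squared coefficient of
    variation (scv). Hypothetical helper; offsets are assumed to be removed."""
    if mean <= 0 or scv <= 0:
        raise ValueError("mean and scv must be positive")
    if scv <= 1:
        # Mixture of Erlang-(k-1) (probability p) and Erlang-k (probability 1-p).
        k = math.ceil(1 / scv)
        # max(...) guards against a tiny negative value caused by rounding.
        root = math.sqrt(max(0.0, k * (1 + scv) - k * k * scv))
        p = (k * scv - root) / (1 + scv)
        mu = (k - p) / mean
        return {"type": "erlang_mix", "k": k, "p": p, "mu": mu}
    # scv > 1: two exponential phases, p being the weight of rate mu2.
    a = 1 / (2 * scv)
    mu1 = 2 / mean
    mu2 = a * mu1
    p = a / (1 - mu2 / mu1)
    return {"type": "hyperexp", "p": p, "mu1": mu1, "mu2": mu2}


low_variability = fit_two_moments(2.0, 0.4)   # Erlang mixture with k = 3
high_variability = fit_two_moments(3.0, 2.0)  # Hyper-exponential
```

For scv = 0.5 the fit degenerates to a plain Erlang-2 distribution (p = 0), which is a quick sanity check on the formula for p.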


Algorithm 3 DetermineA(S_{i+1,j}, A_{i,j−1}, D_{i,j−1}, W_{i+1}, f_i)

1: {Step 1: Determine f_i}
2: f_i ← 1 − Σ_{j<k(A_{i,j−1})} ProbPhases(j, A_{i,j−1}, D_{i,j−1})
3:
4: {Step 2: Determine W_{i+1}}
5: t ← (k(A_{i,j−1}) − j) / µ(A_{i,j−1})
6: E[W_{i+1}] ← Σ_{j<k(A_{i,j−1})} (ProbPhases(j, A_{i,j−1}, D_{i,j−1}) ∗ t) / Σ_{j<k(A_{i,j−1})} ProbPhases(j, A_{i,j−1}, D_{i,j−1})
7: t ← ((k(A_{i,j−1}) − j)(k(A_{i,j−1}) − j + 1)) / µ(A_{i,j−1})²
8: E[W²_{i+1}] ← Σ_{j<k(A_{i,j−1})} (ProbPhases(j, A_{i,j−1}, D_{i,j−1}) ∗ t) / Σ_{j<k(A_{i,j−1})} ProbPhases(j, A_{i,j−1}, D_{i,j−1})
9:
10: {Step 3: Determine A_{i+1,j}}
11: E[A_{i+1,j}] ← f_i E[S_{i+1,j}] + (1 − f_i)(E[S_{i+1,j}] + E[W_{i+1}])
12: E[A²_{i+1,j}] ← f_i E[S²_{i+1,j}] + (1 − f_i)(E[S²_{i+1,j}] + 2E[S_{i+1,j}]E[W_{i+1}] + E[W²_{i+1}])
13: return A_{i+1,j}
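Step 3 of Algorithm 3 mixes two cases: with probability f_i the arrival time equals the service time S, and otherwise it equals S plus the waiting time W. A minimal Python sketch of that step is given below; the function name and argument names are hypothetical, and the cross term 2E[S]E[W] relies on S and W being treated as independent, as the pseudocode does.

```python
def next_arrival_moments(f, s_m1, s_m2, w_m1, w_m2):
    """First two moments of A_{i+1,j} (sketch of Algorithm 3, step 3).

    f          -- probability f_i that no waiting time is added
    s_m1, s_m2 -- E[S] and E[S^2] of the service time
    w_m1, w_m2 -- E[W] and E[W^2] of the waiting time
    """
    # First moment: S with probability f, S + W otherwise.
    m1 = f * s_m1 + (1 - f) * (s_m1 + w_m1)
    # Second moment: E[(S + W)^2] = E[S^2] + 2 E[S] E[W] + E[W^2].
    m2 = f * s_m2 + (1 - f) * (s_m2 + 2 * s_m1 * w_m1 + w_m2)
    return m1, m2


moments = next_arrival_moments(0.5, 2.0, 5.0, 1.0, 3.0)
```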

Algorithm 4 ProbPhases(j, A_{i,j}, D_{i,j})

1: return $\binom{k(D_{i,j}) - 1 + j}{k(D_{i,j}) - 1} \left(\frac{\mu(A_{i,j})}{\mu(A_{i,j}) + \mu(D_{i,j})}\right)^{j} \left(\frac{\mu(D_{i,j})}{\mu(A_{i,j}) + \mu(D_{i,j})}\right)^{k(D_{i,j})}$

Algorithm 5 DetermineD(S_{i+1,j}, D_{i+1,j}, A_{i,j−1}, D_{i,j−1}, W_i, f_i)

1: {Step 1: Determine topmost dependency}
2: D1 ← max(S_{i+1,j}, D_{i+1,j})
3:
4: {Step 2: Determine residual service time}
5: t ← (k(W_i) − j) / µ(W_i)
6: E[R] ← Σ_{j<k(W_i)} (ProbPhases(j, W_i, D_{i+1,j}) ∗ t) / Σ_{j<k(W_i)} ProbPhases(j, W_i, D_{i+1,j})
7: t ← ((k(W_i) − j)(k(W_i) − j + 1)) / µ(W_i)²
8: E[R²] ← Σ_{j<k(W_i)} (ProbPhases(j, W_i, D_{i+1,j}) ∗ t) / Σ_{j<k(W_i)} ProbPhases(j, W_i, D_{i+1,j})
9: R ← FitDistribution(R)
10:
11: {Step 3: Determine p_{i,j}}
12: p_{i,j} ← 1 − Σ_{j<k(W_i)} ProbPhases(j, W_i, D_{i+1,j})
13:
14: {Step 5: Determine lower dependency}
15: B ← max(S_{i+1,j}, R)
16: E[D2] ← p_{i,j} E[B] + (1 − p_{i,j}) E[S_{i+1,j}]
17: E[D2²] ← p_{i,j} E[B²] + (1 − p_{i,j}) E[S²_{i+1,j}]
18:
19: {Step 6: Determine D_{i,j}}
20: E[D_{i,j}] ← f_i E[D1] + (1 − f_i) E[D2]
21: E[D²_{i,j}] ← f_i E[D1²] + (1 − f_i) E[D2²]
22: return D_{i,j}


Algorithm 6 InitPeriodicSTable(S′, period, maxcycles)

1: K ← length(S′)
2: S ← array[1 . . . K, 0 . . . maxcycles]
3: for all S_{i,j} do
4:     S_{i,j} ← 0
5: end for
6: for i = 0 to K − 1 do
7:     n ← 0
8:     while (i + n ∗ period) < maxcycles do
9:         S_{i+1, i+n∗period} ← S′_{i+1}
10:        n ← n + 1
11:    end while
12: end for
13: return S
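A direct Python transcription of Algorithm 6 may make the indexing easier to follow. It is a sketch under two assumptions: servers are numbered from 0 instead of 1, and each service time is represented by a single number rather than by a fitted distribution. The function name is hypothetical.

```python
def init_periodic_s_table(base_service, period, max_cycles):
    """Build table[i][j]: the service time of server i in cycle j, or 0 when
    the server does not start a product in that cycle (sketch of Algorithm 6)."""
    num_servers = len(base_service)
    # Cycles run from 0 up to and including max_cycles, as in the pseudocode.
    table = [[0.0] * (max_cycles + 1) for _ in range(num_servers)]
    for i in range(num_servers):
        n = 0
        # Server i is first active in cycle i and then once every `period` cycles.
        while i + n * period < max_cycles:
            table[i][i + n * period] = base_service[i]
            n += 1
    return table


table = init_periodic_s_table([1.0, 2.0], period=3, max_cycles=7)
```

In this example the first server is active in cycles 0, 3 and 6, and the second server in cycles 1 and 4.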


Appendix D

Mathematical operations

Fitting distributions

The EPT algorithm makes frequent use of stochastic variables. In fact, the algorithm relies on the ability to capture the behaviour of each $M_i$ by fitting a phase-type distribution on the first two moments of the service time. Therefore, this section introduces the phase-type distributions that are used to fit stochastic variables.

The most basic phase-type distribution is the Erlang$_k$ distribution. This distribution is not fitted directly to a stochastic variable, but it is used as a basis in most of the calculations performed by the algorithm. The actual distributions that are fitted to the stochastic variables, the Erlang$_{k-1,k}$ and Hyper-exponential distributions, can all be conditioned to Erlang$_k$ distributed variables. Therefore, it is convenient to express the calculations in terms of Erlang$_k$ distributed variables. An Erlang$_k$ distribution consists of $k$ exponentially distributed phases with rate $\mu$. Figure D.1 shows the phase diagram of an Erlang$_k$ distributed variable.

[Phase diagram not reproduced: phases 1, 2, . . . , k in series, each with rate µ.]

Figure D.1: Phase diagram of an Erlang$_k$ distributed random variable.

One of the phase-type distributions the algorithm uses to fit stochastic variables is the Erlang$_{k-1,k}$ distribution. This distribution consists of two Erlang distributions. With probability $p$, an Erlang$_{k-1,k}$ distributed random variable is Erlang$_{k-1}$ distributed with rate $\mu$. Similarly, with probability $1 - p$ an Erlang$_{k-1,k}$ distributed random variable is Erlang$_k$ distributed with rate $\mu$. Figure D.2 shows the phase diagram for such a random variable.

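[Phase diagram not reproduced: phases 1, 2, . . . , k−1, k in series, each with rate µ; branches labelled p and 1−p distinguish the Erlang$_{k-1}$ and Erlang$_k$ cases.]

Figure D.2: Phase diagram of an Erlang$_{k-1,k}$ distributed random variable.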

The other phase-type distribution that is used to fit stochastic variables is the Hyper-exponential distribution. This distribution consists of two exponential distributions. With probability $p$, a Hyper-exponential distributed random variable is exponentially distributed with rate $\mu_1$. Similarly, with probability $1 - p$ the random variable is exponentially distributed with rate $\mu_2$. A phase diagram for such a variable can be seen in Figure D.3. Note that an exponential distribution is equal to an Erlang$_1$ distribution.

Table D.1 shows the properties of each distribution. Using these properties, the algorithm can fit probability distributions on stochastic variables. In order to decide which distribution is fitted,


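[Phase diagram not reproduced: two parallel single phases with rates µ1 and µ2, selected by branches labelled p and 1−p.]

Figure D.3: Phase diagram of a Hyper-exponential distributed random variable.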

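Property | Erlang$_k$ | Erlang$_{k-1,k}$ | Hyper-exponential

$E(X)$ | $\frac{k}{\mu}$ | $\frac{k-p}{\mu}$ | $\frac{p}{\mu_1} + \frac{1-p}{\mu_2}$

$\sigma_X^2$ | $\frac{k}{\mu^2}$ | $\frac{k-p}{\mu^2}$ | $\frac{p}{\mu_1^2} + \frac{1-p}{\mu_2^2}$

$c_X^2$ | $\frac{1}{k}$ | $\frac{1}{k-p}$ | $\frac{(1-p)\mu_1^2 + p\mu_2^2}{((p-1)\mu_1 - p\mu_2)^2}$

Table D.1: Properties of random variable X for various distributions.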

the algorithm first calculates the squared coefficient of variation (scv). If this scv is lower than 1, an Erlang$_{k-1,k}$ distribution is fitted. Otherwise, a Hyper-exponential distribution is fitted. The pseudocode corresponding to the fitting procedure can be found in Appendix C.

Conditioning variables

The previous section already mentioned that Erlang$_{k-1,k}$ and Hyper-exponential distributed variables can be conditioned to Erlang$_k$ distributed variables. This conditioning is made explicit in Figure D.4. Furthermore, Table D.2 shows how the algorithm conditions a function that operates on two random variables. In this table, the function $e(\mu, k)$ constructs a new Erlang$_k$ distributed random variable with rate parameter $\mu$. The algorithm itself performs this conditioning quite extensively.

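[Diagram not reproduced. $X \sim$ Erlang$_{k-1,k}(\mu, k, p)$: with probability $p$, $X \sim$ Erlang$_{k-1}(\mu)$; with probability $1 - p$, $X \sim$ Erlang$_k(\mu)$. $X \sim$ Hyper-exponential$(\mu_1, \mu_2, p)$: with probability $p$, $X \sim$ Erlang$_1(\mu_2)$; with probability $1 - p$, $X \sim$ Erlang$_1(\mu_1)$.]

Figure D.4: Conditioning random variable X to an Erlang$_k$ distribution.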
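The conditioning can be sketched in Python as an expansion of each fitted variable into weighted Erlang components, after which a function of two variables is evaluated as a weighted sum over all component pairs. The dictionary layout and both function names are hypothetical. The assignment of p to the rate µ2 in the Hyper-exponential case follows Figure D.4 and Table D.2.

```python
def erlang_components(dist):
    """Expand a fitted distribution into (weight, phases, rate) Erlang parts."""
    if dist["type"] == "erlang_mix":
        k, p, mu = dist["k"], dist["p"], dist["mu"]
        parts = [(p, k - 1, mu), (1 - p, k, mu)]
    elif dist["type"] == "hyperexp":
        p = dist["p"]
        parts = [(p, 1, dist["mu2"]), (1 - p, 1, dist["mu1"])]
    else:
        raise ValueError("unknown distribution type")
    # Components with zero weight do not contribute and are dropped.
    return [part for part in parts if part[0] > 0]


def conditioned(f, x, y):
    """Evaluate f on every pair of Erlang components, weighted by probability."""
    return sum(
        wx * wy * f((kx, mx), (ky, my))
        for wx, kx, mx in erlang_components(x)
        for wy, ky, my in erlang_components(y)
    )


def sum_of_means(a, b):
    """Example f: mean of the sum of two Erlang variables, each given as (k, mu)."""
    return a[0] / a[1] + b[0] / b[1]


x = {"type": "erlang_mix", "k": 2, "p": 0.25, "mu": 1.0}
y = {"type": "hyperexp", "p": 1 / 3, "mu1": 2.0, "mu2": 0.5}
total_mean = conditioned(sum_of_means, x, y)
```

Here `sum_of_means` only serves as a simple stand-in for f; in the algorithm the same expansion is applied to functions such as the maximum of two variables.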

Determining the maximum

The algorithm frequently determines the maximum of two random variables. Due to the phase-type distributions that have been fitted, these variables are always either Erlang$_{k-1,k}$ or Hyper-exponentially distributed. Determining the maximum of two variables that follow a different distribution is not straightforward. Fortunately, the previous section presented a way to condition the variables such that a function operating on two random variables can be expressed as a combination of functions that only operate on Erlang$_k$ distributed variables. Therefore, this section presents a function to determine the maximum of two Erlang$_k$ distributed random variables.

Suppose there are two Erlang$_k$ distributed random variables: $E_1$ and $E_2$. Variable $E_1$ is Erlang$_{k_1}$ distributed with rate $\mu_1$ and, likewise, $E_2$ is Erlang$_{k_2}$ distributed with rate $\mu_2$. The maximum of these variables is characterized by $k_1 + k_2$ phases. Note that, in order to determine the maximum, it must be determined which rates belong to which phases. To this end, two cases can be distinguished:

1. Variable E1 finishes first and the remaining phases of E2 determine max{E1, E2}.


X | Y | Expression

E¹ | E | $p_X p_Y f(e(\mu_X, k_X - 1), e(\mu_Y, k_Y - 1)) + p_X (1 - p_Y) f(e(\mu_X, k_X - 1), e(\mu_Y, k_Y)) + (1 - p_X) p_Y f(e(\mu_X, k_X), e(\mu_Y, k_Y - 1)) + (1 - p_X)(1 - p_Y) f(e(\mu_X, k_X), e(\mu_Y, k_Y))$

E | H² | $p_X p_Y f(e(\mu_X, k_X - 1), e(\mu_{Y,2}, 1)) + p_X (1 - p_Y) f(e(\mu_X, k_X - 1), e(\mu_{Y,1}, 1)) + (1 - p_X) p_Y f(e(\mu_X, k_X), e(\mu_{Y,2}, 1)) + (1 - p_X)(1 - p_Y) f(e(\mu_X, k_X), e(\mu_{Y,1}, 1))$

H | E | $p_X p_Y f(e(\mu_{X,2}, 1), e(\mu_Y, k_Y - 1)) + p_X (1 - p_Y) f(e(\mu_{X,2}, 1), e(\mu_Y, k_Y)) + (1 - p_X) p_Y f(e(\mu_{X,1}, 1), e(\mu_Y, k_Y - 1)) + (1 - p_X)(1 - p_Y) f(e(\mu_{X,1}, 1), e(\mu_Y, k_Y))$

H | H | $p_X p_Y f(e(\mu_{X,2}, 1), e(\mu_{Y,2}, 1)) + p_X (1 - p_Y) f(e(\mu_{X,2}, 1), e(\mu_{Y,1}, 1)) + (1 - p_X) p_Y f(e(\mu_{X,1}, 1), e(\mu_{Y,2}, 1)) + (1 - p_X)(1 - p_Y) f(e(\mu_{X,1}, 1), e(\mu_{Y,1}, 1))$

¹ E denotes an Erlang$_{k-1,k}$ distribution.
² H denotes a Hyper-exponential distribution.

Table D.2: Conditioning f(X, Y).

2. Variable E2 finishes first and the remaining phases of E1 determine max{E1, E2}.

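[Phase diagram of max{E1, E2} not reproduced. With probability $q_{1,j}$: phases $1, 2, \ldots, k_1 + j$ have rate $\mu_1 + \mu_2$ and phases $k_1 + j + 1, \ldots, k_1 + k_2$ have rate $\mu_2$. With probability $q_{2,i}$: phases $1, 2, \ldots, k_2 + i$ have rate $\mu_1 + \mu_2$ and phases $k_2 + i + 1, \ldots, k_1 + k_2$ have rate $\mu_1$.]

Figure D.5: Phase diagram for max{E1, E2}.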

Due to this case distinction, the algorithm explicitly conditions on which variable finishes first. In case 1 this is $E_1$ and in case 2 it is $E_2$. The phase diagram of $\max\{E_1, E_2\}$ can be seen in Figure D.5. In this figure, $q_{1,j}$ denotes the probability that $E_1$ finishes first when $E_2$ already has $j$ finished phases, and $q_{2,i}$ denotes the probability that $E_2$ finishes first when $E_1$ already has $i$ finished phases. Note that in both cases, the first phases have a rate of $\mu_1 + \mu_2$. Only when one variable has finished do the remaining phases of the other variable have the rate belonging to that variable. Both probabilities $q_{1,j}$ and $q_{2,i}$ can be expressed as follows:

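$$q_{1,j} = \binom{k_1 - 1 + j}{k_1 - 1} \left(\frac{\mu_2}{\mu_1 + \mu_2}\right)^{j} \left(\frac{\mu_1}{\mu_1 + \mu_2}\right)^{k_1} \quad \text{where } 0 \le j \le k_2 - 1 \tag{D.1}$$

$$q_{2,i} = \binom{k_2 - 1 + i}{k_2 - 1} \left(\frac{\mu_1}{\mu_1 + \mu_2}\right)^{i} \left(\frac{\mu_2}{\mu_1 + \mu_2}\right)^{k_2} \quad \text{where } 0 \le i \le k_1 - 1 \tag{D.2}$$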

The conditional maximums that correspond to probabilities $q_{1,j}$ and $q_{2,i}$ are denoted by $M_{1,j}$ and $M_{2,i}$ respectively. Their first two moments are expressed as follows:


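$$E[M_{1,j}(k_1, k_2)] = \frac{k_1 + j}{\mu_1 + \mu_2} + \frac{k_2 - j}{\mu_2} \tag{D.3}$$

$$E[M_{1,j}^2(k_1, k_2)] = \frac{k_1 + j}{(\mu_1 + \mu_2)^2} + \frac{k_2 - j}{\mu_2^2} + \left(\frac{k_1 + j}{\mu_1 + \mu_2} + \frac{k_2 - j}{\mu_2}\right)^2 \tag{D.4}$$

$$E[M_{2,i}(k_1, k_2)] = \frac{k_2 + i}{\mu_1 + \mu_2} + \frac{k_1 - i}{\mu_1} \tag{D.5}$$

$$E[M_{2,i}^2(k_1, k_2)] = \frac{k_2 + i}{(\mu_1 + \mu_2)^2} + \frac{k_1 - i}{\mu_1^2} + \left(\frac{k_2 + i}{\mu_1 + \mu_2} + \frac{k_1 - i}{\mu_1}\right)^2 \tag{D.6}$$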

Using Equations (D.1) up to (D.6), the first two moments of $\max\{E_1, E_2\}$ can be conditioned on all possible values of $i$ and $j$. This yields the following equations:

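$$E[\max\{E_1, E_2\}] = \sum_{j=0}^{k_2 - 1} q_{1,j}\, E[M_{1,j}(k_1, k_2)] + \sum_{i=0}^{k_1 - 1} q_{2,i}\, E[M_{2,i}(k_1, k_2)] \tag{D.7}$$

$$E[\max\{E_1, E_2\}^2] = \sum_{j=0}^{k_2 - 1} q_{1,j}\, E[M_{1,j}^2(k_1, k_2)] + \sum_{i=0}^{k_1 - 1} q_{2,i}\, E[M_{2,i}^2(k_1, k_2)] \tag{D.8}$$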
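Equations (D.1) to (D.8) translate into a short Python routine. The sketch below is an illustration with hypothetical function names; it evaluates the formulas exactly as stated, using `math.comb` for the binomial coefficient.

```python
from math import comb


def finish_first_prob(done, k_first, mu_first, mu_other):
    """Probability that the variable with k_first phases finishes first while
    the other variable has completed `done` phases (Equations D.1 and D.2)."""
    total = mu_first + mu_other
    return (
        comb(k_first - 1 + done, k_first - 1)
        * (mu_other / total) ** done
        * (mu_first / total) ** k_first
    )


def max_erlang_moments(k1, mu1, k2, mu2):
    """First two moments of max{E1, E2} (Equations D.7 and D.8)."""
    total = mu1 + mu2
    m1 = 0.0
    m2 = 0.0
    for j in range(k2):  # E1 finishes first, E2 has j finished phases
        q = finish_first_prob(j, k1, mu1, mu2)
        mean = (k1 + j) / total + (k2 - j) / mu2                       # (D.3)
        second = (k1 + j) / total**2 + (k2 - j) / mu2**2 + mean**2     # (D.4)
        m1 += q * mean
        m2 += q * second
    for i in range(k1):  # E2 finishes first, E1 has i finished phases
        q = finish_first_prob(i, k2, mu2, mu1)
        mean = (k2 + i) / total + (k1 - i) / mu1                       # (D.5)
        second = (k2 + i) / total**2 + (k1 - i) / mu1**2 + mean**2     # (D.6)
        m1 += q * mean
        m2 += q * second
    return m1, m2


exp_case = max_erlang_moments(1, 1.0, 1, 2.0)
```

For two exponential variables with rates 1 and 2 the first moment equals 1 + 1/2 − 1/3, the well-known mean of the maximum of two independent exponentials, which gives a simple check of the routine.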

If $E_1$ and $E_2$ consist of an equal deterministic offset and a stochastic part, the procedure mentioned above is slightly different. Assume that $E_i = Y + Z_i$, for $i \in \{1, 2\}$. Then the following equations characterize the first two moments of the maximum:

$$E[\max\{E_1, E_2\}] = Y + E[\max\{Z_1, Z_2\}] \tag{D.9}$$

$$E[\max\{E_1, E_2\}^2] = Y^2 + 2Y\, E[\max\{Z_1, Z_2\}] + E[\max\{Z_1, Z_2\}^2] \tag{D.10}$$
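Equations (D.9) and (D.10) amount to shifting the moments of the stochastic part by the common offset. A minimal Python sketch, with a hypothetical function name:

```python
def shift_moments(offset, m1, m2):
    """Moments of Y + M for a deterministic offset Y, given E[M] and E[M^2]
    (Equations D.9 and D.10, with M = max{Z1, Z2})."""
    shifted_m1 = offset + m1
    shifted_m2 = offset**2 + 2 * offset * m1 + m2
    return shifted_m1, shifted_m2


shifted = shift_moments(2.0, 1.5, 3.5)
```

As expected for a deterministic shift, the variance is unchanged: in the example it is 1.25 both before and after adding the offset.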

Note that as $Z_1$ and $Z_2$ are stochastic variables without offset, Equations D.7 and D.8 can be used to determine the first two moments of $\max\{Z_1, Z_2\}$.


Appendix E

Matlab code of the adapted EPT algorithm

EvalEPTO

function [] = evalEPTO( S, period, numiterations )
% Suppress the warning that nchoosek generates, it pops up in determining
% fi. Unfortunately, this is not doable symbolically in Matlab.
warning off MATLAB:nchoosek:LargeCoefficient
format('shortG');

% determine if the input is valid and the number of machines in the line
[rows_S, cols_S] = size(S);
numservers = rows_S;
if cols_S ~= 3
    disp('The matrix provided as argument should have three columns!');
    return;
end
if numservers < 2
    disp('It is required to have at least 2 machines in the line!');
    return;
end

% STEP 1: Initialization
% S(i,1) and S(i,2) are respectively the mean and variance of X. S(i,3) is
% the offset C (recall X = C + Z). Now we first have to infer the mean and
% variance of Z.
Z = zeros(numservers,2);
for i = 1:numservers
    Z(i,1) = S(i,1)-S(i,3);
    xm2 = S(i,2) + S(i,1)*S(i,1);
    Z(i,2) = xm2 + S(i,3)*S(i,3) - 2*S(i,3)*S(i,1);
end

% Save characteristics of machines and fit distributions.
SS = zeros(period, numservers, 10);
sidx = 1;
for i = 1:numservers


    SS(sidx,i,1) = Z(i,1);
    SS(sidx,i,2) = Z(i,2);
    SS(sidx,i,8) = S(i,3);
    SS(sidx,i,9) = S(i,3) + Z(i,1);
    SS(sidx,i,10) = S(i,2);
    sidx = sidx+1;
    if sidx == period+1
        sidx = 1;
    end
end
for i = 1:period
    for j = 1:numservers
        % fit distributions
        [SS(i,j,1), SS(i,j,2), SS(i,j,3), SS(i,j,4), SS(i,j,5), ...
            SS(i,j,6), SS(i,j,7)] = fitScenario2(SS(i,j,1), SS(i,j,2));
    end
end

% decompose into subsystems and perform initialization
numsubsys = numservers-1;
A = zeros(numsubsys,10); % characteristics of arrival servers
D = zeros(numsubsys,10); % characteristics of departure servers
for i = 1:numsubsys
    A(i,1) = SS(1,i,1);
    A(i,2) = SS(1,i,2);
    A(i,8) = SS(1,i,8);
    A(i,9) = SS(1,i,9);
    A(i,10) = SS(1,i,10);
    D(i,1) = SS(1,i+1,1);
    D(i,2) = SS(1,i+1,2);
    D(i,8) = SS(1,i+1,8);
    D(i,9) = SS(1,i+1,9);
    D(i,10) = SS(1,i+1,10);
end

% A(i, 1) is the mean of Z of machine Ai, A(i, 2) is the variance of Z of
% machine Ai, A(i, 3) is the 'type' of machine Ai (0 = Erlang_{k,k-1}, 1 =
% H-Exp.), A(i, 4) is the number of phases (k) of machine Ai, A(i, 5) is p
% of machine Ai, A(i, 6) is the first rate of machine Ai and, if 'type' =
% H-Exp., A(i, 7) is the second rate of machine Ai. A(i,8) is the offset of
% machine Ai (C), while A(i,9) and A(i,10) are the mean and variance of Ai
% respectively. The same holds for D(i, ...).

% initialize storage for calculation of statistics
s_A = zeros(period, numsubsys, 10);
s_D = zeros(period, numsubsys, 10);
s_D_real = zeros(numsubsys-1, 10);
s_idx = 1;

% preallocate result vectors
ldat_z = zeros(numiterations, 1);
ldat_tp = zeros(numsubsys, numiterations);
avgdelay = zeros(numiterations, 1);
avgtp = zeros(numiterations, 1);


% Store RA_i and fi for each subsystem in each iteration.
s_RA = zeros(numsubsys-1,10);
s_fi = zeros(1,numsubsys-1);
sidx = 1; % service time index
for z = 1:numiterations
    % update service times of A1 and DN periodically
    A(1,1) = SS(sidx,1,1);
    A(1,2) = SS(sidx,1,2);
    A(1,8) = SS(sidx,1,8);
    A(1,9) = SS(sidx,1,9);
    A(1,10) = SS(sidx,1,10);
    D(numsubsys,1) = SS(sidx,numservers,1);





    % Fit a phase-type dist. on all subsystems and save the
    % number of phases
    for i = 1:numsubsys
        % account for offsets
        A(i,1) = A(i,9) - A(i,8);
        am2 = A(i,10) + A(i,9)*A(i,9);
        A(i,2) = am2 + A(i,8)*A(i,8) - 2*A(i,8)*A(i,9);
        D(i,1) = D(i,9) - D(i,8);
        dm2 = D(i,10) + D(i,9)*D(i,9);
        D(i,2) = dm2 + D(i,8)*D(i,8) - 2*D(i,8)*D(i,9);
        % fit stochastic part (Z)
        [A(i, 1), A(i, 2), A(i, 3), A(i, 4), A(i, 5), A(i, 6), ...
            A(i, 7)] = fitScenario2(A(i, 1), A(i, 2));
        [D(i, 1), D(i, 2), D(i, 3), D(i, 4), D(i, 5), D(i, 6), ...
            D(i, 7)] = fitScenario2(D(i, 1), D(i, 2));
    end
    s_A(s_idx,1,:) = A(1,:); % update A(1)
    s_D(s_idx,numsubsys,:) = D(numsubsys,:); % update D(N)
    for i = 1:numsubsys-1
        % shorthands
        mean_a_z = A(i, 1);
        var_a_z = A(i, 2);
        type_a = A(i, 3);
        rate_a_1 = A(i, 6);
        rate_a_2 = A(i, 7);
        type_d = D(i, 3);
        rate_d_1 = D(i, 6);
        rate_d_2 = D(i, 7);
        k_a = A(i, 4);
        k_d = D(i, 4);
        p_a = A(i, 5);
        p_d = D(i, 5);

% determine fi

fi = evalFi(type_a, p_a, k_a, rate_a_1, rate_a_2, ...

type_d, p_d, k_d, rate_d_1, rate_d_2);

s_fi(i) = fi;

% determine characteristics of RAi

[s_RA(i, 1), s_RA(i, 2)] = evalRAi(mean_a_z, var_a_z, type_a,...

p_a, k_a, rate_a_1, rate_a_2, type_d, p_d, k_d, rate_d_1, ...

rate_d_2);

% determine characteristics of A(i+1, :)

[m1_S, m2_S] = meanvar2moments(SS(sidx,i+1,9), SS(sidx,i+1,10));

[m1_RA, m2_RA] = meanvar2moments(s_RA(i,1), s_RA(i,2));

A_next_m1 = fi * m1_S + (1-fi) * (m1_S + m1_RA);

A_next_m2 = fi * m2_S + (1-fi) * (m2_S + 2*m1_S*m1_RA + m2_RA);

[A(i+1,9), A(i+1,10)] = moments2meanvar(A_next_m1, A_next_m2);

A(i,8) = SS(sidx,i,8);

% store the value of A to later calculate statistics

s_A(s_idx,i+1,:) = A(i+1,:);

end

% STEP 3: The departure process

% Fit a phase-type dist. on all stored RA_i

for i = 1:numsubsys-1
[s_RA(i, 1), s_RA(i, 2), s_RA(i, 3), s_RA(i, 4), s_RA(i, 5), ...

s_RA(i, 6), s_RA(i, 7)] = fitScenario2(s_RA(i, 1), s_RA(i, 2));

end

for i = numsubsys-1:-1:1

% fit distribution on D_{i+1}

D(i+1,1) = D(i+1,9) - D(i+1,8);

dm2 = D(i+1,10) + D(i+1,9)*D(i+1,9);

D(i+1,2) = dm2 + D(i+1,8)*D(i+1,8) - 2*D(i+1,8)*D(i+1,9);

[D(i+1, 1), D(i+1, 2), D(i+1, 3), D(i+1, 4), D(i+1, 5), ...

D(i+1, 6), D(i+1, 7)] = fitScenario2(D(i+1, 1), D(i+1, 2));

% shorthands

mean_s_z = SS(sidx,i+1,1);

var_s_z = SS(sidx,i+1,2);

mean_s = SS(sidx,i+1,9);

var_s = SS(sidx,i+1,10);

offset_s = SS(sidx,i+1,8);

mean_d_next_z = D(i+1, 1);

var_d_next_z = D(i+1, 2);

offset_d_next = D(i+1, 8);

type_d_next = D(i+1, 3);

p_d_next = D(i+1, 5);

rate_d_next_1 = D(i+1, 6);

rate_d_next_2 = D(i+1, 7);

type_s = SS(sidx,i+1,3);

p_s = SS(sidx,i+1,5);

rate_s_1 = SS(sidx,i+1,6);

rate_s_2 = SS(sidx,i+1,7);

mean_ra_i = s_RA(i, 1);

var_ra_i = s_RA(i, 2);

type_ra_i = s_RA(i, 3);

p_ra_i = s_RA(i, 5);

rate_ra_i_1 = s_RA(i, 6);

rate_ra_i_2 = s_RA(i, 7);

k_d_next = D(i+1, 4);

k_s = SS(sidx,i+1,4);

k_ra_i = s_RA(i, 4);

fi = s_fi(i);

% determine mean and variance of D_i_ne (max(S_{i+1}, D_{i+1}))

[d_ne_m1, d_ne_m2] = evalMaxO(mean_s_z, var_s_z, offset_s, ...

type_s, p_s, k_s, rate_s_1, rate_s_2, mean_d_next_z, ...

var_d_next_z, offset_d_next, type_d_next, p_d_next, ...

k_d_next, rate_d_next_1, rate_d_next_2);

% determine mean and variance of RD_{i+1} (similar to RAi)

[rd_next_mean_z, rd_next_var_z] = evalRAi(mean_ra_i, var_ra_i, ...

type_ra_i, p_ra_i, k_ra_i, rate_ra_i_1, rate_ra_i_2, ...

type_d_next, p_d_next, k_d_next, rate_d_next_1, rate_d_next_2);

% determine pi (similar to fi)

pi = evalFi(type_ra_i, p_ra_i, k_ra_i, rate_ra_i_1, rate_ra_i_2,...

type_d_next, p_d_next, k_d_next, rate_d_next_1,...

rate_d_next_2);

% determine mean and variance of D_i_e

% first fit a distribution on RD_{i+1} and determine the maximum of

% RD_{i+1} and S_{i}

[~, ~, rd_next_type, rd_next_k, rd_next_p, rd_next_mu1, ...

rd_next_mu2] = fitScenario2(rd_next_mean_z, rd_next_var_z);

[d_e_blocked_m1, d_e_blocked_m2] = evalMaxO(mean_s_z, var_s_z, ...

offset_s, type_s, p_s, k_s, rate_s_1, rate_s_2, ...

rd_next_mean_z, rd_next_var_z, offset_d_next, rd_next_type, ...

rd_next_p, rd_next_k, rd_next_mu1, rd_next_mu2);

% now determine the mean and variance of D_i_e

[s_m1, s_m2] = meanvar2moments(mean_s, var_s);

d_e_m1 = pi * d_e_blocked_m1 + (1-pi) * s_m1;

d_e_m2 = pi * d_e_blocked_m2 + (1-pi) * s_m2;

% determine the mean and variance of D_{i}

d_m1 = fi * d_ne_m1 + (1-fi) * d_e_m1;

d_m2 = fi * d_ne_m2 + (1-fi) * d_e_m2;

[D(i, 9), D(i, 10)] = moments2meanvar(d_m1, d_m2);

D(i,8) = SS(sidx,i+1,8);

% store the value of D to later calculate statistics

s_D(s_idx,i,:) = D(i,:);

% update the last-known value for real products processed by D_i

if(mean_s ~= 0)

% assuming a mean service time of 0 is associated with dummy

% products

s_D_real(i,:) = D(i,:);

end

end

s_idx = s_idx+1; % increment periodic storage counter

if s_idx == period+1

s_idx = 1;

end

% account for offsets and refit the stored real departure processes

for i = 1:numsubsys-1
s_D_real(i,1) = s_D_real(i,9) - s_D_real(i,8);

sdrm2 = s_D_real(i,10) + s_D_real(i,9)*s_D_real(i,9);

s_D_real(i,2) = sdrm2 + s_D_real(i,8)*s_D_real(i,8) - ...

2*s_D_real(i,8)*s_D_real(i,9);

[s_D_real(i,1), s_D_real(i,2), s_D_real(i,3), s_D_real(i,4), ...

s_D_real(i,5), s_D_real(i,6), s_D_real(i,7)] = ...

fitScenario2(s_D_real(i,1), s_D_real(i,2));

end

for i = 1:period

for j = 1:numsubsys


s_A(i,j,1) = s_A(i,j,9) - s_A(i,j,8);

sam2 = s_A(i,j,10) + s_A(i,j,9)*s_A(i,j,9);

s_A(i,j,2) = sam2 + s_A(i,j,8)*s_A(i,j,8) - ...

2*s_A(i,j,8)*s_A(i,j,9);

s_D(i,j,1) = s_D(i,j,9) - s_D(i,j,8);

sdm2 = s_D(i,j,10) + s_D(i,j,9)*s_D(i,j,9);

s_D(i,j,2) = sdm2 + s_D(i,j,8)*s_D(i,j,8) - ...

2*s_D(i,j,8)*s_D(i,j,9);

[s_A(i,j,1), s_A(i,j,2), s_A(i,j,3), s_A(i,j,4), s_A(i,j,5),...

s_A(i,j,6), s_A(i,j,7)] = ...

fitScenario2(s_A(i,j,1), s_A(i,j,2));

[s_D(i,j,1), s_D(i,j,2), s_D(i,j,3), s_D(i,j,4), s_D(i,j,5),...

s_D(i,j,6), s_D(i,j,7)] = ...

fitScenario2(s_D(i,j,1), s_D(i,j,2));

end

end

% create plot data

if z > 0

ldat_z(z) = z;

end

% save difference in throughput

if z > period-1

sum_d = 0;

for i=1:numsubsys-1

sum_d = sum_d + s_D_real(i, 9);

end

dep_1_real = evalMaxO(A(1,1), A(1,2), A(1, 8), A(1,3), A(1,5), ...

A(1,4), A(1,6), A(1,7), s_D_real(1,1), s_D_real(1,2), ...

s_D_real(1,8), s_D_real(1,3), s_D_real(1,5), s_D_real(1,4), ...

s_D_real(1,6), s_D_real(1,7));

avgdelay(z) = sum_d + D(numsubsys, 1) + dep_1_real;

end

% save throughput and average throughput

ldat_tp(1, 1) = 0;

if z > period-1

for i=1:numsubsys

periodlength = 0;

for j=1:period

periodlength = periodlength + evalMaxO(s_A(j,i,1), ...

s_A(j,i,2), s_A(j,i,8), s_A(j,i,3), s_A(j,i,5), ...

s_A(j,i,4), s_A(j,i,6), s_A(j,i,7), s_D(j,i,1), ...

s_D(j,i,2), s_D(j,i,8), s_D(j,i,3), s_D(j,i,5), ...

s_D(j,i,4), s_D(j,i,6), s_D(j,i,7));

end

ldat_tp(i,z) = 1/periodlength;

end

end

if z > period-1

tpsum = 0;

for i=1:numsubsys

tpsum = tpsum + ldat_tp(i,z);

end

avgtp(z) = tpsum / numsubsys;

end

% update service time index

sidx = mod(z, period) + 1;

end

end

EvalFi

function [ fi ] = evalFi( type_a, p_a, k_a, rate_1_a, rate_2_a, ...

type_b, p_b, k_b, rate_1_b, rate_2_b )

% initialize storage for the probabilities

p = zeros(1, 4);

% correct for infinite rates (i.e. service times of 0)

if rate_1_a == Inf && rate_1_b == Inf

fi = 1; % both rates infinite

return;

elseif rate_1_a == Inf

fi = 1; % service time of a is always the lowest

return;

elseif rate_1_b == Inf

fi = 0; % service time of b is always the lowest

return;

end

% condition the variables based on their distribution, yields four cases

[cond(1), cond(2), cond(3), cond(4)] = conditionVars(type_a, p_a, k_a, ...

rate_1_a, rate_2_a, type_b, p_b, k_b, rate_1_b, rate_2_b);

% evaluate probability fi in all cases and sanitize the output

for i = 1:4

p(i) = evalSumQ(cond(i).k1-1, cond(i).k2, cond(i).rate1,cond(i).rate2);

if ~isfinite(p(i))

p(i) = 0;

end

end

% calculate total probability

fi = 1 - (cond(1).p*p(1)+cond(2).p*p(2)+cond(3).p*p(3)+cond(4).p*p(4));

end

EvalMaxErlang

function [ m1, m2 ] = evalMaxErlang( k_1, k_2, rate_1, rate_2 )

sum_a_mean = 0;

sum_a_var = 0;

q_term_c = (rate_1 / (rate_1+rate_2))^k_1; % term irrespective of j

for j=0:k_2-1

q_term_a = nchoosek(k_1-1+j, k_1-1);

q_term_b = (rate_2 / (rate_1+rate_2))^j;

q_term = q_term_a*q_term_b*q_term_c;

m_term_mean = (k_1 + j)/(rate_1 + rate_2) + (k_2 - j)/rate_2;

m_term_var = (k_1+j)/((rate_1+rate_2)^2) + (k_2-j)/(rate_2^2) + ...

((k_1+j)/(rate_1+rate_2) + (k_2-j)/rate_2)^2;

sum_a_mean = sum_a_mean + q_term * m_term_mean;

sum_a_var = sum_a_var + q_term * m_term_var;

end

sum_b_mean = 0;

sum_b_var = 0;

q_term_c = (rate_2 / (rate_1+rate_2))^k_2; % term irrespective of i

for i=0:k_1-1

q_term_a = nchoosek(k_2-1+i, k_2-1);

q_term_b = (rate_1 / (rate_1+rate_2))^i;

q_term = q_term_a*q_term_b*q_term_c;

m_term_mean = (k_2 + i)/(rate_1 + rate_2) + (k_1 - i)/rate_1;

m_term_var = (k_2+i)/((rate_1+rate_2)^2) + (k_1-i)/(rate_1^2) + ...

((k_2+i)/(rate_1+rate_2) + (k_1-i)/rate_1)^2;

sum_b_mean = sum_b_mean + q_term * m_term_mean;

sum_b_var = sum_b_var + q_term * m_term_var;

end

m1 = sum_a_mean + sum_b_mean;

m2 = sum_a_var + sum_b_var;

end
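
As a sanity check on the conditioning used in evalMaxErlang, the sketch below repeats the same two sums inline for the special case k_1 = k_2 = 1 and compares the result with the closed-form moments of the maximum of two exponentials. The rates 2 and 3 are arbitrary illustrative values, not taken from the model, and the variable names are chosen for this example only.

```matlab
% Sketch: first two moments of max(X, Y) with X ~ Erlang(k1, r1) and
% Y ~ Erlang(k2, r2), using the same conditioning as evalMaxErlang.
% Parameter values are illustrative assumptions.
k1 = 1; k2 = 1; r1 = 2; r2 = 3;
r = r1 + r2;
m1 = 0; m2 = 0;

% X finishes first, while Y has completed j of its k2 phases
for j = 0:k2-1
    q  = nchoosek(k1-1+j, k1-1) * (r1/r)^k1 * (r2/r)^j;
    mu = (k1+j)/r + (k2-j)/r2;        % conditional mean
    s2 = (k1+j)/r^2 + (k2-j)/r2^2;    % conditional variance
    m1 = m1 + q*mu;
    m2 = m2 + q*(s2 + mu^2);          % conditional second moment
end

% Y finishes first, while X has completed i of its k1 phases
for i = 0:k1-1
    q  = nchoosek(k2-1+i, k2-1) * (r2/r)^k2 * (r1/r)^i;
    mu = (k2+i)/r + (k1-i)/r1;
    s2 = (k2+i)/r^2 + (k1-i)/r1^2;
    m1 = m1 + q*mu;
    m2 = m2 + q*(s2 + mu^2);
end

% Closed form for two exponentials: max = X + Y - min, min ~ Exp(r1+r2)
m1_exact = 1/r1 + 1/r2 - 1/r;
m2_exact = 2/r1^2 + 2/r2^2 - 2/r^2;

fprintf('m1 = %.6f (exact %.6f), m2 = %.6f (exact %.6f)\n', ...
    m1, m1_exact, m2, m2_exact);
```

Note that the second accumulated quantity is a second moment (conditional variance plus squared conditional mean), which matches how evalMaxErlang builds m_term_var and returns m2.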

EvalMaxO

function [ m1, m2 ] = evalMaxO( mean_a_z, var_a_z, offset_a, type_a, ...

p_a, k_a, rate_1_a, rate_2_a, mean_b_z, var_b_z, offset_b, type_b, ...

p_b, k_b, rate_1_b, rate_2_b)

% initialize storage for characteristics

m = zeros(1, 4);

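v = zeros(1, 4);

% correct for infinite rates (i.e. service times of 0)

if rate_1_a == Inf && rate_1_b == Inf

m1 = 0; % both rates infinite

m2 = 0;

return;

elseif rate_1_a == Inf

m1 = mean_b_z;

m2 = var_b_z + (m1*m1); % a is always 0, so b determines max

return;

elseif rate_1_b == Inf

m1 = mean_a_z;

m2 = var_a_z + (m1*m1); % b is always 0, so a determines max

return;

end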




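% condition the variables based on their distribution, yields four cases

[cond(1), cond(2), cond(3), cond(4)] = conditionVars(type_a, p_a, k_a, ...

rate_1_a, rate_2_a, type_b, p_b, k_b, rate_1_b, rate_2_b);

% evaluate the maximum in all cases and sanitize the output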

for i = 1:4

[m(i), v(i)] = evalMaxErlang(cond(i).k1, cond(i).k2, ...

cond(i).rate1, cond(i).rate2);

if ~isfinite(m(i))

m(i) = 0;

end

if ~isfinite(v(i))

v(i) = 0;

end

end

% output characteristics

zm1 = cond(1).p*m(1) + cond(2).p*m(2) + ...

cond(3).p*m(3) + cond(4).p*m(4);

zm2 = cond(1).p*v(1) + cond(2).p*v(2) + ...

cond(3).p*v(3) + cond(4).p*v(4);

m1 = offset_a + zm1;

m2 = offset_a*offset_a + 2*offset_a*zm1 + zm2;

end

EvalQ

function [ val ] = evalQ( j, k_d, rate_a, rate_d )

q_term_a = nchoosek(k_d-1+j, k_d-1);

q_term_b = (rate_a / (rate_a+rate_d))^j;

q_term_c = (rate_d / (rate_a+rate_d))^k_d;

val = q_term_a * q_term_b * q_term_c;

end

EvalRAi

function [ mean, var ] = evalRAi( mean_a, var_a, type_a, p_a, k_a, ...

rate_1_a, rate_2_a, type_b, p_b, k_b, rate_1_b, rate_2_b )

% initialize storage for characteristics

m = zeros(1, 4);

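v = zeros(1, 4);

% correct for infinite rates (i.e. service times of 0)

if rate_1_a == Inf && rate_1_b == Inf

mean = 0;

var = 0; % both rates infinite

return;

elseif rate_1_a == Inf

mean = 0;

var = 0; % arrival server is always 0, so no starvation

return;

elseif rate_1_b == Inf

mean = mean_a;

var = var_a; % dep. server is always 0, so arrival server

return; % determines starvation

end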




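% condition the variables based on their distribution, yields four cases

[cond(1), cond(2), cond(3), cond(4)] = conditionVars(type_a, p_a, k_a, ...

rate_1_a, rate_2_a, type_b, p_b, k_b, rate_1_b, rate_2_b);

% evaluate RA in all cases and sanitize the output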

for i = 1:4

[m(i), v(i)] = evalRAsub(cond(i).k1, cond(i).k2, ...

cond(i).rate1, cond(i).rate2);

if ~isfinite(m(i))

m(i) = 0;

end

if ~isfinite(v(i))

v(i) = 0;

end

end

% output characteristics

mean = cond(1).p*m(1) + cond(2).p*m(2) + ...

cond(3).p*m(3) + cond(4).p*m(4);

m2 = cond(1).p*v(1) + cond(2).p*v(2) + ...

cond(3).p*v(3) + cond(4).p*v(4);

var = m2 - (mean*mean);

end

EvalRAsub

function [ m, v ] = evalRAsub( k_a, k_d, rate_a, rate_d )

ra_i_num = 0;

ra_i_den = 0;

ra_i_var_num = 0;

% determine the first two moments

for j = 0:k_a-1

term_q = evalQ(j, k_d, rate_a, rate_d);

term_m = (k_a-j)/rate_a;

term_var = ((k_a-j)*(k_a-j+1))/(rate_a*rate_a);

ra_i_num = ra_i_num + term_q * term_m;

ra_i_den = ra_i_den + term_q;

ra_i_var_num = ra_i_var_num + term_q * term_var;

end

m = ra_i_num / ra_i_den;

v = ra_i_var_num / ra_i_den;

end
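
To illustrate what evalRAsub computes, the following sketch evaluates the same weighted sums inline for a small case. It assumes an arrival process with k_a = 2 phases and a departure process with a single phase; all values are illustrative. With k_d = 1 the normalising constant can be checked independently, because the probability that the arrival finishes first is then (rate_a/(rate_a+rate_d))^k_a.

```matlab
% Sketch: moments of the residual arrival time RA, given that the
% departure (Erlang(k_d, rd)) finished before the arrival (Erlang(k_a, ra)).
% Parameter values are illustrative assumptions.
k_a = 2; k_d = 1; ra = 4; rd = 1;

num1 = 0; num2 = 0; den = 0;
for j = 0:k_a-1
    % probability that the departure completes while the arrival has
    % finished exactly j phases (same expression as evalQ)
    q = nchoosek(k_d-1+j, k_d-1) * (ra/(ra+rd))^j * (rd/(ra+rd))^k_d;
    n = k_a - j;                        % remaining arrival phases
    num1 = num1 + q * n/ra;             % Erlang(n, ra) mean
    num2 = num2 + q * n*(n+1)/(ra*ra);  % Erlang(n, ra) second moment
    den  = den + q;
end
ra_m1  = num1/den;
ra_m2  = num2/den;
ra_var = ra_m2 - ra_m1^2;               % conversion done in evalRAi

% Independent check for k_d = 1: P(departure first) = 1 - P(arrival first)
den_exact = 1 - (ra/(ra+rd))^k_a;

fprintf('mean = %.6f, var = %.6f, P(dep. first) = %.4f\n', ...
    ra_m1, ra_var, den);
```

The residual mean lies between 1/rate_a (one phase left) and k_a/rate_a (all phases left), as expected for a mixture over the number of remaining phases.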

evalSumQ

function [ val ] = evalSumQ( upperbound, k_d, rate_a, rate_d )

val = 0;

q_term_c = (rate_d / (rate_a+rate_d))^k_d; % term irrespective of j

for j=0:upperbound

q_term_a = nchoosek(k_d-1+j, k_d-1);

q_term_b = (rate_a / (rate_a+rate_d))^j;

val = val + q_term_a * q_term_b * q_term_c;

end

end
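
Called as evalSumQ(k_a-1, k_d, rate_a, rate_d), the function above sums the probabilities that an Erlang(k_d, rate_d) variable completes while a competing Erlang(k_a, rate_a) variable has finished j = 0, ..., k_a-1 phases, i.e. the probability that the former finishes first. The minimal sketch below uses a hypothetical anonymous helper sumQ that mirrors the loop, with illustrative parameter values, and checks two cases whose answers are known in advance.

```matlab
% Hypothetical helper mirroring evalSumQ: P(D finishes before A) for
% A ~ Erlang(ub+1, ra) and D ~ Erlang(k_d, rd).
sumQ = @(ub, k_d, ra, rd) (rd/(ra+rd))^k_d * ...
    sum(arrayfun(@(j) nchoosek(k_d-1+j, k_d-1) * (ra/(ra+rd))^j, 0:ub));

% Two exponentials (one phase each): P(D < A) = rd/(ra+rd) = 0.6
p_exp = sumQ(0, 1, 2, 3);

% Two identically distributed Erlang-3 variables: 0.5 by symmetry
p_sym = sumQ(2, 3, 1.5, 1.5);

% evalFi takes the complement of (a mixture of) such sums
fi_exp = 1 - p_exp;

fprintf('p_exp = %.4f, p_sym = %.4f, fi_exp = %.4f\n', ...
    p_exp, p_sym, fi_exp);
```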

fitScenario2

function [ mean, variance, type, k, p, mu1, mu2 ] = fitScenario2( m, v )

% determine the SCV

if v == 0 || m == 0

scv = 0.1;

else

scv = v/(m*m);

end

if scv <= 1

% fit an Erlang_{k-1,k} distribution

type = 0;

k = ceil(1/scv);

p = (1/(1+scv)) * (k*scv - sqrt(k*(1+scv) - k*k*scv));

mu1 = (k-p)/m;

mu2 = 0;

% return parameters of distribution

mean = m;

variance = v;

else

% fit a Hyper-exponential distribution

type = 1;

mu1 = 2/m;

a = 1/(2*scv);

mu2 = a*mu1;

p = 1 - a/(1-(mu2/mu1));

mean = m;

variance = v;

k = 1; % each branch always consists of one phase

end

end
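
Both branches of fitScenario2 are two-moment fits, which can be verified by recomputing the moments of the fitted distribution. The sketch below does this for one illustrative input per branch. It assumes that p is the probability of using k-1 phases in the Erlang_{k-1,k} case and the probability of the rate mu1 branch in the hyper-exponential case (the latter is consistent with selectRate). The input values are not taken from the model.

```matlab
% Erlang_{k-1,k} branch (scv <= 1): with probability p use k-1 phases,
% otherwise k phases, all with rate mu.
m = 2; v = 1.2;                       % illustrative input, scv = 0.3
scv = v/(m*m);
k   = ceil(1/scv);
p   = (1/(1+scv)) * (k*scv - sqrt(k*(1+scv) - k*k*scv));
mu  = (k-p)/m;
erl_m1 = (p*(k-1) + (1-p)*k) / mu;              % fitted mean
erl_m2 = (p*(k-1)*k + (1-p)*k*(k+1)) / mu^2;    % fitted second moment

% Hyper-exponential branch (scv > 1): rate mu1 with probability ph,
% rate mu2 with probability 1-ph.
mh = 2; vh = 12;                      % illustrative input, scv = 3
scvh = vh/(mh*mh);
mu1  = 2/mh;
a    = 1/(2*scvh);
mu2  = a*mu1;
ph   = 1 - a/(1-(mu2/mu1));
hyp_m1 = ph/mu1 + (1-ph)/mu2;                   % fitted mean
hyp_m2 = 2*ph/mu1^2 + 2*(1-ph)/mu2^2;           % fitted second moment

fprintf('Erlang: mean %.4f, var %.4f\n', erl_m1, erl_m2 - erl_m1^2);
fprintf('H-Exp.: mean %.4f, var %.4f\n', hyp_m1, hyp_m2 - hyp_m1^2);
```

In both cases the recomputed mean and variance reproduce the inputs m and v, so the fitted parameters can be used interchangeably with the original first two moments.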

meanvar2moments

function [ m1, m2 ] = meanvar2moments( mean, var )

m1 = mean;

m2 = var + (mean*mean);

end

moments2meanvar

function [ mean, var ] = moments2meanvar( m1, m2 )

mean = m1;

var = m2 - (m1*m1);

end

selectRate

function [ rate ] = selectRate( type, p, mu1, mu2 )

if type == 0

rate = mu1;

else

rate = 1 / (p/mu1 + (1-p)/mu2);

end
