inference of isoforms from short sequence reads (extended abstract)

24
Inference of Isoforms from Short Sequence Reads (Extended Abstract) Jianxing Feng 1 , Wei Li 2 , and Tao Jiang 3 1 Comp. Sci. Dept., Tsinghua Univ., Beijing, China [email protected] 2 Comp. Sci. Dept., Univ. of California, Riverside, CA [email protected] 3 Comp. Sci. Dept., Univ. of California, Riverside, CA, and Tsinghua Univ., Beijing [email protected] Abstract. Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult prob- lem. Traditional experimental methods for this purpose are time con- suming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the ad- vantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from mil- lions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calcu- late the expression levels of isoforms and infer isoforms from short RNA- Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known and infer novel isoforms from scratch. Our exper- imental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer is able to calculate the expres- sion levels of isoforms with an accuracy comparable to the state-of-the-art statistical method and a 60 times faster speed. Moreover, our tests on both simulated and real reads show that it achieves a good precision and sensitivity in inferring isoforms when given accurate exon-intron bound- ary, TSS and PAS information, especially for isoforms whose expression levels are significantly high. 1 Introduction Transcriptome study (or transcriptomics) aims to discover all the transcripts and their quantities in a cell or an organism under different external environmental conditions. A large amount of work has been devoted to transcriptomics, which includes the international projects EST [1, 2], FANTOM [3], and ENCODE [4, 5]. Many technologies have been introduced in recent years including array-based experimental methods such as tiling arrays [6], exon arrays [7], and exon-junction

Upload: lethuan

Post on 02-Jan-2017

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

Inference of Isoforms from Short SequenceReads (Extended Abstract)

Jianxing Feng1, Wei Li2, and Tao Jiang3

1 Comp. Sci. Dept., Tsinghua Univ., Beijing, [email protected]

2 Comp. Sci. Dept., Univ. of California, Riverside, [email protected]

3 Comp. Sci. Dept., Univ. of California, Riverside, CA, and Tsinghua Univ., [email protected]

Abstract. Due to alternative splicing events in eukaryotic species, theidentification of mRNA isoforms (or splicing variants) is a difficult prob-lem. Traditional experimental methods for this purpose are time con-suming and cost ineffective. The emerging RNA-Seq technology providesa possible effective method to address this problem. Although the ad-vantages of RNA-Seq over traditional methods in transcriptome analysishave been confirmed by many studies, the inference of isoforms from mil-lions of short sequence reads (e.g., Illumina/Solexa reads) has remainedcomputationally challenging. In this work, we propose a method to calcu-late the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) andpoly-A site (PAS) information. We first formulate the relationship amongexons, isoforms, and single-end reads as a convex quadratic program, andthen use an efficient algorithm (called IsoInfer) to search for isoforms.IsoInfer can calculate the expression levels of isoforms accurately if allthe isoforms are known and infer novel isoforms from scratch. Our exper-imental tests on known mouse isoforms with both simulated expressionlevels and reads demonstrate that IsoInfer is able to calculate the expres-sion levels of isoforms with an accuracy comparable to the state-of-the-artstatistical method and a 60 times faster speed. Moreover, our tests onboth simulated and real reads show that it achieves a good precision andsensitivity in inferring isoforms when given accurate exon-intron bound-ary, TSS and PAS information, especially for isoforms whose expressionlevels are significantly high.

1 Introduction

Transcriptome study (or transcriptomics) aims to discover all the transcripts andtheir quantities in a cell or an organism under different external environmentalconditions. A large amount of work has been devoted to transcriptomics, whichincludes the international projects EST [1, 2], FANTOM [3], and ENCODE [4,5]. Many technologies have been introduced in recent years including array-basedexperimental methods such as tiling arrays [6], exon arrays [7], and exon-junction

Page 2: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

arrays [8, 9], and tag-based approaches such as MPSS [10, 11], SAGE [12, 13],CAGE [14, 15], PMAGE [16], and GIS [17]. However, due to various constraintsintrinsic to these technologies, the speed of advance in transcriptomics is farfrom being satisfactory, especially on eukaryotic species because of widespreadalternative splicing events.

Applying next generation sequencing technologies to transcriptomes, the re-cently developed RNA-Seq technology is quickly becoming an important tool infunctional genomics and transcriptomics. It can be used to identify all genes andexons and their boundaries [18, 19] and to study gene functions and perform tran-scriptome analysis [20]. For example, based on an unannotated genomic sequenceand millions of short reads from RNA-Seq, [21] developed a general method forthe discovery of a complete transcriptome, including the identification of codingregions, ends of transcripts, splice junctions, splice site variations, etc. Their ap-plication of the method to S.cerevisiae (yeast) showed a high degree of agreementwith the existing knowledge of the yeast transcriptome. Besides yeast [22, 18],RNA-Seq has been applied to the transcriptome analysis of mouse [23, 24] andhuman [25, 26]. These results demonstrate that RNA-Seq is a powerful quantita-tive method to sample a transcriptome deeply at an unprecedented resolution.Moreover, DNA sequencing technologies are under fast development. Some ofthem now could provide long reads, paired-end reads, DNA-strand-sequencing ofmRNA transcripts, etc. See [27] for a comprehensive analysis of the advantagesof RNA-Seq over traditional methods in genome-wide transcriptome analysis,and the challenges faced by this technology.

Very recently, several methods have been proposed to characterize the expres-sion level of each transcript [28, 29] using RNA-Seq data. In [28], the authorsshowed that short (single-end or paired-end) read sequences cannot theoreti-cally guarantee a unique solution to the transcriptome reconstruction problem(i.e., the reconstruction of all expressed isoforms and their expression levels) ingeneral even if the reads are sampled perfectly according to the length of eachtranscript (without random distortions and noise). However, under the sameassumption, the authors also showed that paired-end reads could help recon-struct the transcripts uniquely and determine their expression levels for mostof the currently known isoforms of human, and single-end reads could allow usto determine the expression levels correctly if all the isoforms are known. How-ever, these results are mostly of theoretical interest because sequence read dataare random in nature and may contain noise in practice. [29] presented a morepractical way to estimate the expression levels of known isoforms. The methoduses maximum likelihood estimation followed by importance sampling from theposterior distribution.

The availability of all the isoforms is the basis of the accurate estimation ofisoform expression levels [29], which could be used to infer all splicing variantsquantitatively and qualitatively. The variations in isoform expression levels andsplicing are important for many studies, e.g., the study of diseases [30, 31] anddrug development [32]. A lot of effort has been devoted to the identification oftranscripts/isoforms from the more traditional EST, cDNA data. Instead of a

2

Page 3: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

comprehensive review, we will just name a few results below. To enumerate allpossible isoforms, a core ingredient is the splicing graph [33, 34]. A predeterminedparameter “dimension” decides how many transcripts are compared simultane-ously. The parameter is usually fixed to two, but recently, [34] extended themethod to arbitrary dimensions. There are several methods that assemble tran-scripts from EST data using the splicing graph and its variations [35, 36]. Newlyproposed experimental methods in [37, 38] could be used to identify new iso-forms. However, it is still unclear whether these methods can be applied in alarge scale.

RNA-Seq has shown great success in transcriptome analysis, but it has notbeen used to infer isoforms. Although it is straightforward to infer the existenceof novel isoforms from RNA-Seq data that exhibit novel transcribed regions[24, 6], it is not so obvious how to use RNA-Seq data to infer the existence ofnovel isoforms in known transcribed regions, because the observed reads couldbe sampled from either known or unknown isoforms. The problem has remainedchallenging for two reasons. The first is that RNA-Seq reads are usually veryshort. The second is due to the randomness and biases of the reads sampled fromall the transcripts. In fact, to our best knowledge, there has been no publishedwork to computationally infer isoforms from (realistic) short RNA-Seq reads.

Due to the high combinatorial complexity of isoforms of genes with a (mod-erately) large number of exons, the inference of isoforms from short reads (andother available biological information) should be realistically divided into twosub-problems. The first is to discover all the exon-intron boundaries as well asthe transcription start site (TSS) and poly-A site (PAS) of each isoform. Asmentioned above, there are several effective methods for detecting exon-intronboundaries from RNA-Seq read data [18, 19]. The identification of TSS’s andPAS’s is an indispensable part of many large genomics projects [3, 4]. The tech-nology of GIS-PET (Gene Identification Signature Paired-End Tags) can alsobe used to identify TSS-PAS pairs [17, 39]. The second sub-problem is to findcombinations of exons that can properly explain the RNA-Seq data, given theexon-intron boundary and TSS-PAS pair information.

In this paper, we are concerned with the second sub-problem in isoform infer-ence. Assuming that the exon-intron boundary and TSS-PAS pair informationis given, we propose a method (called IsoInfer) to infer isoforms from shortRNA-Seq reads (e.g., Illumina/Solexa data). Although our method works forsingle-end data and data with both single-end and paired-end reads, we will usesingle-end reads as the primary source of data and paired-end reads as a sec-ondary data which can be used to filter out false positives. We formulate therelationship among exons, isoforms, and single-end reads as a convex quadraticprogram, and design an efficient algorithm to search for isoforms. Our methodcan calculate the expression levels of isoforms accurately if all the isoforms areknown. To demonstrate this, we have compared IsoInfer with the simple count-ing method in [40, 41] and the method in [29] on simulated expression levels andreads, and found that our method is much more accurate than the simple count-ing method and has a comparable accuracy as the method in [29] but is 60 times

3

Page 4: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

faster. Most importantly, IsoInfer can infer isoforms from scratch when they aresufficiently expressed, by trying to find a minimum set of isoforms to explain theread data. Our experimental tests on both simulated and real reads show thatit is possible to infer the precise combination of exons in a sufficiently expressedisoform from RNA-Seq short read data with a reasonably good accuracy, whenaccurate exon-intron boundary and TSS-PAS pair information is provided. Toour best knowledge, this is the first computational method to infer isoforms fromshort RNA-Seq reads. Due to the page limit, some proofs and tables are omittedin this extended abstract but can be found in the full paper [42]

2 Methods

2.1 Assumptions and terminology

Traditionally, only five types of alternating splicing (AS) events have been pro-posed, including exon skipping, mutually exclusive exons, intron retention, alter-native donor and acceptor sites [43]. However, these events are not adequate todescribe complex AS events as more experimental knowledge has become avail-able [44]. In this work, we describe isoforms or AS events in a much general way,which is referred to as a “bit matrix” in [44].

Fig. 1. Expressed segments. Every exon-intron boundary introduces a boundary ofsome segment. Every expressed segment is a part of an exon.

The exon-intron boundaries of a gene divide the gene into disjoint segments,as shown in Figure 1. A segment is expressed if it has mapped reads. Thus, everyexpressed isoform consists of a subset of expressed segments. Two segments areadjacent if they are adjacent in the reference genome (i.e., they share a commonboundary). For example, in Figure 1, s2 and s3 are adjacent but s1 and s2 arenot. Any two segments may form a segment junction which is not necessarily anexon junction in the traditional sense. For example, s2 and s3 form a segmentjunction, which is not an exon junction. In the following, “junction” refers to“segment junction” unless otherwise stated.

As stated in the introduction, we first assume that exon-intron boundaries areknown. Our second assumption is that the short reads are uniformly randomlysampled from all the expressed isoforms (i.e., mRNA transcripts). We have tofurther assume that the short reads have been mapped to the referenced genome.The mapping of RNA-Seq reads can be done by many recent tools, e.g., Bowtie[45], Maq [46], SOAP [47], RNA-MATE [48] and mrFAST [49]. The mapping ofmulti-reads (i.e., reads that match several locations of the reference genome) isaddressed in [24, 50]. We will use Bowtie in our work due to its efficiency and

4

Page 5: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

accuracy. The last assumption concerns paired-end reads, which will be statedin section 2.3.

2.2 Quadratic programming formulation

G denotes the set of all the genes. Each g gene defines a set of expressed segmentsSg = s1, s2, . . . , s|Sg| (given exon-intron boundaries), where the expressed seg-ments are sorted according to their positions in the reference genome. The junc-tions on this gene are all the pairs of expressed segments (si, sj), 1 ≤ i < j ≤ |Sg|.The length of segment si is li. Denote the set of all known isoforms of this geneas Fg. Each isoform f ∈ Fg consists of a subset of expressed segments. The ex-pression level (i.e., the number of reads per base) of isoform f is denoted by xf .The sum of the length of all transcripts, weighted by their expression levels, overall genes, is L0 = C ·

∑g∈G

∑e∈f,f∈Fg

lexf , for some constant C that defines thelinear relationship between the expression level and the number of transcriptscorresponding to an isoform. C can be inferred from data as shown in [24].

From now on, we will consider a fixed gene g and omit the subscript g whenthere is no ambiguity. Let M be the total number of single-end reads mapped tothe reference genome and di the number of reads falling into expressed segmentsi. Under the uniform sampling assumption, di is the observed value of a randomvariable (denoted as ri) that follows the binomial distribution B(M,pi), wherepi = Cyili/L0 and yi =

∑si∈f

xf . Because M is usually very large, pi is verysmall and Mpi is sufficiently large in most cases, the binomial distribution can beapproximated by a normal distribution N(µi, σ

2i ), with µi = Mpi, σ

2i = Mpi(1−

pi) ≈ Mpi = µi, similar to the approximation in [29]. Therefore, the randomvariables ri−µi

σi, for every expressed segment si, follow the same distribution

approximately. Define ǫi = |ri − µi|. Then, the variable ǫσi

also follows the samedistribution approximately for every si.

Let L1 denote the length of a single-end read. In order to map reads tojunctions, we will also think of each junction (si, sj) as a segment of length2L1 − 2, consisting of the last L1 − 1 bases of si and the first L1 − 1 basesof sj . Denote the set of the junctions as J = s|S|+1, s|S|+2, . . . , s|S|+|J|. Therelationship among the expressed segments of gene g, its expressed isoforms,and the single-end reads mapped to each expressed segment and junction canbe captured by the following quadratic program (QP):

min z =∑si∈S∪J

( ǫiσi)2

s.t.∑si∈f

xf li + ǫi = di, si ∈ S ∪ Jxf ≥ 0, f ∈ F

where σi is the standard deviation in the normal distribution N(µi, σ2i ) and will

be empirically estimated from di.Note that if each ri follows the normal distribution strictly, then the random

variables ǫiσi

is i.i.d. and thus the solution of the above QP would correspond tothe maximum likelihood estimation of the xf ’s if each σi is fixed [51], and the ob-jective function z is a random variable obeying the χ2 distribution with freedom

5

Page 6: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

|S|+ |J |. This QP can be easily shown to be a convex QP by a simple transfor-mation and solved in polynomial time by a public program QuadProg++ whichimplements the dual method of Goldfarb and Idnani [52] for convex quadraticprogramming. Since σi is unknown, we substitute

√di for σi as an approxi-

mation. Let QPsolver denote the above algorithm for solving the convex QPprogram. Given S, F, and di’s, QPsolver returns the values of xi’s and z.

When the isoforms in F are given, minimizing the objective function meansto find a combination of the expression level (xf ) of each isoform in F such thatthe observed values (di’s) can be explained the best. In this case, the value ofthe objective function serves as an indicator of whether the isoforms in F canexplain the observed data. More specifically, p-value(z) denotes the probabilityof P (Z ≥ z), where Z is a random variable following the χ2 distribution withfreedom |S| + |J |. We empirically choose a cutoff of 0.05. If p-value(z) is lessthan 0.05 we conclude that F cannot explain d.

2.3 Paired-end reads

Figure 2(left) illustrates some concepts concerning paired-end reads. A paired-end read consists of a pair of short (single-end) reads separated by a gap. Thefigure also defines the read length, span, start position, center position and endposition of a paired-end read. If the span of a paired-end read is a random variablefollowing some probability distribution h(x), then three possible strategies forgenerating paired-end reads will be considered in this paper.

– Strategy (a): The start position of a paired-end read is uniformly and randomlysampled from all the expressed isoforms. Then the span of this paired-end is gen-erated following the distribution h(x). If the end position of this paired-end readfalls out of the isoform, the paired-end read is truncated such that the end positionof this read is at the end of the isoform.

– Strategy (b): The center position of a paired-end read is uniformly and randomlysampled from all the expressed isoforms. Then the span of this paired-end is gener-ated following the distribution h(x). This strategy has been adopted in [53]. Again,if the start (or end) position of this paired-end read falls out of the isoform, thepaired-end read is truncated such that the start (or end, respectively) position ofthis read is at the start (or end, respectively) of the isoform.

– Strategy (c): The end position of a paired-end read is uniformly and randomly sam-pled from all the expressed isoforms. Then the span of this paired-end is generatedfollowing the distribution h(x). If the start position of this paired-end read fallsout of the isoform, the paired-end read is truncated such that the start position ofthis read is at the start of the isoform.

Let w1, w2, w3 be the lengths of three consecutive intervals on an isoform asshown in Figure 2(right). When any of the strategies (a-c) is applied to generatea certain number of paired-end reads, the following Theorem 1 gives a non-trivialupper bound on the probability of not observing any reads with start positionsin the first interval and end positions in the third interval.

6

Page 7: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

Fig. 2. Left: A paired-end read consisting of two short reads of length L2 that areseparated by a gap. Right: Three consecutive intervals on an isoform.

Theorem 1 Suppose that the expression level of this isoform is α RPKM (i.e.,reads per kilobase of exon model per million mapped reads [24]), and the span ofeach paired-end read follows some distribution h(x). If M paired-end reads aregenerated by any of the strategies (a-c), the probability that there are no paired-end reads that have start positions in the first interval and end positions in thethird interval is upper bounded by

PM,h,α(w1, w2, w3) = (1 − P0)M ≈ e−MP0

where P0 = 10−9α∑

0≤i<w1

∫ u(i)

l(i)h(x)dx, l(i) = w1 − i + w2, and u(i) = w1 −

i+ w2 + w3.

2.4 Valid isoforms

For a gene with expressed segments S = s1, s2, . . . , s|S|, an isoform f of thisgene can be expressed as a binary vector with length |S|. The ith element f [i]of f is 1 if and only if expressed segment si is contained in f . Denote the setof all possible binary vectors with n elements as B(n). Similarly, a single-end orpaired-end short read that is mapped to a subset S′ ⊆ S of expressed segmentscan be represented as a binary vector r ∈ B(|E|) such that r[i] = 1 if andonly if si ∈ E′. A subset E′ of expressed segments is supported by single-end orpaired-end reads if there is at least one single-end or paired-end read r such thatr[i] = 1, i ∈ E′.

Although single-end reads, paired-end reads, and TSS-PAS information datado not provide exact combinations of expressed segments of isoforms, they canbe used to eliminate many isoforms from consideration. Each of these types ofdata provides some information that can be used to define a condition whichwill be satisfied by all isoforms inferred by our algorithm (to be described in thenext subsection).

– Junction information. A junction (si, sj) is on an isoform f if f [i] = f [j] = 1 andf [k] = 0, i < k < j. If si and sj are adjacent, then junction (si, sj) is an adjacent

junction. An isoform satisfies condition I if all the non-adjacent junctions on thisisoform are supported by single-end short reads. In practice, most sufficiently ex-pressed isoforms satisfy this condition. For example, when 40 millions single-endreads with length 30bps are mapped, the probability of an isoform with expressionlevel 6 RPKM satisfying condition I is 99.3% and 92.8%, if this isoform contains10 and 100 exons, respectively. See Theorem 2 below for the details.

– Start-end segment pair information. For an isoform f , expressed segment si isthe start expressed segment of f if f [i] = 1 and f [j] = 0, 1 ≤ j < i. Expressedsegment si is the end expressed segment of f if f [i] = 1 and f [j] = 0, i < j ≤ |S|.

7

Page 8: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

The TSS-PAS pair information describes the start and end expressed segmentsof each isoform and will be referred to as the start-end segment pair data. Anisoform satisfies condition II if the start-end segment pair of this isoform appearsin the given set of start-end segment pairs. If the TSS-PAS pair information ismissing, then any expressed segment can theoretically be the start or end expressedsegment. However, in this case, many short (and thus unrealistic) isoforms couldbe introduced, which will make isoform inference difficult. Therefore, when theTSS-PAS pair information is missing, we allow an expressed segment si to be thestart (or end) expressed segment of any isoform if there is no expressed segment sj

with j < i (or i < j) such that junction (sj , si) (or (si, sj), respectively) is adjacentor supported by some read.

– Paired-end read data. A pair of expressed segments (si, sj), i < j on an isoformf is an informative pair if f [i] = f [j] = 1 and PM,h,α(li + L2 − 1, gi,j , lj + L2 −1) < 0.05, assuming that the span of a paired-end read follows some probabilitydistribution h(x), the expression level of this isoform is α RPKM and M paired-endreads have been mapped. Here, L2 is the read length of a paired-end read, gi,j =P

i<k<jlkf [k], and PM,h,α is defined in Theorem 1. According to the theorem, if

(si, sj) is informative, then the probability that there are no paired-end reads withstart positions in segment si and end positions in segment sj is less than 0.05.A triple of expressed segments (si, si+1, sj), i + 1 < j is an informative triple iff [i] = f [i + 1] = f [j] = 1 and PM,h,α(L2 − 1, gi,j , lj + L2 − 1) < 0.05. Similarly,(si, si+1, sj), j < i is an informative triple if PM,h,α(L2−1, gj,i+1, lj+L2−1) < 0.05.An isoform satisfies condition III if every informative pair or triple on this isoformis supported by paired-end reads. A larger α makes this condition more stringent.Because in many cases, two isoforms can only be distinguished by a pair or triple ofsegments, it is necessary to require that every informative pairs or triple (insteadof some of them) are supported by paired-end reads.

Note that while the junction information is always available given the single-end read data and exon-intron boundary information, the start-end segment pairinformation and paired-end read data are not necessarily always available. Wedefine an isoform as valid if it satisfies conditions I, II and/or III whenever thecorresponding types of data are provided. The following theorem gives a lowerbound on the probability that type I condition is satisfied by an isoform.

Theorem 2 Under the uniform sampling assumption, the probability that anisoform f consisting of t exons with expression level x RPKM satisfies type Icondition is at least (1−e−xL1M/109

)t−1, where e is the base of natural logarithm,M the number of single-end reads mapped, and L1 the length of single-end reads.

2.5 Isoform inference algorithm

We now describe our algorithm, IsoInfer, for inferring isoforms. The algorithmuses the following types of data: the reference genome, single-end short reads,exon-intron boundaries, TSS-PAS pairs, gene boundary information from thereference genome annotation, and paired-end short reads. The first three piecesof information (i.e., the reference genome, exon-intron boundaries and single-endshort reads) are required in the algorithm. If TSS-PAS pairs are not provided,

8

Page 9: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

gene boundaries would be required. The flow of data processing in IsoInfer isillustrated in Figure 3. The third step of the algorithm requires an external tool(e.g., Bowtie [45]) to map the short reads to the reference genome and junctionsequences. In the fifth step, any two segments that are adjacent or supportedby a junction read will be clustered together. Note that, such a cluster maycontain expressed segments from more than one gene or contain only a subsetof expressed segments from a single gene, but these cases do not happen veryoften. Furthermore, in each cluster, if there is a sequence of consecutive ex-pressed segments such that every internal segment has no non-adjacent junctionwith any other expressed segment other than its left or right neighbor in the se-quence, then we will combine the expressed segments into a single segment. Thiscompression will be important because it reduces the problem size drasticallyfor some isoforms containing a very large number of expressed segments. Thedetails of the clustering and compression step are straightforward and omitted.

Fig. 3. The flow of data processing in algorithm IsoInfer.

In the following, we give more details of the last step in IsoInfer, i.e., in-ferring isoforms. Each cluster of expressed segments defines an instance of theisoform inference problem. Denote such an instance as I(S,R, T, d), where S =s1, s2, . . . , s|S| is the set of expressed segments in the cluster, R the set of short(single-end and paired-end) reads mapped to the segments in the cluster, T theset of start-end segment pairs, and d a function such that d(i), si ∈ S, denotesthe number of single-end reads mapped to segment si and d(i, j), 1 ≤ i < j ≤ |S|,denotes the number of single-end reads mapped to junction (si, sj).

The inference procedure is summarized in Algorithm 1. It first enumeratesall the valid isoforms in step 1. However, for a cluster with a large number ofexpressed segments and isoforms, the number of valid isoforms could be toolarge to be enumerated efficiently even though conditions I, II and/or III couldbe used to filter out many invalid isoforms. Therefore, the algorithm enumeratesvalid isoforms with high expression levels first, where the expression level of anisoform is defined by the least number of single-end reads on any junction ofthe isoform. The enumeration terminates when a preset number (denoted as γ)of valid isoforms are enumerated. The parameter γ is used to avoid the rarecases that the number of valid isoforms is too large to be handled by subsequentsteps of IsoInfer. We set γ = 1000 by default based on our empirical knowledgeof the real data considered in section 4. For example, over 97.5%, 98.5%, and99% cases, the number of valid isoforms is no more than 1000 in the tests onmouse brain, liver and muscle tissues, respectively, when the exact boundaryand TSS-PAS information is extracted from the UCSC knownGene table. The

9

Page 10: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

impact of the omitted isoforms is minimized because highly expressed isoformsare enumerated first.

A short read r is validated by a set of isoforms if the set contains an isoformf such that f [i] = 1 when r[i] = 1. A start-end segment pair is validated by a setof isoforms if this pair is the start-end segment pair of some f in the set. A setof isoforms is a feasible solution of I(S,R, T, d) if every read in R and start-endsegment pair in T are validated by the set. Due to possible noise in sequencingand the incompleteness of the enumeration of valid isoforms in step 1, it mayhappen that some reads or start-end segment pairs are not supported by theset of isoforms F enumerated in step 1. Step 2 of the algorithm removes suchinvalidated reads and start-end segment pairs to make F feasible.

Algorithm 1 IsoformInference. Given an instance I(S,R, T, d), the algorithminfers a set of isoforms to explain the read data.

1: Among all segment junctions of an isoform f , denote m(f) as the minimum numberof single-end reads mapped to any of these junctions. Enumerate all the validisoforms f in the descending order of m(f) until a preset number (γ) of validisoforms is obtained. Denote the set of all the enumerated valid isoforms as F .

2: Remove all the short reads and start-end segment pairs that are not validated byF .

3: for 5 ≤ u ≤ β do

4: w(f)← 0 for f ∈ F .5: for 0 ≤ m ≤ |S| − u do

6: n← m + u.7: V (m,n) ← BestCombination(I(m,n)).8: For each v ∈ V (m,n), define G(v) = f |f ∈ F, f (m,n) = v and for each

f ∈ G(v), let w(f) = w(f) + 1/|G(v)|.9: end for

10: Sort F by w in increasing order.11: for f ∈ F do

12: if w(f) < 1 and F − f is a feasible solution of I then

13: F ← F − f.14: end if

15: end for

16: end for

17: w′(f)← 1/w(f) for f ∈ F .18: Solve the weighted set cover instance (U, C, w′), where U = R∪T, C = Sf |f ∈ F,

and r ∈ Sf if r is validated by f for r ∈ U for each f ∈ F by the branch-and-boundmethod implemented in GNU package GLPK. Return the set of the valid isoformscorresponding to the optimal solution of set cover.

To find a subset of valid isoforms to explain the data, a simple idea is to try allpossible combinations of the valid isoforms in F and find a minimum combinationthat can explain all the short reads, as done in procedure BestCombination (i.e.,Algorithm 2). The procedure BestCombination gradually increases the number

10

Page 11: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

of valid isoforms considered and enumerates all possible combinations of such anumber of isoforms until a preset condition is met.

Algorithm 2 BestCombination. Given an instance I(S,R, F, d), find a “best”subset of F such that the read data can be explained by enumerating all possiblesubsets of F .1: for 1 ≤ i ≤ |S| do

2: p← 0 and F ′ ← ∅.3: for each F ′′ ⊂ F where |F ′′| = i and F ′′ is a feasible solution of I do

4: z, x ←QPsolver(I(S, F ′′, d)).5: if p < p-value(z) then

6: p← p-value(z) and F ′ ← F ′′.7: end if

8: end for

9: if p ≥ 0.05 then

10: Return F ′.11: end if

12: end for

However, it is often infeasible to enumerate all possible combinations of thevalid isoforms of a given size. When this happens, we decompose an the instanceinto some sub-instances. In each sub-instance, only a subset of expressed seg-ments are considered. More specifically, for an instance I(S,R, F, d), where F isthe set of valid isoforms enumerated, a sub-instance I(m,n) = I(S(m,n), R(m,n),d(m,n), F (m,n)), 0 ≤ m < n ≤ |S|, is defined concerning the subset S(m,n) =sm+1, . . . , sn of expressed segments of S. It is formally defined as follows. Foreach f ∈ B(|S|), define f (m,n) ∈ B(n−m) and f (m,n)[i] = f [i+m], 1 ≤ i ≤ n−m.In other words, f (m,n) denotes the sub-vector of f spanning the interval [m+1, n].Let F (m,n) = f (m,n)|f ∈ F, R(m,n) = r(m,n)|r ∈ R, d(m,n)(i) = d(i+m), 1 ≤i ≤ n−m, and d(m,n)(i, j) = d(i+m, j +m), 1 ≤ i < j ≤ n−m. Note that thestart-end segment information is not needed in sub-instances.

The parameter β appearing in step 3 controls the maximum size of a sub-instance. Larger sub-instances make the results of procedure BestCombinationmore reliable. However, the execution time of BestCombination increases expo-nentially with the number of valid isoforms which grows with the size of thesub-instance. Therefore, instead of a fixed size, a set of sub-instance sizes fromthe interval [5, β] are attempted. For a fixed sub-instance size, BestCombinationis executed on each sub-instance of the size in step 7. According to the resultsof BestCombination, each valid isoform is assigned a weight in Step 8 whichroughly indicates the frequency that the isoform appears in the combinationsfound by BestCombination. A subset of valid isoforms with weights less than 1are removed in steps 11-15 without making F infeasible.

In steps 17 and 18 of the algorithm, a weighted set cover instance is con-structed such that an optimal solution implies a subset of valid isoforms witha minimum total weight such that all the short reads and start-end segments

11

Page 12: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

can be explained. The set cover problem can be solved by using the branch-and-bound method implemented in GNU package GLPK, since it involves only smallinstances.

3 Simulation test results

We test IsoInfer on mouse genes. The reference genomic sequence and knownisoforms of all mouse genes are downloaded from UCSC (mm9, NCBI Build37) [54]. All exon-intron boundaries of the known isoforms are extracted. Thisdataset contains 26,989 genes and 49,409 isoforms. 16,392 (60.7%) of the geneshave only one isoform and 59 (0.2%) of the genes have more than 10 isoforms.5830 (21.6%) of the genes have only one exon and 384 (1.4%) of the genes havemore than 40 exon-intron boundaries. For the simulation study, only genes withat least two known isoforms are used, which result in 10,595 genes. We furtherextract all the start-end segments and randomly generate relative expressionlevels of every isoform. Although it would be natural to assume that expressionlevels follow a uniform distribution, it is reported in [55–57] that the expressionlevels of isoforms tend to obey a log-normal distribution. Therefore, we considerthree types of distributions.

– Base10: For each isoform, a random number r following the standard normaldistribution is generated and then 10r is assigned as the relative expressionlevel of this isoform.

– Base2: For each isoform, a random number r following the standard normaldistribution is generated and then 2r is assigned as the relative expressionlevel of this isoform.

– Uniform: For each isoform, a random number r uniformly generated from[0,1] is assigned as the relative expression level of this isoform.

Then 40M single-end and 10M paired-end short reads are randomly generatedaccording to the relative expression levels of the isoforms. In the simulation, weassume that the span of a paired-end read is a random variable obeying thenormal distribution N(µ, σ2) [58] so we could evaluate the impact of the meanand deviation of the spans of paired-end reads on the performance of IsoInfer.Note that IsoInfer does not depend on this assumption and works for paired-endreads drawn from any distribution.

Finally, IsoInfer is used to recover all the known isoforms using the start-end segments and single-end and paired-end reads. In the simulation, the readlengths of single-end and paired-end reads are 25bps and 20bps, respectively.The parameter α is set to 1 RPKM, β = 7 and γ = 1000. We consider threemeasures of the performance, sensitivity, effective sensitivity and precision. Aknown isoform is recovered if it is in the output of IsoInfer. Sensitivity is definedas the number of recovered isoforms divided by the number of all known iso-forms. Specificity is defined as the number of recovered isoforms divided by thenumber of isoforms inferred. Since IsoInfer only intends to infer isoforms thatare sufficiently expressed, it is useful to consider how many sufficiently expressed

12

Page 13: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

isoforms are recovered by the algorithm. Since Theorem 2 shows that an isoformwith a sufficiently high expression level is likely to satisfy condition I (i.e., all itsexon-intron junctions are supported by the read data) with high probability, wedefine effective sensitivity as the number of recovered isoforms divided by thenumber of known isoforms whose exon-intron junctions are supported by theread data.

3.1 Calculation of expression levels

To estimate the effectiveness of our QP formulation, we randomly generateBase10 expression levels and single-end short reads on the known mouse isoformsand check whether it can recover the correct expression levels of the known iso-forms. For an isoform f with expression level xf and calculated expression level

x′f , the relative difference|x′

f−xf |

xfis used to measure the accuracy of calculation.

A simple and widely used method of calculating expression levels of isoforms isbased on counting reads mapped to its unique exons and exon junctions [41, 40].Clearly, this simple strategy fails if the isoform does not have any unique ex-ons or exon junctions. We compare our method with the simple method (simplydenoted as Uniq in this paper) and the method based on maximum likelihood es-timation (MLE) and importance sampling (IS) [29]. The comparison is depictedin Figure 4.

Relative difference (10M)

0 0.15 0.35 0.55 0.75 0.95

0.0

0.2

0.4

0.6

0.8

1.0MLE+ISMLEIsoInferUniq

Relative difference (80M)

0 0.15 0.35 0.55 0.75 0.95

0.0

0.2

0.4

0.6

0.8

1.0

MLE+ISMLEIsoInferUniq

Fig. 4. Comparison of the accuracies of different methods in estimating isoform expres-sion levels. The Y-axis shows the percentage of isoforms whose estimated/calculatedexpression levels are within a certain relative difference range from the truth. 10 millionreads (left) and 80 million reads (right) are sampled in each of the figures.

The comparison shows that MLE followed by IS (MLE+IS) is the most accu-rate and Uniq is the worst. IsoInfer achieves comparable performances with MLE(followed by IS). An advantage of MLE+IS is that it also provides a 95% confi-dence interval for each expression level estimation. On the other hand, IsoInfercalculates the expression levels much faster than MLE+IS does (3 minutes vs 3hours for all mouse genes on a standard desktop PC). The efficiency of IsoInfermakes the search for novel isoforms possible.

13

Page 14: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

3.2 The influence of the distribution of expression levels

In this section, we analyze the influence of the distribution of expression levelson the performance of IsoInfer in inferring isoforms. The distribution of thespan of paired-end reads are fixed as the normal distribution N(300, 302). Thesensitivities and precisions grouped by number of known isoforms per gene aredepicted in Figure 5.

The overall sensitivities and precisions of IsoInfer on (Base10, Base2, Uni-form) expression levels are (39.7%,75.0%,72.5%) and (79.3%,82.1%,81.3%), re-spectively. The sensitivities for Base10 expression levels are much lower thanthose for Base2 and Uniform expression levels, because a large faction of theisoforms are not significant expressed. The effective sensitivity of three casesare 83.5%, 77.4% and 77.4%, respectively. Figure 5 gives detailed sensitivity,effective sensitivity and precision of IsoInfer on genes with a certain numberof isoforms. The high effective sensitivity shown in the figure is also confirmedby the sensitivity results on different expression levels, also given in Figure 5which shows that isoforms with high expression levels are identified with highsensitivities. For example, for Base10 expression levels, isoforms with expressionlevel above 3 (or 6) RPKM are identified with sensitivity above 56.0% (or 81.0%,respectively).

#Isoforms per gene

Sen

sitiv

ity

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0Base10Base2Uniform

#Isoforms per gene

Effe

ctiv

e se

nsiti

vity

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0

Base10Base2Uniform

#Isoforms per gene

Pre

cisi

on

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0

Base10Base2Uniform

Expression level

Sen

sitiv

ity

−15 −10 −6 −2 1 4 7 10

0.0

0.2

0.4

0.6

0.8

1.0

Base10Base2Uniform

Fig. 5. The sensitivity (top left), effective sensitivity (top right) and precision (bottomleft) of IsoInfer on genes with a certain number of isoforms when different distributionsof expression levels are generated. The bottom right graph shows the sensitivity ofIsoInfer on different expression levels when different distributions of expression levelare applied. In the graph, the expression levels are log2 transformed. Expression levelx corresponds to 25 · 2x RPKM. The vertical line corresponds to expression level 1/8= 3.125 RPKM.

14

Page 15: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

#Isoforms per geneS

ensi

tivity

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0II+III+IIII+II+III

#Isoforms per gene

Effe

ctiv

e se

nsiti

vity

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0

II+III+IIII+II+III

#Isoforms per gene

Pre

cisi

on

2 3 4 5 6 7 >7

0.0

0.2

0.4

0.6

0.8

1.0

II+III+IIII+II+III

Expression level

Sen

sitiv

ity

−15 −10 −6 −2 1 4 7 10

0.0

0.2

0.4

0.6

0.8

1.0

II+III+IIII+II+III

Fig. 6. The sensitivity (top left), effective sensitivity (top right) and precision (bottomleft) of IsoInfer on genes with a certain number of isoforms when different combinationsof type I, II and III data are provided. The bottom right graph shows the sensitivityof IsoInfer on different expression levels when different combinations of type I, II andIII data are used. Again, the expression levels are log2 transformed. Expression level xcorresponds to 25 · 2x RPKM. The vertical line corresponds to expression level 1/8 =3.125 RPKM.

3.3 The importance of start-end expressed segment pairs

As mentioned before, single-end short reads are necessary for our algorithmbut start-end segment pairs and paired-end reads are optional. To estimate theimportance of the last two pieces of information, we compare the results whendifferent types of data are available. Four combinations are possible, denoted asI, I+II, I+III, and I+II+III, where I, II and III correspond to single-end reads(which provide the junction information), start-end segment pairs and paired-end data, respectively. The combination I+III means that the single-end andpaired-end read data are available but not the start-end segment pairs. In thesimulation, Base10 expression levels are generated and the span distribution ofpaired-end reads is fixed as N(300, 302). Figure 6 shows that start-end segmentpairs are much more important than paired-end reads for our algorithm. Forexample, the sensitivities and precisions for combinations I+II and I+III are(38.9%,78.5%) and (29.5%,16.5%), respectively.

3.4 The influence of span distribution

The span of paired-end reads follows the normal distribution N(µ, σ2). We runIsoInfer on different combinations of µ and σ. On each combination, 10 millionpair-end reads are randomly generated. Since start-end segment pairs are muchmore important than paired-end reads, as shown in the above subsection, thespan distribution should not have a significant influence on the inference results

15

Page 16: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

when start-end segment pairs are available. This is confirmed by Tables 3 and4 given in [42]. The precision and sensitivity of IsoInfer vary by at most 1.5%when different span distributions are applied.

The above small effect of paired-end read data on the performance of IsoInferis because the parameter α is set to 1. When a large α is applied, IsoInfer tradessensitivity for precision. For example, when the span distribution of paired-endread is fixed as N(300, 302), if α is set to 1, the sensitivity and precision on geneswith at least 8 isoforms are 40.2% and 74.0%, respectively. The two measures willchange to 35.4% and 78.1%, respectively, when α is set to 20. The performanceof IsoInfer when α is set to different values is shown in Tables 5 and 6 of [42].

4 Recovery of known isoforms from real reads

The evaluation uses the following four data sets: (1) known mouse isoforms down-loaded from UCSC [54], which contains 49,409 transcripts, (2) mouse mRNAsexpressed in various tissues downloaded from UCSC containing 228,779 mR-NAs, (3) RNA-Seq data from brain, liver and skeletal muscle tissues of mouse[24], which contains 47,781,892, 44,279,807 and 38,210,358 single-end reads forbrain, liver and muscle, respectively, and (4) 104,710 exon junctions that werepredicted by TopHat from the above RNA-Seq data for mouse brain tissue [19].

As in the simulation tests, on a specific tissue, one can only expect thatisoforms with expression levels above a certain threshold can be detected byRNA-Seq experiments, so as to be inferred by IsoInfer. Given a set of mappedreads, an isoform is said to be theoretically expressed if each exon except forthe first and last one of this isoform has expression level at least 1 RPKM andevery exon junction on this isoform is supported by short reads. (Note that thisdoes not really guarantee that the isoform is actually expressed.) The expres-sion levels of the first and last exons are ignored here because of the possible3′ and 5′ sampling biases in RNA-Seq [27, 24]. The theoretically expressed iso-forms among known mouse isoforms and mRNAs are used as benchmarks. Notethat the benchmarks change when different tissues are considered, because theexpression levels of isoforms change from tissue to tissue.

We have done two group of tests. The first one is to use the TSS-PAS pairand exon-intron boundary information from the known mouse isoforms and/ormRNAs from UCSC and RNA-Seq short reads to infer isoforms. The predictedisoforms are compared with the theoretically expressed isoforms in the corre-sponding benchmark. An isoform is recovered by IsoInfer if one of isoforms in-ferred by IsoInfer matches this isoform precisely (i.e., the two isoforms containexactly the same set of exons with exactly the same boundaries). The inferenceresults are shown in Table 1. These results demonstrate that when accurate exon-intron boundary and TSS-PAS pair information is provided, IsoInfer achieves areasonably good precision, and the precision increases as the size of the bench-mark increases. When known mouse isoforms are used, IsoInfer achieves decenteffective sensitivities (i.e., 72.9% for brain, 82.2% for liver and 83.0% for muscle).Because mRNAs were collected from different sources and tissues, a large frac-

16

Page 17: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

tion of them may not really be expressed in a specific tissue. Therefore, effectivesensitivity of IsoInfer drops when mRNAs are used as the benchmark.

Table 1. The performance of IsoInfer when different exon-intron boundary and TSS-PAS pair information and corresponding benchmarks are used. Here, “Union” meansthat the exon-intron boundary and TSS-PAS pair information is extracted from bothknown mouse isoforms and mRNAs and the benchmark is the union of the knownmouse isoforms and mRNAs.

Known isoforms mRNAs Union

Tissue Brain Liver Muscle Brain Liver Muscle Brain Liver Muscle

#Theoretically expressed 18521 12411 11723 87178 72594 69086 101392 82199 78298

Specificity 0.493 0.592 0.627 0.572 0.670 0.712 0.591 0.697 0.737

Effective sensitivity 0.729 0.822 0.830 0.328 0.352 0.366 0.335 0.365 0.381

The second test measures the performance of IsoInfer when the exact exon-intron boundary information is unavailable. The test uses exon-intron boundariespredicted by TopHat from the RNA-Seq read data on the mouse brain tissue andthe TSS-PAS pair information extracted from the known mouse isoforms and/ormRNAs. The test results are shown in Table 2. Although it is reported in [19] thatover 80% of the exon junctions predicted by TopHat are also exon junctions in theUCSC known mouse isoforms, the inference result on the known mouse isoformsis much worse than the result when exact exon-intron boundary informationis provided. On the other hand, when mRNA is used as the benchmark, theexon-intron boundaries provided by TopHat lead IsoInfer to a more aggressiveprediction (and thus achieving a better effective sensitivity).

In each of the above tests, the last three steps of IsoInfer shown in Figure 3took less than 80 minutes on an Intel P8600 processor.

Table 2. The performance of IsoInfer when the exon-intron boundary information isextracted from the exon junctions predicted by TopHat. These results are all on themouse brain tissue. The TSS-PAS pair information is extracted from the known mouseisoforms and/or mRNAs, depending on the benchmark. Again, “Union” means thatthe TSS-PAS pair information is extracted from both known mouse isoforms and thebenchmark is the union of the known mouse isoforms and mRNAs.

Known isoforms mRNAs Union

Specificity 0.240 0.362 0.378

Effective sensitivity 0.496 0.532 0.508

Acknowledgment

We thank Pirola Yuri for useful discussions. The research is supported in partby a CSC scholarship, NSF grant IIS-0711129, and NIH grants LM008991 andAI078885.

17

Page 18: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

References

1. Boguski, M.S., Tolstoshev, C.M., Bassett, D.E.: Gene discovery in dbEST. Science265(5181) (1994) 1993–1994

2. Boguski, M.S.: The turning point in genome research. Trends in BiochemicalSciences 20(8) (1995) 295 – 296

3. The FANTOM Consortium: The transcriptional landscape of the mammaliangenome. Science 309(5740) (2005) 1559–1563

4. The ENCODE Project Consortium: Identification and analysis of functional el-ements in 1% of the human genome by the ENCODE pilot project. Nature447(7146) (2007) 799–816

5. Weinstock, G.M.: ENCODE: more genomic empowerment. Genome Res 17(6)(2007) 667–668

6. Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn,J.L., Tongprasit, W., Samanta, M., Weissman, S., Gerstein, M., Snyder, M.: Globalidentification of human transcribed sequences with genome tiling arrays. Science306(5705) (2004) 2242–2246

7. Kwan, T., Benovoy, D., Dias, C., Gurd, S., Provencher, C., Beaulieu, P., Hud-son, T.J., Sladek, R., Majewski, J.: Genome-wide analysis of transcript isoformvariation in humans. Nat Genetics (2008)

8. Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D.,Santos, R., Schadt, E.E., Stoughton, R., Shoemaker, D.D.: Genome-wide surveyof human alternative pre-mRNA splicing with exon junction microarrays. Science302(5653) (2003) 2141–2144

9. Kapranov, P., Cheng, J., Dike, S., Nix, D.A., Duttagupta, R., Willingham, A.T.,Stadler, P.F., Hertel, J., Hackermuller, J., Hofacker, I.L., Bell, I., Cheung, E.,Drenkow, J., Dumais, E., Patel, S., Helt, G., Ganesh, M., Ghosh, S., Piccolboni,A., Sementchenko, V., Tammana, H., Gingeras, T.R.: RNA maps reveal new RNAclasses and a possible function for pervasive transcription. Science 316(5830)(2007) 1484–1488

10. Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo,S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht,G., Vermaas, E., Williams, S.R., Moon, K., Burcham, T., Pallas, M., DuBridge,R.B., Kirchner, J., Fearon, K., Mao, J., Corcoran, K.: Gene expression analysisby massively parallel signature sequencing (MPSS) on microbead arrays. NatBiotechnol 18(6) (2000) 630–634

11. Reinartz, J., Bruyns, E., Lin, J.Z., Burcham, T., Brenner, S., Bowen, B., Kramer,M., Woychik, R.: Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief Funct GenomicProteomic 1(1) (2002) 95–104

12. Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W.: Serial analysis of geneexpression. Science 270(5235) (1995) 484–487

13. Harbers, M., Carninci, P.: Tag-based approaches for transcriptome research andgenome annotation. Nat Meth 2(7) (2005) 495–502

14. Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodz-ius, R., Watahiki, A., Nakamura, M., Arakawa, T., Fukuda, S., Sasaki, D., Pod-hajska, A., Harbers, M., Kawai, J., Carninci, P., Hayashizaki, Y.: Cap analysisgene expression for high-throughput analysis of transcriptional starting point andidentification of promoter usage. Proceedings of the National Academy of Sciencesof the United States of America 100(26) (2003) 15776–15781

18

Page 19: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

15. Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M., Fukuda, S., Tagami, M.,Sasaki, D., Imamura, K., Kai, C., Harbers, M., Hayashizaki, Y., Carninci, P.:CAGE: cap analysis of gene expression. Nat Meth 3(3) (2005) 211–222

16. Kim, J.B., Porreca, G.J., Song, L., Greenway, S.C., Gorham, J.M., Church, G.M.,Seidman, C.E., Seidman, J.G.: Polony multiplex analysis of gene expression(PMAGE) in mouse hypertrophic cardiomyopathy. Science 316(5830) (2007) 1481–1484

17. Ng, P., Wei, C.L., Sung, W.K., Chiu, K.P., Lipovich, L., Ang, C.C., Gupta, S.,Shahab, A., Ridwan, A., Wong, C.H., Liu, E.T., Ruan, Y.: Gene identification sig-nature (GIS) analysis for transcriptome characterization and genome annotation.Nat Methods 2 (2005) 105 – 111

18. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., Snyder,M.: The transcriptional landscape of the yeast genome defined by RNA sequencing.Science 320(5881) (2008) 1344–1349

19. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions withRNA-Seq. Bioinformatics 25(9) (2009) 1105–1111

20. Graveley, B.R.: Molecular biology: power sequencing. Nature 453(7199) (2008)1197–1198

21. Yassour, M., Kaplan, T., Fraser, H.B., Levin, J.Z., Pfiffner, J., Adiconis, X.,Schroth, G., Luo, S., Khrebtukova, I., Gnirke, A., Nusbaum, C., Thompson, D.A.,Friedman, N., Regev, A.: Ab initio construction of a eukaryotic transcriptome bymassively parallel mRNA sequencing. Proceedings of the National Academy ofSciences 106(9) (2009) 3264–3269

22. Wilhelm, B.T., Marguerat, S., Watt, S., Schubert, F., Wood, V., Goodhead, I.,Penkett, C.J., Rogers, J., Bahler, J.: Dynamic repertoire of a eukaryotic transcrip-tome surveyed at single-nucleotide resolution. Nature 453(7199) (2008) 1239–43

23. Cloonan, N., Forrest, A.R., Kolle, G., Gardiner, B.B., Faulkner, G.J., Brown, M.K.,Taylor, D.F., Steptoe, A.L., Wani, S., Bethel, G., Robertson, A.J., Perkins, A.C.,Bruce, S.J., Lee, C.C., Ranade, S.S., Peckham, H.E., Manning, J.M., McKernan,K.J., Grimmond, S.M.: Stem cell transcriptome profiling via massive-scale mRNAsequencing. Nat Methods (2008)

24. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B.: Mapping andquantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7) (2008)621 – 628

25. Marioni, J., Mason, C., Mane, S., Stephens, M., Gilad, Y.: RNA-seq: an assessmentof technical reproducibility and comparison with gene expression arrays. GenomeRes 18(9) (2008) 1509–17

26. Sultan, M., Schulz, M.H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M.,Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D., Schmidt, D., O’Keeffe,S., Haas, S., Vingron, M., Lehrach, H., Yaspo, M.L.: A global view of gene activityand alternative splicing by deep sequencing of the human transcriptome. Science321(5891) (2008) 956–960

27. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcrip-tomics. Nature reviews. Genetics (2008)

28. Lacroix, V., Sammeth, M., Guigo, R., Bergeron, A.: Exact transcriptome recon-struction from short sequence reads. In: WABI ’08: Proceedings of the 8th inter-national workshop on Algorithms in Bioinformatics, Berlin, Heidelberg, Springer-Verlag (2008) 50–63

29. Jiang, H., Wong, W.H.: Statistical inferences for isoform expression in RNA-Seq.Bioinformatics 25(8) (2009) 1026–1032

19

Page 20: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

30. Pagani, F., Baralle, F.E.: Genomic variants in exons and introns: identifying thesplicing spoilers. Nat Rev Genet 5(5) (2004) 389–396

31. Srebrow, A., Kornblihtt, A.R.: The connection between splicing and cancer. J CellSci 119(13) (2006) 2635–2641

32. Williams, W.V.: Editorial hot topic: Transcriptome analysis in drug development(executive editor: William v. williams). Current Molecular Medicine 5 (2005) 1–2(2)

33. Heber, S., Alekseyev, M., Sze, S.H., Tang, H., Pevzner, P.A.: Splicing graphs andEST assembly problem. Bioinformatics 18(suppl-1) (2002) S181–188

34. Sammeth, M., Valiente, G., Guigo, R.: Bubbles: alternative splicing events ofarbitrary dimension in splicing graphs. In Vingron, M., Wong, L., eds.: RECOMB.Volume 4955 of Lecture Notes in Computer Science., Springer (2008) 372–395

35. Xing, Y., Resch, A., Lee, C.: The multiassembly problem: reconstructing multipletranscript isoforms from EST fragment mixtures. Genome Res 14(3) (2004) 426–441

36. Bonizzoni, P., Mauri, G., Pesole, G., Picardi, E., Pirola, Y., Rizzi, R.: Detectingalternative gene structures from spliced ESTs: a computational approach. Journalof Computational Biology 16(1) (2009) 43–66

37. Djebali, S., Kapranov, P., Foissac, S., Lagarde, J., Reymond, A., Ucla, C., Wyss,C., Drenkow, J., Dumais, E., Murray, R.R., Lin, C., Szeto, D., Denoeud, F., Calvo,M., Frankish, A., Harrow, J., Makrythanasis, P., Vidal, M., Salehi-Ashtiani, K.,Antonarakis, S.E., Gingeras, T.R., Guigo, R.: Efficient targeted transcript dis-covery via array-based normalization of RACE libraries. Nat Meth 5(7) (2008)629–635

38. Salehi-Ashtiani, K., Yang, X., Derti, A., Tian, W., Hao, T., Lin, C., Makowski,K., Shen, L., Murray, R.R., Szeto, D., Tusneem, N., Smith, D.R., Cusick, M.E.,Hill, D.E., Roth, F.P., Vidal, M.: Isoform discovery by targeted cloning, ’deep-well’pooling and parallel sequencing. Nat Meth 5(7) (2008) 597 – 600

39. Fullwood, M.J., Wei, C.L., Liu, E.T., Ruan, Y.: Next-generation DNA sequencingof paired-end tags (PET) for transcriptome and genome analyses. Genome Res19(4) (2009) 521–532

40. Pan, Q., Shai, O., Lee, L.J., Frey, B.J., Blencowe, B.J.: Deep surveying of alterna-tive splicing complexity in the human transcriptome by high-throughput sequenc-ing. Nat Genet 40(12) (2008) 1413–1415

41. Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C.,Kingsmore, S.F., Schroth, G.P., Burge, C.B.: Alternative isoform regulation inhuman tissue transcriptomes. Nature 456(7221) (2008) 470–476

42. Feng, J., Li, W., Jiang, T.: Inference of isoforms from short sequence reads.Manuscript (Jan 2010)

43. Breitbart, R.E., Andreadis, A., Nadal-Ginard, B.: Alternative splicing: a ubiqui-tous mechanism for the generation of multiple protein isoforms from single genes.Annual Review of Biochemistry 56(1) (1987) 467–495

44. Sammeth, M., Foissac, S., Guig, R.: A general definition and nomenclature foralternative splicing events. PLoS Comput Biol 4(8) (2008) e1000147

45. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficientalignment of short DNA sequences to the human genome. Genome Biology 10(3)(2009) R25

46. Li, H., Ruan, J., Durbin, R.: Mapping short DNA sequencing reads and callingvariants using mapping quality scores. Genome Res 18(11) (2008) 1851–1858

47. Li, R., Li, Y., Kristiansen, K., Wang, J.: SOAP: short oligonucleotide alignmentprogram. Bioinformatics 24(5) (2008) 713–714

20

Page 21: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

48. Cloonan, N., Xu, Q., Faulkner, G.J., Taylor, D.F., Tang, D.T., Kolle, G., Grim-mond, S.M.: RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data. Bioinformatics (2009) btp459

49. Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari,F., Kitzman, J.O., Baker, C., Malig, M., Mutlu, O., Sahinalp, S.C., Gibbs, R.A.,Eichler, E.E.: Personalized copy number and segmental duplication maps usingnext-generation sequencing. Nat Genet 41(10) (2009) 1061–1067

50. Hashimoto, T., Hoon, M.J.d., Grimmond, S.M., Daub, C.O., Hayashizaki, Y.,Faulkner, G.J.: Probabilistic resolution of multi-mapping reads in massively par-allel sequencing data using MuMRescueLite. Bioinformatics (2009) btp438

51. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2007)52. Goldfarb D, I.A.: A numerically stable dual method for solving strictly convex

quadratic programs. Math Program 27 (1983) 1–3353. Korbel, J., Abyzov, A., Mu, X., Carriero, N., Cayting, P., Zhang, Z., Snyder, M.,

Gerstein, M.: PEMer: a computational framework with simulation-based errormodels for inferring genomic structural variants from massive paired-end sequenc-ing data. Genome Biology 10(2) (2009) R23

54. Karolchik, D., Kuhn, R.M., Baertsch, R., Barber, G.P., Clawson, H., Diekhans,M., Giardine, B., Harte, R.A., Hinrichs, A.S., Hsu, F., Kober, K.M., Miller, W.,Pedersen, J.S., Pohl, A., Raney, B.J., Rhead, B., Rosenbloom, K.R., Smith, K.E.,Stanke, M., Thakkapallayil, A., Trumbower, H., Wang, T., Zweig, A.S., Haussler,D., Kent, W.J.: The UCSC genome browser database: 2008 update. Nucl AcidsRes 36(Database issue) (2008) D773–9

55. Alter, M.D., Rubin, D.B., Ramsey, K., Halpern, R., Stephan, D.A., Abbott, L.F.,Hen, R.: Variation in the large-scale organization of gene expression levels in thehippocampus relates to stable epigenetic variability in behavior. PLoS ONE 3(10)(10 2008) e3344

56. Konishi, T.: Three-parameter lognormal distribution ubiquitously found in cdnamicroarray data and its application to parametric data treatment. BMC Bioinfor-matics 5(1) (2004) 5

57. Wijaya, E., Harada, H., Horton, P.: Modeling the marginal distribution of geneexpression with mixture models. In: FGCN ’08: Proceedings of the 2008 SecondInternational Conference on Future Generation Communication and Networking,Washington, DC, USA, IEEE Computer Society (2008) 84–89

58. Richter, D.C., Ott, F., Auch, A.F., Schmid, R., Huson, D.H.: MetaSima sequencingsimulator for genomics and metagenomics. PLoS ONE 3(10) (2008) e3373

21

Page 22: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

Appendix

The proof of Theorem 1

Proof. For simplicity, we assume that the distributions involved in the followingproof are discrete. Let q(x) be the probability that the span of a randomlygenerated paired-end read is x, and p(x) the probability of a uniformly randomlyselected position from all isoforms being at position x on the given isoform. Everypaired-end read can be represented by its start (or center or end) position andspan uniquely. Denote the set of all possible start positions as Ψ and the set of allpossible spans as Ω. Let V ⊂ Ψ×Ω defines the set of paired-end reads that havestart positions in the first interval and end positions in the third interval. Understrategy (a), the probability of a uniformly randomly generated paired-end readbeing in V is:

Pa =∑ψ∈Ψ

(∑

ω|(ψ,ω)∈V

q(ω))p(ψ)

=∑

(ψ,ω)∈V

q(ω)α/109

Similarly, we define the set of possible center positions of paired-end reads asΨ ′. Let V ′ ⊂ Ψ ′ ×Ω define the set of paired end reads that have start positionsin the first interval and end positions in the third interval. Under strategy (b),the probability of a uniformly randomly generated paired-end read being in V ′

is:Pb =

∑(ψ,ω)∈V ′

q(ω)α/109

Because |ψ|(ψ, ω) ∈ V ′| = |ψ|(ψ, ω) ∈ V | for ω ∈ Ω, we have Pa = Pb. Theargument is also applicable to case when strategy (c) is applied.

When strategy (a) is applied and the end position of the third interval isnot the end position of the given isoform, if the start position of a uniformlyrandomly generated paired-end read is i, 0 ≤ i < w1 in the first interval, thenthe probability of the end position of this paired-end read being in the thirdinterval is

pi = P (X ≤ u(i)) − P (X ≤ l(i)) =

∫ u(i)

l(i)

h(x)dx

where l(i) = w1 − i+w2, u(i) = w1 − i+w2 +w3. When the end position of thethird interval is the end position of the given isoform and strategy (a) is applied,we have

pi = P (X ≤ +∞) − P (X ≤ l(i)) =

∫ +∞

l(i)

h(x)dx ≥∫ u(i)

l(i)

h(x)dx

Because the start position of a paired-end read is uniformly randomly selected,

Pa = 10−9α∑

0≤i<w1

pi ≥ 10−9α∑

0≤i<w1

∫ u(i)

l(i)

h(x)dx = P0

22

Page 23: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

BecauseM paired-end reads are generated, the probability that none of the readshave start positions in the first interval and end positions in the third intervalis (1 − Pa)

M ≤ (1 − P0)M = PM,h,α(w1, w2, w3) ≈ e−MP0 .

Similar arguments hold when strategies (b) and (c) are applied to generatethe reads.

The proof of Theorem 2

Proof. If expression level of y RPKM of the isoform f corresponds to one tran-script of f , the total number of the expressed transcripts of f is x/y. Based on thedefinition of RPKM, y = (106 ·103)/L0 = 109/L0, where L0 is the total length ofall the expressed transcripts with duplications. For any junction, the probabilityof a read falling into this junction is xL/yL0. So, the probability that none of

the reads fall into this junction is (1−xL/yL0)M ≈ e−xLM/yL0 = e−xLM/109

. Inorder for this isoform to be valid, each of the t−1 junctions contains at least oneread. Therefore, the probability of this isoform being valid is (1−e−xLM/109

)t−1.Note that the sequencing noise does not decrease the above probability althoughit may provide some spurious junction reads.

Table 3. Sensitivities for various span distributions grouped by the number of isoformsper gene. Here, “No PE reads” means that no paired-end reads are applied. The firstcolumn lists various combinations of the mean and standard deviation in the span(normal) distributions considered. The corresponding effective sensitivities range from63.4% to 97.4%.

#isoforms per gene 2 3 4 5 6 7 ≥ 8

No PE reads 0.392 0.402 0.392 0.383 0.374 0.346 0.391

300, 10 0.393 0.406 0.402 0.391 0.385 0.357 0.402

300, 30 0.393 0.407 0.404 0.392 0.386 0.362 0.402

300, 50 0.393 0.407 0.402 0.393 0.385 0.366 0.402

300, 100 0.393 0.408 0.404 0.395 0.385 0.359 0.405

1100, 110 0.387 0.401 0.399 0.395 0.392 0.363 0.403

3000, 300 0.392 0.404 0.403 0.400 0.390 0.366 0.413

23

Page 24: Inference of Isoforms from Short Sequence Reads (Extended Abstract)

Table 4. Specificities for various span distributions grouped by the number of iso-forms per gene. The first column lists various combinations of the mean and standarddeviation in the span (normal) distributions considered.

#isoforms per gene 2 3 4 5 6 7 ≥ 8

No PE reads 0.893 0.824 0.774 0.732 0.704 0.638 0.733

300, 10 0.897 0.830 0.784 0.738 0.717 0.648 0.740

300, 30 0.897 0.831 0.786 0.739 0.718 0.657 0.740

300, 50 0.897 0.830 0.784 0.740 0.714 0.663 0.737

300, 100 0.896 0.830 0.786 0.743 0.713 0.649 0.740

1100, 110 0.896 0.829 0.782 0.739 0.720 0.657 0.729

3000, 300 0.896 0.828 0.776 0.741 0.709 0.648 0.736

Table 5. Sensitivities grouped by the number of isoforms per gene when α is set tovarious values.

#isoforms per gene 2 3 4 5 6 7 ≥ 8

α =1 0.393 0.407 0.404 0.392 0.386 0.362 0.402

2 0.379 0.390 0.388 0.373 0.374 0.347 0.389

3 0.375 0.381 0.381 0.363 0.362 0.332 0.381

4 0.370 0.376 0.375 0.355 0.356 0.325 0.377

5 0.368 0.371 0.371 0.353 0.353 0.325 0.375

6 0.364 0.367 0.366 0.348 0.349 0.324 0.374

7 0.363 0.364 0.363 0.344 0.347 0.323 0.370

8 0.361 0.362 0.361 0.342 0.345 0.323 0.370

9 0.360 0.361 0.360 0.340 0.344 0.323 0.367

10 0.359 0.361 0.358 0.340 0.343 0.323 0.367

20 0.350 0.350 0.340 0.327 0.332 0.306 0.354

Table 6. Specificities grouped by the number of isoforms per gene when α is set tovarious values.

#isoforms per gene 2 3 4 5 6 7 ≥ 8

α =1 0.897 0.831 0.786 0.739 0.718 0.657 0.740

2 0.895 0.833 0.785 0.738 0.721 0.664 0.738

3 0.897 0.835 0.786 0.741 0.724 0.659 0.741

4 0.898 0.838 0.792 0.741 0.730 0.662 0.751

5 0.900 0.842 0.797 0.749 0.734 0.668 0.757

6 0.900 0.845 0.797 0.750 0.738 0.675 0.762

7 0.901 0.844 0.797 0.749 0.742 0.686 0.766

8 0.902 0.845 0.799 0.748 0.747 0.696 0.772

9 0.902 0.847 0.803 0.752 0.752 0.698 0.771

10 0.902 0.849 0.804 0.754 0.754 0.703 0.774

20 0.905 0.853 0.804 0.758 0.760 0.697 0.781

24