
METHOD Open Access

TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs

Juntao Liu1,2†, Ting Yu1†, Tao Jiang2,3,4* and Guojun Li1*†

Abstract

Transcriptome assemblers aim to reconstruct full-length transcripts from RNA-seq data. We present TransComb, a genome-guided assembler developed based on a junction graph, weighted by a bin-packing strategy and paired-end information. A newly designed extension method based on weighted junction graphs can accurately extract paths representing expressed transcripts, whether they have low or high expression levels. Tested on both simulated and real datasets, TransComb demonstrates significant improvements in both recall and precision over leading assemblers, including StringTie, Cufflinks, Bayesembler, and Traph. In addition, it runs much faster and requires less memory on average. TransComb is available at http://sourceforge.net/projects/transcriptomeassembly/files/.

Keywords: RNA-seq, Transcriptome assembly, Alternative splicing, Splicing graph, Isoform

Background

Recent research has revealed the immense complexity and diversity of transcriptomes in eukaryotes [1]. To understand the exact mechanism of alternative splicing in multi-exon protein coding genes, researchers have conducted both computational and experimental studies. A recent publication [2] showed that proteins translated along with alternative splicing can be captured by an electron microscope, which provides some insights into the mechanism of alternative splicing. However, it remains a highly challenging task to elucidate the alternative splicing mechanism in eukaryotic species.

The technology of RNA-seq is rapidly changing our ability to explore the very complex transcriptomic landscape as it can provide unprecedented accuracy in quantifying mRNA expression levels [3]. This technology also accurately elucidates all splicing events, including those that are rare and expressed at a low level. This opens a new door to study the mechanism of complex diseases related to abnormal splicing or expression levels, such as cancers. However, it has been observed [4–6] that alternative splicing events may produce various isoforms in eukaryotic tissues and, moreover, there are many kinds of splicing events, including retained introns, skipped exons, and mutually exclusive exons, and some exons may even be partially spliced, so-called partial exons. In addition, an RNA-seq run may produce more than 200 million short reads, each of which consists of only 50–150 bp. Therefore, it is highly challenging to accurately assemble the huge number of short reads into full-length transcripts computationally.

There are two main strategies for transcriptome assembly, genome-guided and de novo [3, 7]. Genome-guided assemblers, such as StringTie [8], Bayesembler [9], Cufflinks [10], Scripture [11], IsoInfer [12], IsoLasso [13], iReckon [14], CEM [15], Traph [16], and CIDANE [17], usually first map short reads to a reference genome using an alignment tool, such as TopHat [18], TopHat2 [19], SpliceMap [20], or GSNAP [21], to cluster the reads into gene loci, based on which the so-called splicing graph or overlap graph can be constructed for each individual cluster. Then, a well-studied mathematical model, e.g., minimum-cost minimum path cover, can be applied to search for transcript-representing paths in each splicing graph, as is done in Cufflinks [10]. As stated in [9], however, minimum-cost minimum path cover may not be a good choice because this strategy may lack a biological

* Correspondence: [email protected]; [email protected]
†Equal contributors
2Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
1School of Mathematics, Shandong University, Jinan 250100, China
Full list of author information is available at the end of the article

© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Liu et al. Genome Biology (2016) 17:213 DOI 10.1186/s13059-016-1074-1


foundation and more complex solutions may better explain the full graph coverage. Pertea et al. [8] demonstrated that Scripture has very low precision because it tends to predict a large number of splice variants for each gene, most of which are false positives. StringTie uses a greedy algorithm to search for transcript-representing paths in splicing graphs without effectively resolving the ambiguities caused by exons with multiple splicing junctions. Moreover, StringTie updates the current splicing graph based on the estimated expression level of the most recently assembled transcript, so a single erroneous transcript can lead to a succession of erroneous predictions because the splicing graph is then updated incorrectly. Traph uses a minimum-cost flow model combined with a greedy algorithm in the assembly procedure, which breaks down easily, especially on complicated splicing graphs. Others, such as IsoLasso, IsoInfer, CEM, and iReckon, both predict transcripts and estimate their expression levels using regularization-based methods. However, these methods achieve sparsity by thresholding transcript expression levels, which penalizes lowly expressed transcripts. Bayesembler is a very recently published probabilistic method for both transcriptome assembly and expression level estimation. It uses a probabilistic model to simulate the RNA sequencing process without penalizing lowly expressed transcripts. However, it employs an enumeration strategy, which makes it extremely time consuming. CIDANE is a newly published assembler that was developed based on genome annotation. However, it is not included as a comparator in this study because TransComb and the other compared tools take only the alignment of RNA-seq data as input.

De novo assemblers, such as BinPacker [22], Bridger [23], Trinity [7], ABySS [24], SOAPdenovo-Trans [25], Oases [26], and IDBA-Tran [27], assemble transcripts directly from short reads without using a reference genome. However, the presence of gene homology, splicing event diversity, and sequence coverage unevenness makes the de novo strategy even more challenging, which usually results in lower accuracy than the genome-guided strategy. Although an increasing number of transcriptome assemblers, either genome-guided or de novo, have been developed, they are all limited by their accuracy in practical applications. As claimed in a recent study [28], even if all the exact exons of a gene are given, they are often unable to assemble the exons into correct isoforms. Therefore, it is imperative to develop novel algorithms for accurate recovery of transcriptomes in eukaryotic species.

In this article, we introduce TransComb, a new genome-guided transcriptome assembler. The basic idea behind TransComb is to extract transcript-representing paths on so-called junction graphs weighted on both their nodes and edges. The weighted junction graph is defined with the nodes representing the edges of a splicing graph and an edge representing a pair of incident edges of the splicing graph. Each node in the junction graph is weighted by the coverage of the corresponding edge of the splicing graph, while each edge in the junction graph is assigned a weight of 0, 1, or 2, defined by a novel use of two techniques, the bin-packing strategy and paired-end information. The bin-packing strategy was first developed in our previous de novo assembler BinPacker for combing splicing graphs; in TransComb it is employed together with the paired-end information to guide the accurate extraction of full-length transcripts over the weighted junction graphs. Tested on both simulated and real datasets from multiple species, TransComb performs better than all four compared leading assemblers (StringTie, Cufflinks, Bayesembler, and Traph) on all datasets. For example, TransComb correctly assembled 23 % more full-length transcripts than Bayesembler, the next best assembler, on the human K562 dataset, 12 % more on the human H1 cell dataset, and 13 % more on the mouse dendritic cell dataset. In addition, it runs much faster than Cufflinks and Bayesembler and requires less memory, on average, than all the other assemblers.

Results and discussion

We first present an overview of our transcriptome recovery model, with a detailed description in the “Methods” section, followed by a comparison of TransComb to the state-of-the-art assemblers of the same kind on both simulated and real datasets.

The TransComb model

TransComb first constructs the splicing graphs ab initio from the alignments of RNA-seq reads to the reference genome. Theoretically, each gene locus is represented by a splicing graph, each exon is represented by a node in the graph, and two nodes are connected by a directed edge if and only if there is a splicing junction between them. The coverage of each edge is defined as the number of reads spanning the corresponding junction.

In order to construct more accurate splicing graphs, which would largely improve the performance of an assembler, TransComb uses the paired-end reads to repair exons that are fragmented because of low gene expression levels, and further slides a window to correct exons that were wrongly merged because of sequencing or mapping errors. Based on the splicing graphs, we develop a rigorous mathematical model to determine the transcripts most likely to be expressed.
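The graph definition above can be sketched in a few lines. This is a minimal illustration, not TransComb's implementation: it assumes exons have already been identified and that each junction-spanning read has been reduced to a pair of exon indices; the function name `build_splicing_graph` and the input layout are hypothetical.

```python
from collections import Counter

def build_splicing_graph(exons, spliced_reads):
    """Build the splicing graph of one gene locus (illustrative sketch).

    exons: list of (start, end) intervals; exon k becomes node k.
    spliced_reads: list of (donor_exon, acceptor_exon) index pairs, one per
        read spanning a junction (hypothetical pre-parsed input).
    Returns (nodes, edges), where edges maps (u, v) -> junction coverage.
    """
    edges = Counter()
    for u, v in spliced_reads:
        # Keep only junctions that run in genomic order, so the graph is acyclic.
        if exons[u][1] <= exons[v][0]:
            edges[(u, v)] += 1  # coverage = number of reads spanning the junction
    return list(range(len(exons))), dict(edges)

exons = [(100, 200), (300, 400), (500, 600)]
reads = [(0, 1), (0, 1), (1, 2), (0, 2)]
nodes, edges = build_splicing_graph(exons, reads)
```

Here exon 1 is skipped by one read (junction 0 to 2), so the graph encodes two candidate isoforms. The paired-end repair and sliding-window correction steps are not modeled.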


Resolving the ambiguities in linking in- and out-splicing junctions at each exon with multiple splicing junctions is the toughest task in the development of assemblers. Most existing assemblers suffer from this, which results in low assembly accuracy (see an example in Additional file 1: Figure S4). To alleviate this, we designed a new assembly strategy by introducing another directed acyclic graph, termed a junction graph, with nodes representing edges in the splicing graph and edges representing pairs of incident edges in the splicing graph. Each node in the junction graph is weighted by the coverage of its corresponding edge in the splicing graph, while each edge is weighted by 0, 1, or 2, defined by employing two techniques, the bin-packing strategy and paired-end information. A weight of 0 on an edge means that no information supports this edge being included in any expressed transcript, i.e., it will never appear in any transcript predicted by TransComb. A weight of 1 on an edge means that the information from the bin-packing strategy supports, with high credibility, this edge being included in some expressed transcript. In some special cases, the above ambiguities are hard to resolve by the bin-packing strategy; e.g., the strategy has difficulty when two transcripts sharing an exon have very similar expression levels. Paired-end information incorporated into the junction graphs can then be used to refine the solution from the bin-packing strategy. Thus, a weight of 2 on an edge implies that the information from paired-end reads supports this edge with high credibility. Benefiting from bin-packing and paired-end information, TransComb can more accurately link in- and out-splicing junctions at each exon (i.e., adjacent nodes in the junction graph) and therefore overcome ambiguities to a great extent.

The substantial difference between TransComb and other assemblers is methodological. TransComb simultaneously integrates coverage and paired-end information by jointly applying bin-packing and paired-end techniques to a so-called junction graph rather than working from the splicing graph as the other assemblers do. In addition, we have developed a novel extension strategy for transcript-representing paths over the junction graphs, by which each predicted path has a very high probability of representing an expressed transcript, whether its expression level is low or high (see the “Methods” section and Fig. 1 for more details).
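To make the roles of the three edge weights concrete, the following is a rough sketch of path extension over a weighted junction graph, following the outline given in the “Methods” section: start from the heaviest node not yet on a path, extend right and then left, prefer weight-2 edges over weight-1 edges, and never use weight-0 edges. The function name `extract_paths`, the integer node labels, and the tie-breaking by node coverage are assumptions made for illustration; the authors' pseudo code is in Additional file 1.

```python
def extract_paths(node_weight, edge_weight):
    """Greedy path extension on a weighted junction graph (illustrative sketch).

    node_weight: dict node -> coverage.
    edge_weight: dict (u, v) -> 0, 1, or 2; all endpoints appear in node_weight.
    """
    succ, pred = {}, {}
    for (u, v), w in edge_weight.items():
        if w > 0:  # weight-0 edges never appear in a predicted path
            succ.setdefault(u, []).append(v)
            pred.setdefault(v, []).append(u)

    def pick_right(u):
        cands = succ.get(u, [])
        # Weight 2 beats weight 1; ties broken by node coverage (an assumption).
        return max(cands, key=lambda v: (edge_weight[(u, v)], node_weight[v])) if cands else None

    def pick_left(v):
        cands = pred.get(v, [])
        return max(cands, key=lambda u: (edge_weight[(u, v)], node_weight[u])) if cands else None

    used, paths = set(), []
    while len(used) < len(node_weight):
        # Seed: heaviest node not yet included in any predicted path.
        seed = max((n for n in node_weight if n not in used), key=node_weight.get)
        path = [seed]
        while (nxt := pick_right(path[-1])) is not None:
            path.append(nxt)
        while (prv := pick_left(path[0])) is not None:
            path.insert(0, prv)
        used.update(path)
        paths.append(path)
    return paths

node_weight = {1: 9, 2: 4, 3: 8, 4: 3}
edge_weight = {(1, 3): 2, (1, 4): 0, (2, 3): 0, (2, 4): 1}
paths = extract_paths(node_weight, edge_weight)
```

In this toy graph the weight-0 edges (1, 4) and (2, 3) are ignored, so the two junction pairs are separated into the paths [1, 3] and [2, 4]. Because the junction graph is acyclic, both extension loops terminate.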

Performance evaluation

We compared TransComb with four other leading genome-guided assemblers (StringTie, Cufflinks, Bayesembler, and Traph) on both simulated and real datasets using their default and some other parameters on the same server (see Additional file 1 for details of the parameter setups for the compared assemblers). In our experiments, the alignments of RNA-seq reads generated by TopHat2 (see Additional file 1 for download information for the TopHat2 index files) and the reference transcripts downloaded from the Ensembl Genome Browser (see Additional file 1 for download information) were used to evaluate the assemblers on real datasets. A reference transcript is considered to have been correctly recovered if it has the same number of exons as an assembled transcript as well as exactly matched intron boundaries. In this study, Cuffcompare [10] was applied to detect the correctly assembled transcripts. A reference transcript is called a true positive for an assembler if it was exactly recovered by the assembler. Accuracy is measured by recall and precision, where recall is defined as the fraction of true positives out of all expressed reference transcripts in the experiment and precision is defined as the percentage of all assembled transcripts that correctly match a reference transcript. We use recall and precision to evaluate the assembly accuracy of each compared assembler.
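The two accuracy measures can be written down directly. The sketch below is illustrative only and does not reproduce Cuffcompare: it assumes each transcript is represented by its intron chain (a tuple of intron intervals), so that "same number of exons and exactly matched intron boundaries" reduces to equality of chains. The function name `recall_precision` is hypothetical.

```python
def recall_precision(assembled, expressed_reference):
    """Recall and precision from sets of intron chains (illustrative sketch).

    Each transcript is a tuple of (intron_start, intron_end) pairs; two
    transcripts match when their intron chains are identical.
    """
    true_positives = assembled & expressed_reference
    recall = len(true_positives) / len(expressed_reference) if expressed_reference else 0.0
    precision = len(true_positives) / len(assembled) if assembled else 0.0
    return recall, precision

reference = {
    ((200, 300), (400, 500)),  # three-exon isoform
    ((200, 500),),             # exon-skipping isoform
    ((200, 300),),             # two-exon isoform
}
assembled = {
    ((200, 300), (400, 500)),  # exact match: true positive
    ((200, 450),),             # wrong acceptor site: false positive
}
recall, precision = recall_precision(assembled, reference)
```

Here one of three reference transcripts is recovered (recall 1/3) and one of two assembled transcripts is correct (precision 1/2).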

Performance comparisons on simulated data

We first tested TransComb against the other four assemblers on the simulated dataset used in CIDANE [17] for their performance evaluation (see Additional file 1 for download information for the simulated data). This dataset was generated by Flux Simulator [29] using all known transcripts from the UCSC hg19 gene annotation, and contains approximately 80 million strand-specific RNA-seq paired-end reads of 100-bp length.

Testing the five assemblers, TransComb, StringTie, Cufflinks, Bayesembler, and Traph, with their default settings, TransComb performed the best in transcript recovery, with 10,528 true positives versus 9988 for StringTie, 8281 for Cufflinks, 9835 for Bayesembler, and 8299 for Traph (Additional file 1: Table S1). Broadly, TransComb recovered 5 % more true positives than the next best assembler, StringTie, and large differences between TransComb and the others can be expected on large real datasets. The comparison results thus demonstrate that TransComb achieves higher recall than the other assemblers.

Comparison results for precision show that TransComb reaches 62.83 % versus StringTie with 55.09 %, Cufflinks with 52.36 %, Bayesembler with 68.08 %, and Traph with 42.32 %, so Bayesembler seems to perform the best. However, the better precision of Bayesembler comes from the fact that it filters many candidates with the default setting, while TransComb does not, making the number of candidates generated by TransComb much larger than that generated by Bayesembler. To make a fair comparison between Bayesembler and TransComb, we filtered the candidates of TransComb using a filtering parameter (see Additional file 1 for details) so that the candidate set was close in size to that of Bayesembler. The comparison results in Fig. 2a demonstrate that the recall and precision achieved by TransComb are both higher than those achieved by Bayesembler when filtering their candidates to the same level (see Additional file 1: Table S1 for details of their filtered candidates).

To evaluate the overall performance of the various assemblers, we computed their precision and recall while adjusting their filtering parameters (Fig. 2a; see Additional file 1 for details), showing that TransComb always outperforms the others. We conclude that TransComb has the best accuracy among the assemblers tested on the simulated dataset.

An ideal assembler should be able to recover expressed transcripts regardless of whether their expression levels are low or high. To evaluate the assemblers in terms of expression levels, we computed their recall distributions against transcript expression levels (Fig. 2b), showing that TransComb is superior to the other compared assemblers across all expression levels.
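A recall distribution against expression level, as plotted in Fig. 2b, can be obtained by binning the expressed reference transcripts by expression level and computing recall within each bin. The sketch below is one plausible way to do this; the bin boundaries, the function name `recall_by_expression`, and the toy values are all assumptions, since the binning used for the figure is not specified here.

```python
from bisect import bisect_right

def recall_by_expression(expression, recovered, bin_edges):
    """Recall per expression-level bin (illustrative sketch).

    expression: dict transcript_id -> expression level of each expressed
        reference transcript.
    recovered: set of transcript ids correctly recovered by an assembler.
    bin_edges: sorted cut points; k cut points define k + 1 bins.
    Returns a list of recall values (None for empty bins).
    """
    totals = [0] * (len(bin_edges) + 1)
    hits = [0] * (len(bin_edges) + 1)
    for tid, level in expression.items():
        b = bisect_right(bin_edges, level)  # index of the bin holding this level
        totals[b] += 1
        hits[b] += tid in recovered
    return [h / t if t else None for h, t in zip(hits, totals)]

expression = {"t1": 0.5, "t2": 3.0, "t3": 12.0, "t4": 15.0}
recovered = {"t2", "t3"}
per_bin = recall_by_expression(expression, recovered, bin_edges=[1.0, 10.0])
```

With cut points at 1 and 10, the three bins (low, medium, high) have recall 0.0, 1.0, and 0.5 in this toy example.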

Performance comparisons on real data

Tests on real data give a better evaluation of an assembler because the data have properties that cannot be accurately captured by simulations. In this study, two strand-specific datasets from human K562 cells and H1 cells and one strand-specific dataset from mouse dendritic cells were used for evaluating the performance of the compared assemblers. These three datasets contain 125 million, 41 million, and 53 million paired-end reads, respectively. The human K562 cell and H1 cell RNA-seq datasets were collected from the NCBI Sequence Read Archive (SRA) database with accession codes SRX110318 and SRX082572, respectively. The mouse dendritic cell dataset was also collected from the NCBI SRA database with accession code SRX062280. Unfortunately, Traph does not work on real datasets due to instability, so we only evaluated TransComb against StringTie, Cufflinks, and Bayesembler. As noted in [8], it is impossible for us to know all the genuine transcripts encoded in the sample. However, we can treat an assembled transcript as a true positive if it correctly matches a reference transcript, and as a false positive otherwise, without substantially affecting the comparative performance of the assemblers. We thus compared TransComb against the other three assemblers on the three real datasets on this basis, and the precision and recall measures used for the simulated dataset were again used to evaluate the accuracies of the assemblers on real datasets.

Fig. 1 Methodological aspects of TransComb. For the two numbers of an edge in the splicing graph, the number above the edge represents the coverage of the edge and the circled number below represents the index of the edge. The number on each node of the junction graph represents the weight of this node, which is the coverage of the corresponding edge on the splicing graph

Running the four assemblers with their default settings on the three real RNA-seq datasets, TransComb recovered more true positives than the other three assemblers for each (Additional file 1: Tables S2, S3, and S4). TransComb recovered 12,948, 12,135, and 11,117 true positives for the human K562 and H1 cell and mouse dendritic cell datasets, respectively, versus 10,507, 10,832, and 9874 recovered by the second best assembler, Bayesembler. In comparison with Bayesembler, TransComb recovered 23 % more true positives on human K562 cell data, 12 % more on human H1 cell data, and 13 % more on mouse dendritic cell data. The comparison results thus demonstrate that TransComb achieves the highest recall among all the compared assemblers.

As for precision, TransComb achieved 24.5, 24.23, and 32.04 % on the human K562 and H1 cell and mouse dendritic cell data, respectively, versus 17.69, 20.64, and 38.26 % for StringTie, 13.43, 15.33, and 32.95 % for Cufflinks, and 21.3, 23.72, and 28.18 % for Bayesembler.

The comparison results show that TransComb performs better than all the other assemblers on the human K562 cell and H1 cell datasets. For the mouse dendritic cell dataset, as stated in the last section, TransComb does not filter its candidates with its default setting while the other assemblers all filter their candidates with their default settings, which can make their precision higher than that of TransComb; the filtering used by StringTie and Cufflinks resulted in the highest precision on the mouse dendritic cell dataset. As with the simulated data, we filtered the candidates of TransComb after setting its filtering parameter (see Additional file 1 for details) so that the candidate set was close in size to those of StringTie and Cufflinks. The comparison results in Fig. 3c demonstrate that TransComb also achieved recall and precision levels higher than all the other compared assemblers on the mouse dendritic cell dataset when filtering their candidates to the same level (see Additional file 1: Table S4 for details of their filtered candidates).

As with the simulation, we also compared the precision and recall values when using different filtering parameters (see Additional file 1 for details) on the three real datasets (Fig. 3), showing that TransComb is consistently superior to the other assemblers on all tested real datasets. Therefore, we conclude that TransComb has the best accuracy among all the assemblers tested on both simulated and real datasets.
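The relative recall gains over Bayesembler reported in this section follow directly from the true-positive counts. The short check below recomputes them; the helper name `percent_more` is illustrative.

```python
def percent_more(ours, theirs):
    """Relative gain of `ours` over `theirs`, in percent."""
    return 100.0 * (ours - theirs) / theirs

# True positives with default settings: (TransComb, Bayesembler).
counts = {
    "human K562": (12948, 10507),
    "human H1": (12135, 10832),
    "mouse dendritic": (11117, 9874),
}
gains = {name: round(percent_more(a, b)) for name, (a, b) in counts.items()}
```

Rounded to whole percentages, the gains are 23, 12, and 13 %, matching the figures quoted in the text.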

Other comparisons on both simulated and real datasets

In addition to the comparisons above, we further compared correctly identified genes and unique true positives (see Additional file 1 for definitions of these criteria) on both simulated and real datasets. The results (Additional file 1) showed that TransComb performs much better than all the compared assemblers according to these criteria, except that it is slightly inferior to StringTie in terms of correctly identified genes on the human H1 cell dataset (see Additional file 1 for further details).

Fig. 2 Comparison results on the simulated dataset. a Precision and recall values for each assembler. Solid circles/squares represent the precision and recall values derived using the default settings of the assemblers. Empty circles/squares represent the precision and recall values using non-default settings of the assemblers. The crossed circle represents the precision and recall values of TransComb when filtering its candidates to the same level as Bayesembler. b Recall distributions against transcript expression levels

For expression level estimation, we developed a very simple but unusual method whose accuracy is comparable to that of the other tools and much better than that of Traph (see Additional file 1 for detailed comparison results). It may provide ideas for scientists interested in developing more powerful estimators.

Comparison of running time and memory usage

We show the running time and memory usage of the four assemblers only for the human K562 cell dataset, which contains the largest number of reads (approximately 125 million reads) among all the tested datasets (Fig. 4; detailed in Additional file 1: Table S10). StringTie used the least CPU time (23 minutes) and TransComb used the second least CPU time (56 minutes), more than seven times faster than Cufflinks (429 minutes) and nine times faster than Bayesembler (537 minutes). StringTie used less time than the other assemblers because it simply applies a greedy algorithm to extend paths in the splicing graphs. In addition, the current version of TransComb was developed to run on a single CPU while StringTie runs on multiple CPUs. We will parallelize TransComb in the near future in order to increase its throughput when multiple CPUs are available.

For memory usage, all four assemblers have a large memory footprint (Additional file 1: Table S10) due to the large RNA-seq dataset to be processed, especially on the highly expressed transcripts. We see from Fig. 4b that TransComb uses less memory than all the others, on average, while Bayesembler has the highest memory usage, and StringTie and Cufflinks consume similar memory resources on average.

Fig. 3 Comparison results on the three real datasets: a human K562 cells, b human H1 cells, and c mouse dendritic cells. Solid circles/squares represent precision and recall values achieved using default settings. Empty circles/squares represent precision and recall values achieved using different filtering parameters. The crossed circle in c represents TransComb’s precision and recall values when filtering its candidates to the same level as StringTie and Cufflinks

Fig. 4 CPU time and memory usage of the four assemblers on the human K562 cell dataset. a CPU times of the four assemblers. b Memory usage ranges of the four assemblers; the horizontal black lines represent the average memory usage for each assembler

Conclusions

We present TransComb, a novel genome-guided computational tool for transcriptome assembly from short RNA-seq reads. Compared to other leading assemblers on both simulated and real datasets, TransComb consistently performs the best. First, in comparison with other genome-guided assemblers, TransComb more accurately identifies the boundaries of gene loci, as well as exons and junction sites within a gene. In lowly expressed genes, some exons may not be fully covered by reads, which results in an exon or a gene being separated into two or more parts. Using paired-end information, TransComb is able to repair some of the broken gene loci and even exons which are not spanned by reads, and thus builds more accurate splicing graphs. Second, TransComb is able to recognize pseudo exons, each consisting of two consecutive exons and the intron between them, which may result from wrongly mapping reads to the intron areas, and is capable of distinguishing the actual exons from the pseudo ones. Third, it exploits a combing strategy to search for a path cover on a so-called weighted junction graph, rather than a minimum path cover, which has been found to be not always optimal. In contrast, StringTie and Traph were developed based on a network flow model along with a greedy approach for assembling transcripts, which is fragile, especially in complicated splicing graphs. Fourth, TransComb integrates paired-end information into its process in order to correct errors caused by either sequencing or mapping. In addition, TransComb is also superior to the assemblers compared here with regard to computational resource usage. It is the fastest assembler apart from StringTie and uses the least memory on average.

TransComb is comparable to the other assemblers in transcript expression level estimation, just slightly inferior to StringTie but much superior to Traph. It has been observed that exclusively using the seed edges reduces, to some extent, the adverse impact of unexpected errors and biases on the estimation of expression levels of assembled transcripts. We believe that our combinatorial approach in conjunction with the statistical normalization model will further improve the estimator in the future.

In conclusion, TransComb is essentially distinct from any previous transcriptome assembler. It attempts to achieve a global solution by (1) combing in- and out-edges at each node of splicing graphs; (2) guiding the decision at each edge in a so-called weighted junction graph using paired-end reads; and (3) extending the current path based on the weighted junction graph. The software has been developed to be user-friendly. It may play a crucial role in new discoveries in transcriptome studies using RNA-seq, especially in complicated human diseases related to abnormal splicing events and expression levels, such as cancers.

Methods

Construction of splicing graphs

Splicing graphs, which play a fundamental role in TransComb, are constructed at each gene locus from alignments of reads mapped to a reference genome using TopHat2. Aligned reads are first clustered into different gene loci and then the exon–intron boundaries and exon–exon junctions for each gene are derived from junction reads or paired-end reads. Paired-end reads are also used to repair exons that are fragmented because of low gene expression levels, and a sliding window technique is used to correct exons that were wrongly merged because of sequencing or mapping errors. Then, for each gene locus, an exon is represented by a node and two nodes are connected by a directed edge if and only if there is a splicing junction between them. The coverage of each edge is defined as the number of reads which span the corresponding junction. Paired-end information between every two incident edges is also recorded for further use (Additional file 1: Figure S2a). In an ideal case, the splicing graphs would correspond one-to-one to all the (expressed) genes.

Theoretically, a splicing graph captures all possible alternative splicing events, with the nodes in the graph corresponding to continuous regions in the genome that are uninterrupted by any splicing event and directed edges corresponding to splicing junctions between exons. It is worth mentioning that a node in a splicing graph does not necessarily correspond to a true exon in a gene; instead, it may be only a partial exon, as illustrated in Additional file 1: Figure S2b.
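The notion of a partial exon can be illustrated with a small sketch. It assumes, purely for illustration, that the exon intervals of all isoforms at a locus are already known as half-open (start, end) pairs; nodes are then the maximal regions not interrupted by any exon boundary. The function name `split_into_segments` is hypothetical, and TransComb itself derives boundaries from read alignments rather than from known exons.

```python
def split_into_segments(exons):
    """Split overlapping exons into uninterrupted segments (illustrative sketch).

    exons: iterable of half-open (start, end) intervals from all isoforms.
    Returns the sorted segments that lie inside at least one exon and
    contain no exon boundary in their interior.
    """
    exons = list(exons)
    bounds = sorted({b for exon in exons for b in exon})
    segments = []
    for left, right in zip(bounds, bounds[1:]):
        # Keep the piece only if some exon covers it entirely.
        if any(start <= left and right <= end for start, end in exons):
            segments.append((left, right))
    return segments

# Two isoforms use alternative ends of the first exon (150 versus 200).
segments = split_into_segments([(100, 200), (100, 150), (300, 400)])
```

The exon (100, 200) is split into the nodes (100, 150) and (150, 200), each of which is only a partial exon, while the intronic region (200, 300) produces no node.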

Recovery of a set of paths from a splicing graph

Based on the splicing graphs constructed above, TransComb recovers the most likely expressed transcripts by combing the junction edges on each splicing graph. The combing strategy is achieved by the following steps (see Additional file 1: Figure S1 for a flow chart).

Construction of junction graphs

To resolve the ambiguities in linking the in- and out-splicing junctions at each exon with multiple splicing junctions, we define another directed acyclic graph J, named the junction graph, weighted on both its nodes and edges. In J, each node i represents a junction edge in the splicing graph and two nodes i and j are connected by a directed edge (i, j) if and only if i and j correspond to two incident edges in the splicing graph (as shown in Fig. 1, edges 1 and 2 are incident edges, while edges 1 and 3 are not). Each node is weighted by the coverage of its corresponding edge in the splicing graph, while each edge is suitably weighted based on the novel use of a bin-packing strategy and paired-end information, described in the following two sections.
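The construction of J amounts to taking the line graph of the splicing graph. The sketch below is a minimal illustration under the assumption that two splicing-graph edges are "incident" when the first ends at the exon where the second begins; the function name `build_junction_graph` and the exon labels are hypothetical.

```python
def build_junction_graph(splice_edges):
    """Build the junction graph of a splicing graph (illustrative sketch).

    splice_edges: dict (u, v) -> coverage of the splicing-graph edge u -> v.
    Returns (node_weight, junction_edges): every splicing-graph edge becomes
    a node weighted by its coverage, and two nodes are linked when the first
    edge enters the exon that the second edge leaves.
    """
    node_weight = dict(splice_edges)
    junction_edges = [
        (e1, e2)
        for e1 in splice_edges
        for e2 in splice_edges
        if e1[1] == e2[0]  # head of e1 is the tail of e2
    ]
    return node_weight, junction_edges

splice_edges = {("A", "B"): 10, ("B", "C"): 6, ("B", "D"): 4}
node_weight, junction_edges = build_junction_graph(splice_edges)
```

Exon B has one in-junction and two out-junctions, so J contains two edges leaving the node ("A", "B"); deciding which of them to trust is the job of the edge weights described next.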

Weighting edges in junction graphs by the bin-packing strategy

To decide whether or not two consecutive nodes in a junction graph come from a common transcript, we developed a rule to assign to each edge in the junction graph a weight of either 0 or 1, with 1 meaning that the two nodes of the edge could come from a single transcript and 0 meaning they will never do so. The rule was designed by using a variant of the traditional bin-packing model, which has been well studied in the field of operations research.

For each node v with both in-edges and out-edges in a splicing graph, the in-edges and out-edges must be linked optimally, such that two linked edges in the splicing graph (which correspond to two nodes in the junction graph) come from a common transcript with high probability. Assume that node v has n in-edges and m out-edges with n ≥ m. TransComb then treats each in-edge as an item whose size s_j is the coverage of the in-edge and each out-edge as a bin whose capacity c_i is the coverage of the out-edge. The optimum link between in-edges and out-edges is then achieved by optimally packing the items (in-edges) into the bins (out-edges). To solve this optimal packing problem, we define a binary variable x_ij, with x_ij = 1 if item j is packed into bin i, and 0 otherwise. The problem can be reduced to finding an optimum solution {x_ij} that minimizes the following quadratic function, subject to the conditions that each item goes into one and only one bin and each bin receives at least one item:

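$$
z = \sum_{i=1}^{m} \left( c_i - \sum_{j=1}^{n} s_j x_{ij} \right)^2
\quad \text{s.t.} \quad
\begin{cases}
\sum_{i=1}^{m} x_{ij} = 1, & \forall j = 1, \ldots, n \\
\sum_{j=1}^{n} x_{ij} \geq 1, & \forall i = 1, \ldots, m \\
x_{ij} \in \{0, 1\}
\end{cases}
\tag{1}
$$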

The same formulation was used in BinPacker, a de novo transcriptome assembler. As stated in BinPacker, this quadratic program can be solved quickly. Based on the solution {x_ij} of Eq. 1, we assign to each edge connecting nodes i and j in the junction graph the weight w(i, j) = x_ij. Intuitively, two nodes connected by an edge of weight 1 are more likely to come from a common transcript, whereas an edge of weight 0 has no information supporting its inclusion in any expressed transcript and so never appears in a predicted transcript.
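As an illustration of Eq. 1, the following minimal Python sketch solves the packing problem at a single splicing-graph node by exhaustive enumeration. It is not the solver used in TransComb or BinPacker; the function name `pack_in_edges` and the example coverages are hypothetical, and enumeration is practical only when n and m are small.

```python
from itertools import product


def pack_in_edges(in_cov, out_cov):
    """Brute-force solution of Eq. 1 for one splicing-graph node.

    in_cov  -- coverages s_j of the n in-edges (items)
    out_cov -- coverages c_i of the m out-edges (bins), with n >= m >= 1
    Returns (z, x), where x[i][j] == 1 if in-edge j is packed into out-edge i.
    """
    n, m = len(in_cov), len(out_cov)
    if m < 1 or n < m:
        raise ValueError("this sketch assumes n >= m >= 1, as in the text")

    best_z, best_assign = None, None
    # assign[j] is the bin that receives item j, so each item goes into
    # exactly one bin by construction (first constraint of Eq. 1).
    for assign in product(range(m), repeat=n):
        if len(set(assign)) < m:
            continue  # some bin is empty: violates the second constraint
        load = [0.0] * m
        for j, i in enumerate(assign):
            load[i] += in_cov[j]
        z = sum((out_cov[i] - load[i]) ** 2 for i in range(m))
        if best_z is None or z < best_z:
            best_z, best_assign = z, assign

    x = [[1 if best_assign[j] == i else 0 for j in range(n)] for i in range(m)]
    return best_z, x


# Hypothetical node with three in-edges and two out-edges.
z, x = pack_in_edges(in_cov=[10, 30, 5], out_cov=[15, 30])
print(z, x)
```

In this example the in-edges with coverages 10 and 5 are packed into the out-edge with coverage 15, and the in-edge with coverage 30 into the out-edge with coverage 30, so the corresponding junction-graph edges receive weight 1 and all others weight 0.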

Updating edge weights in junction graphs using paired-end information

All paired-end information between two incident edges in each splicing graph was recorded along with the constructed splicing graphs. We use this paired-end information to update the edge weights of the corresponding junction graph as follows. For any two incident edges i and j in a splicing graph, if at least two mate pairs supporting these two edges are recorded (see Additional file 1: Figure S2A for details), then the weight of the edge connecting the two corresponding nodes in the junction graph is replaced by 2. Two nodes connected by an edge of weight 2 in the junction graph are therefore the most likely to come from a common transcript.

The weighted junction graph defined above guides a more accurate extraction of full-length transcripts: based on it, TransComb can link the in- and out-splicing junctions at each exon more accurately. The substantial difference between TransComb and the other assemblers is that TransComb works on junction graphs, whereas the others work on splicing graphs. In addition, the extension strategy developed for TransComb is another feature that distinguishes it from the other assemblers.
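The update rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `apply_paired_end_support`, and the example counts are hypothetical, and the way mate pairs are counted (Additional file 1: Figure S2A) is not modeled here.

```python
def apply_paired_end_support(weights, mate_pair_counts, min_pairs=2):
    """Return a copy of the junction-graph edge weights after the paired-end update.

    weights          -- {(u, v): 0 or 1} from the bin-packing step
    mate_pair_counts -- {(u, v): number of mate pairs supporting both junctions}
    min_pairs        -- support threshold; the text uses "at least two"
    """
    updated = dict(weights)  # leave the input untouched
    for edge, count in mate_pair_counts.items():
        if edge in updated and count >= min_pairs:
            updated[edge] = 2  # replaces the previous weight, whether 0 or 1
    return updated


# Hypothetical exon with in-junctions a, b and out-junctions c, d.
weights = {("a", "c"): 1, ("a", "d"): 0, ("b", "c"): 0, ("b", "d"): 1}
support = {("a", "d"): 3, ("b", "d"): 1}
print(apply_paired_end_support(weights, support))
```

Because the weight is replaced rather than incremented, an edge that bin-packing set to 0 is restored with weight 2 when enough mate pairs support it, as with edge (a, d) in this example.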

Recovery of full-length transcripts from weighted junction graphs

To extract a transcript-representing path from the junction graph, we first choose as the seed node a node with the largest weight among those not yet included in any predicted path. The seed node is then extended to one of its right neighboring nodes according to the weights of the corresponding edges: an edge of weight 2 has first priority and an edge of weight 1 has second priority. We keep extending to the right until we reach a node without outgoing edges. A similar extension is carried out to the left, which yields a predicted transcript-representing path. This procedure is repeated until all nodes in the junction graph have been included in predicted paths (see Additional file 1 for pseudocode describing the path extension procedure in detail).

In each extension to the left or right, TransComb always extends the current path to a neighboring node supported by either bin-packing or paired-end information, so each predicted path has a high probability of representing an expressed transcript, whether its expression level is low or high.
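The seed-and-extend procedure can be sketched as follows. This is not the pseudocode of Additional file 1 but a simplified illustration with several assumptions: the junction graph is acyclic, node weights are supplied by the caller, ties between edges of equal weight are broken by node weight, and extension stops when no edge of weight 1 or 2 remains. The function name `extract_paths` is hypothetical.

```python
def extract_paths(node_weight, edge_weight):
    """Greedy path extraction from a weighted junction graph (assumed acyclic).

    node_weight -- {node: weight}, used to pick seeds and break ties (assumption)
    edge_weight -- {(u, v): 0, 1 or 2} for directed edges u -> v
    """
    succ, pred = {}, {}
    for (u, v), w in edge_weight.items():
        if w > 0:  # edges of weight 0 are never used
            succ.setdefault(u, []).append((w, v))
            pred.setdefault(v, []).append((w, u))

    def best(candidates):
        # Weight-2 edges take priority over weight-1 edges.
        return max(candidates, key=lambda wv: (wv[0], node_weight[wv[1]]))[1]

    covered, paths = set(), []
    while len(covered) < len(node_weight):
        # Seed: heaviest node not yet included in any predicted path.
        seed = max((v for v in node_weight if v not in covered),
                   key=node_weight.get)
        path = [seed]
        while path[-1] in succ:   # extend to the right
            path.append(best(succ[path[-1]]))
        while path[0] in pred:    # extend to the left
            path.insert(0, best(pred[path[0]]))
        covered.update(path)
        paths.append(path)
    return paths


# Hypothetical junction graph: a and b both precede c; (a, c) has paired-end support.
print(extract_paths({"a": 30, "b": 10, "c": 40},
                    {("a", "c"): 2, ("b", "c"): 1}))
```

In this example the seed c is first extended to the left along the weight-2 edge to a, giving the path a, c; node b is then the only uncovered node and is extended to the right to give the path b, c.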


Estimation of expression levels of the recovered transcripts

Based only on the seed nodes used during path extension in the junction graph, we developed a very simple but unusual method for expression level estimation (see Additional file 1 for a detailed description of this method).

Additional file

Additional file 1: Supplementary materials. This file contains details of splicing graph construction, additional comparisons with other methods, and supplementary figures and tables. (PDF 7353 kb)

Abbreviations

NCBI: National Center for Biotechnology Information; RNA-seq: RNA sequencing; SRA: Sequence Read Archive

Acknowledgments

The authors wish to thank Robert W. Robinson for providing assistance with writing the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (codes 61432010, 61272016, and 31571354) and the USA National Science Foundation (grants DBI-1262107 and IIS-1646333). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Availability of data and materials

The TransComb algorithm is implemented in C++ and freely available under the GNU General Public License (GPL). A version of the source code has been deposited in the repository Zenodo (https://zenodo.org/) with DOI: 10.5281/zenodo.61994.

Authors’ contributions

Conceived and designed the experiments: GL. Performed the experiments: JL. Analyzed the data: JL and TY. Contributed reagents/materials/analysis tools: JL and TJ. Wrote the paper: GL, JL, and TY. Designed the software used in analysis: JL and TY. Oversaw the project: GL. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author details

1 School of Mathematics, Shandong University, Jinan 250100, China.
2 Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA.
3 MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China.
4 Institute of Integrative Genome Biology, University of California, Riverside, CA 92521, USA.

Received: 27 July 2016 Accepted: 28 September 2016

References

1. Blencowe BJ. Alternative splicing: new insights from global analyses. Cell. 2006;126(1):37–47.
2. Zhou L, Hang J, Zhou Y, Wan R, Lu G, et al. Crystal structures of the Lsm complex bound to the 3’ end sequence of U6 small nuclear RNA. Nature. 2014;506(7486):116–20.
3. Sharon D, Tilgner H, Grubert F, Snyder M. A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 2013;31(11):1009–14.
4. Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem. 2003;72:291–336.
5. Matlin AJ, Clark F, Smith CW. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005;6(5):386–98.
6. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456(7221):470–6.
7. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
8. Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5.
9. Maretty L, Sibbesen JA, Krogh A. Bayesian transcriptome assembly. Genome Biol. 2014;15(10):501.
10. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.
11. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol. 2010;28(5):503–10.
12. Feng J, Li W, Jiang T. Inference of isoforms from short sequence reads. J Comput Biol. 2011;18(3):305–21.
13. Li W, Feng J, Jiang T. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol. 2011;18(11):1693–707.
14. Mezlini AM, Smith EJ, Fiume M, Buske O, Savich GL, et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 2013;23(3):519–29.
15. Li W, Jiang T. Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics. 2012;28(22):2914–21.
16. Tomescu AI, Kuosmanen A, Rizzi R, Makinen V. A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinf. 2013;14 Suppl 5:S15.
17. Canzar S, Andreotti S, Weese D, Reinert K, Klau GW. CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol. 2016;17:16.
18. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.
19. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
20. Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 2010;38(14):4570–8.
21. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26(7):873–81.
22. Liu J, Li G, Chang Z, Yu T, Liu B, et al. BinPacker: packing-based de novo transcriptome assembly from RNA-seq data. PLoS Comput Biol. 2016;12(2):e1004772.
23. Chang Z, Li G, Liu J, Zhang Y, Ashby C, et al. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol. 2015;16:30.
24. Birol I, Jackman SD, Nielsen CB, Qian JQ, Varhol R, et al. De novo transcriptome assembly with ABySS. Bioinformatics. 2009;25(21):2872–7.
25. Li RQ, Yu C, Li YR, Lam TW, Yiu SM, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25(15):1966–7.
26. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012;28(8):1086–92.
27. Peng Y, Leung HC, Yiu SM, Lv MJ, Zhu XG, et al. IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels. Bioinformatics. 2013;29(13):i326–334.
28. Steijger T, Abril JF, Engstrom PG, Kokocinski F, Hubbard TJ, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10(12):1177–84.
29. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012;40(20):10073–83.


dx.doi.org/10.1186/s13059-016-1074-1

