evolinc: evolutionary analysis of lincrna

Evolinc: A Computational Pipeline for Comparative Genomic and Transcriptomic Analyses of Long Non-Coding RNAs from large RNA-Seq datasets

Upendra Kumar Devisetty1, Andrew D.L. Nelson2, Kyle Palos2, Asher K Haug-Baltzell1, Eric Lyons1, and Mark A. Beilstein2

1 CyVerse, Thomas W. Keating Bioresearch Building, University of Arizona, 1657 E. Helen Street, Tucson, AZ 85721; 2 Department of Plant Sciences, University of Arizona, 303 Forbes Hall, 1140 E. South Campus Drive, Tucson, AZ 85721

Abstract

Transcriptomic analyses from across eukaryotes have led to the conclusion that most of the genome is transcribed at some point in the developmental trajectory of an organism. One class of these transcripts, termed long non-coding RNAs (lncRNAs), lacks coding potential and is longer than 200 nucleotides (nt). Reported lncRNA repertoires in mammals vary, but commonly range from thousands to tens of thousands of transcripts, accounting for ~90% of the genome. Recent advances in sequencing technology have opened a new era in RNA and DNA studies and generated large datasets, and attention has turned to understanding the evolutionary dynamics of lncRNAs, particularly their conservation within genomes. To facilitate lncRNA discovery and comparative analyses at the genomic and transcriptomic level from large RNA-Seq datasets, we present Evolinc, a computational pipeline that identifies long non-coding RNAs from transcriptome assembly files and then searches for homologs in other species. Using sequence similarity, Evolinc reconstructs families of homologous lncRNAs, aligns the constituent sequences, builds gene trees, and uses gene tree / species tree reconciliation to infer evolutionary processes. The novelty of this computational approach is that it allows the user to investigate factors affecting lincRNA diversity across a large number of species. The approach is scalable, working on one to thousands of lncRNAs, and can perform comparisons between both large and small genomes. Evolinc is useful not only for inferring mechanisms affecting lncRNA diversity, but also for identifying lncRNAs in non-model systems where genome data are available. For ease of use, we have pre-packaged Evolinc into an app available in CyVerse's Discovery Environment.

Long non-coding RNAs:
•  Have little or no coding potential (<100 amino acids).
•  Are lowly expressed (may be tissue specific).
•  On average, display poor sequence conservation and are difficult to predict computationally.
•  Very few have a confirmed function.
•  Are difficult to distinguish from background noise ("functional" versus non-functional long non-coding RNAs).
•  Behave and look similar to protein-coding RNAs.
•  Many are Pol II transcribed, poly-A tailed, and have a 5' TMG cap.
•  Can be multi-exonic.

Long non-coding RNAs make up a large, unknown percentage of eukaryotic transcriptomes

LncRNAs are classified based on genomic location

[Figure: classes of long non-coding RNA by genomic location: long intergenic, gene associated, overlapping, and natural antisense.]

Based on known or predicted functions

Chromatin Associated Scaffold (Cis and Trans)
•  Transcription regulation (both up- and down-regulation).
•  Modification of chromatin state.
•  Link between protein and chromatin.

Molecular Decoy
•  Rapid mechanism by which to titrate away proteins/RNA from another target.
•  SnoRNA precursors in mammals.

RNA and Protein Processing
•  Antisense regulation.
•  RNA modification (SnoRNAs).
•  Translation (tRNAs).
•  Easiest to identify.

Chromatin-Independent Scaffold
•  Bring different protein and RNA components together in one place (telomerase).

A large proportion (>90%) of lncRNAs have no known function, and it is unclear how to group RNAs into these classes without functional testing. Given the abundance and low expression of lncRNAs, novel putative lncRNAs are identified each time RNA-seq is performed. What is needed is a way to filter putative lncRNAs down to a useful set for functional analysis.

Ulitsky and Bartel, 2013; Wang and Chang, 2011; Cabili et al, 2011

Background

Evolinc-I workflow and results. A) LincRNA identification workflow and outputs generated by Evolinc-I. LincRNA identification involves four main steps. Evolinc-I generates an updated genome annotation file that appends the novel lincRNA loci to the user-supplied reference annotation file (useful for differential gene expression analysis), a BED file (for viewing in a genome browser), a FASTA file (containing all isoforms for each lincRNA locus), and a lincRNA demographics file. In addition, Evolinc-I generates a sequence file, a BED file, and a final summary table for sense overlapping transcripts (SOTs) and antisense overlapping transcripts (AOTs). B) Validation of Evolinc-I using the RNA-Seq data originally used to identify lincRNAs in A. thaliana (Liu et al., 2012). Using Evolinc-I, 571 lincRNAs, as well as 2484 overlapping transcripts (584 AOTs and 1900 SOTs), were identified. Of the 571 lincRNAs, 198 were identified by Liu et al., 70 are novel, and 176 were classified by Liu et al. as either repeat-containing (RCTU, contained simple repeats) or gene-associated (GATU, within 500 bp of a known gene) transcription units. Evolinc-I identified all but 17 of the original Liu-lincRNAs. C) Validation of Evolinc-I using the RNA-Seq data originally used to identify lincRNAs in humans (Cabili et al., 2011). Using Evolinc-I, 360 lincRNAs were identified. Of these, 317 were also identified by Cabili et al. and 43 were novel. D) Table showing the exhaustive search for A. thaliana lincRNAs from ~12.5 billion publicly available RNA-Seq reads (159 different sequencing runs) representing 9 different tissues.
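The distinction between lincRNAs, SOTs, and AOTs described above can be illustrated with a minimal sketch. It is not Evolinc-I's own code: the `Interval` type, the `classify` function, the half-open coordinate convention, and the rule that a sense overlap takes precedence over an antisense overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval (0-based start, exclusive end)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def overlaps(a, b):
    """True when two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def classify(transcript, genes):
    """Label a transcript relative to known genes.

    Assumed rule: no overlap -> 'lincRNA'; overlap on the same strand -> 'SOT';
    overlap only on the opposite strand -> 'AOT'.
    """
    hits = [g for g in genes if overlaps(transcript, g)]
    if not hits:
        return "lincRNA"
    if any(g.strand == transcript.strand for g in hits):
        return "SOT"
    return "AOT"


# Toy annotation with a single known gene on the plus strand.
known_genes = [Interval("chr1", 100, 500, "+")]
print(classify(Interval("chr1", 450, 900, "-"), known_genes))
```

In a real run the overlap test would typically be delegated to an interval tool operating on the GTF/BED files rather than a linear scan like this one.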

Evolinc-II: Comparative genomic and transcriptomic analysis of lincRNAs

Conclusions

Evolinc

In summary, we have developed a set of apps that streamline lincRNA identification and evolutionary analysis. Given the wealth of RNA-seq data available at NCBI’s SRA, and the computing capabilities of CyVerse, we believe that Evolinc will prove to be tremendously useful. Combining these resources, Evolinc can uncover broad and fine-scale patterns in the way that lincRNAs evolve and ultimately help in linking lincRNAs to their function.

•  Evolinc is a two-part workflow (Evolinc-I and Evolinc-II) to identify lincRNAs from an assembled transcriptome file (GTF output from Cuffmerge/Cuffcompare) and then determine the extent to which those lincRNAs are conserved in the genome and transcriptome of other species.


1. Filter by length (>200 nt)

2. Filter by ORF length and similarity to known proteins (TransDecoder, BLASTp)

3. Filter out transcripts with high similarity to known TEs (optional)

4. Test for overlap (+/- strand) with known protein-coding (PC) and lincRNA genes
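The first two filters can be sketched in a few lines of Python. This is a simplified stand-in, not the pipeline's implementation: Evolinc-I uses TransDecoder and BLASTp for step 2, whereas the hypothetical `longest_orf_aa` below only scans for complete ATG-to-stop ORFs on both strands. The 200 nt cutoff comes from step 1; using 100 amino acids as the ORF cutoff is an assumption borrowed from the description of lncRNAs elsewhere on this poster.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_aa(seq):
    """Longest complete ORF (ATG..stop) in amino acids, over all six frames.

    ORFs without an in-frame stop codon are ignored in this sketch.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, (i - start) // 3)
                    start = None
    return best


def passes_length_and_orf(seq, min_nt=200, max_orf_aa=100):
    """Keep transcripts longer than min_nt whose longest ORF is below max_orf_aa.

    Both thresholds are illustrative defaults, not confirmed Evolinc-I settings.
    """
    return len(seq) > min_nt and longest_orf_aa(seq) < max_orf_aa


# A 300 nt sequence with no start codon on either strand is retained.
print(passes_length_and_orf("AC" * 150))
```

A homology search against known proteins (step 2) and transposable elements (step 3) would still be needed after this pre-filter; neither is attempted here.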

Input: assembled transcripts

Data generation by Evolinc-I
•  BED file for viewing in genome browser
•  Updated reference genome annotation file for differential expression analysis
•  .tsv: lincRNA demographics
•  .txt: Summary of lincRNAs identified
•  .fasta: Sequence of lincRNAs identified

Sequence, BED, and summary files are also generated for AOT and SOT lncRNAs.

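[Figure, panel A: Evolinc-I lincRNA identification workflow. Additional workflow labels: Evolinc-I on each tissue; merge Evolinc-I analyses.]

[Figure, panel B: transcripts identified by Evolinc-I in A. thaliana: lincRNAs (571), AOTs (584), SOTs (1900). The lincRNAs are divided into Liu-lincRNA (198), Novel (70), and Liu-RCTU and Liu-GATU (197 and 106).]

[Figure, panel C: Cabili RNA-seq processed with TopHat/Cufflinks. Additional Cabili-lincRNA specific filtering: minimum of 3 reads/base, identified in at least two tissues, multi-exonic. Result: 360 lincRNAs (317 Cabili or known, 43 novel).]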

Evolinc-I: Automated lincRNA identification and filtering pipeline










Tissue           Initial "U"    Evolinc-I lincRNAs        Total lincRNAs   Number    Number of
                 transcripts    Novel    Liu or TAIR10    identified       of SRAs   reads mapped
Flowers          23240          436      488              924              24        6.88E+08
Cauline leaves   20887          286      376              662              38        8.56E+09
Roots            21324          351      436              787              22        1.28E+09
Rosettes         19114          134      234              368              8         1.64E+08
SAM              25556          339      353              692              16        2.94E+08
Seedlings        21348          253      335              588              31        5.79E+08
Shoots           20198          189      273              462              7         2.09E+08
Siliques         21719          353      402              755              2         4.71E+08
WP               19563          145      244              389              11        3.19E+08
Total (unique)   ND             1135     1045             2180             159       1.26E+10


Acknowledgements

Family building steps (shown for query lincRNA-1):

1. Reiterative reciprocal BLAST of query lincRNA against all genomes
2. Extract reciprocal sequences, group them by query
3. Annotate sequence homolog based on genome and known lincRNA annotations for each species
4. Align sequences using MAFFT.
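Steps 1 and 2 above can be sketched as a reciprocal-best-hit grouping. This is an illustration only: it does not run BLAST, it performs a single reciprocal pass rather than the reiterative search Evolinc-II describes, and the names `best_hit` and `build_families` as well as the nested-dictionary layout of the hit tables are hypothetical.

```python
def best_hit(hits):
    """Return the ID with the lowest E-value, or None for an empty hit list."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


def build_families(forward, reverse):
    """Group reciprocal best hits by query lincRNA.

    forward[species][query]   -> [(subject_locus, evalue), ...]  query vs. species genome
    reverse[species][subject] -> [(query_locus, evalue), ...]    subject vs. query genome
    Returns query -> {species: subject_locus}.
    """
    families = {}
    for species, per_query in forward.items():
        for query, hits in per_query.items():
            subject = best_hit(hits)
            if subject is None:
                continue
            # Keep the pair only if the subject's best hit points back to the query.
            back = best_hit(reverse.get(species, {}).get(subject, []))
            if back == query:
                families.setdefault(query, {})[species] = subject
    return families


# Toy hit tables for one query (Q1) searched against species A and C.
forward = {
    "A": {"Q1": [("A1", 1e-30), ("A2", 1e-5)]},
    "C": {"Q1": [("C1", 1e-3)]},
}
reverse = {
    "A": {"A1": [("Q1", 1e-28)]},
    "C": {"C1": [("Q9", 1e-10), ("Q1", 1e-3)]},
}
print(build_families(forward, reverse))
```

Each resulting family would then be annotated (step 3) and passed to an aligner such as MAFFT (step 4).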

[Figure, panel A: Evolinc-II workflow. Family building (steps 1 to 4): a query lincRNA (Q1) from the query genome is searched against genomes A to E to assemble a sequence family (Q1, A1, B1, D1, E1); this is repeated for all query lincRNAs (1-1000s). Tree building (optional phylogenetic analysis): a gene tree inferred from the sequence family is combined with a species tree provided by the user to give a reconciled gene tree with duplication/loss events denoted (for example, C1 lost); this is repeated for all sequence families (1-100s).]

Data generated by Evolinc-II
•  .tsv: Family building results for each query lincRNA (x 1)
•  .fasta: Sequence of lincRNA homologs, grouped by query (x 1-1000s)
•  Graph depicting percent recovered loci in each species (x 1)
•  Query-centric BED file depicting conserved regions (x 1)
•  Reconciled gene tree depicting duplication and loss events (x 1-1000s)

[Figure, panel B: bar plot of percent homologous Liu-lincRNA loci recovered in each species (y-axis 0 to 100) under three search criteria: randomized bins (n = 200), grouped by chromosome (n = 5), and a range of E-values (-20 to -1).]

[Figure, panel C: bar plot of percent A. thaliana homologous lncRNA loci recovered in each species (y-axis 0 to 100) for TAIR10 lncRNAs, Liu-lincRNAs, Novel-lincRNAs, and the combined set (from Evolinc-II).]

[Figure, panel D: percent homologous loci recovered (0 to 100) per species, with a divergence-time axis of 10 to 70 mya.]

[Figure, panel E: example reconciled gene trees.]

Species shown in panels B to D: A. tha, A. lyr, C. rub, L. ala, B. rap, B. ole, S. par, E. sal, A. ara, T. has.

Evolinc-II workflow and results. A) Workflow and outputs generated by Evolinc-II. Evolinc-II can be divided into two stages: family building (left) and tree building (right). Family building involves four main steps, which are performed to identify conserved lincRNA loci in other species. If the user chooses, the tree building stage is also run, which allows the user to infer duplication and loss events that have occurred at the lincRNA locus. Evolinc-II generates sequence files containing lincRNA families with all identified sequence homologs, a table that describes the lincRNA loci by depth of conservation, overlapping features, and transcriptional evidence, a query-centric BED file (for viewing in a genome browser), and a reconciled gene tree with predicted duplications and deletions. B) Bar plot showing that Evolinc-II recovers similar percentages of sequence homologs for the Liu-lincRNA dataset under different search criteria. C) Bar plot showing the percent of loci recovered using the Evolinc-I identified lincRNAs in A. thaliana as query. D) Percent of sequence homologs recovered for an A. lyrata lincRNA dataset identified by Evolinc-I from >100 million reads. E) Two examples of output generated by Evolinc-II showing the duplication and loss events for two lincRNA loci.
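The "percent of loci recovered" shown in panels B to D is a simple proportion, sketched below under stated assumptions. The function name `percent_recovered` and the `query -> {species: homolog}` layout of the family table are hypothetical and do not reflect the actual format of the Evolinc-II .tsv output.

```python
def percent_recovered(families, queries, species):
    """Percent of query lincRNAs with at least one homolog in each species.

    families: query -> {species: homolog_id}  (assumed layout)
    queries:  all query lincRNA IDs, including those with no homolog
    species:  species labels to report
    """
    total = len(queries)
    if total == 0:
        return {sp: 0.0 for sp in species}
    return {
        sp: 100.0 * sum(1 for q in queries if sp in families.get(q, {})) / total
        for sp in species
    }


# Toy example: four queries, two of which have a homolog in A. lyr.
families = {
    "Q1": {"A. lyr": "A1", "C. rub": "C1"},
    "Q2": {"A. lyr": "A2"},
}
queries = ["Q1", "Q2", "Q3", "Q4"]
print(percent_recovered(families, queries, ["A. lyr", "C. rub", "B. rap"]))
```

Queries with no recovered homolog must stay in the denominator; otherwise the percentages for distant species are inflated.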

We thank Evan Forsythe (University of Arizona) and Dr. Molly Megraw (Oregon State University) for thoughtful comments pertaining to Evolinc parameters. This project is funded by NSF-MCB #1409251 and NSF-PGRP. We would also like to thank the CyVerse Discovery Environment for hosting the Evolinc apps.

•  Software availability: Evolinc is currently available as two apps in the CyVerse Discovery Environment (https://de.iplantcollaborative.org/de/), along with a tutorial and sample data (https://wiki.cyverse.org/wiki/display/TUT/Evolinc+in+the+Discovery+Environment). Both apps are currently at v1.0, and v2.0 is in progress. Docker images of Evolinc versions 1.0 and 2.0 will soon be available on Docker Hub. Atmosphere images will also be created soon.
