Catalyzing plant science research with RNA-seq

Manjappa, Ph.D. Scholar

Dept. of Genetics & Plant Breeding

UAS, GKVK, Bengaluru, India

Central dogma of molecular biology

Transcriptome

(mRNA, rRNA, tRNA, and other non-coding RNA)

Why study the transcriptome?

It reflects the genes that are being actively expressed at any given time (expression profiling)

The expression level of mRNAs in a given cell population varies

It shows how an organism adapts to developmental cues and environmental fluctuations.

Aim of transcriptomics

Quantify the changing expression levels of each transcript during development and under different conditions

Catalogue all species of transcript (mRNAs, non-coding RNAs & small RNAs)

Determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications

Microarray Technology

Limitations of the microarray technique

Reliance upon knowledge of genome sequence

High background levels owing to cross hybridization

Limited dynamic range of detection owing to background & saturation of signals

Comparing expression levels across different experiments is often difficult & can require complicated normalization methods

Sanger sequencing of cDNA or EST libraries:

- Relatively low throughput, expensive & generally not quantitative

Tag-based methods (SAGE, CAGE & MPSS):

- High throughput & precise, ‘digital’ gene expression levels

- Most are based on expensive Sanger sequencing technology, & a significant portion of the short tags cannot be uniquely mapped to the reference genome

- Only a portion of the transcript is analyzed and isoforms are generally indistinguishable from each other

(Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genet. 10, 57-63, 2009)

Next generation sequencing (NGS)

Sample preparation

Data analysis: mapping reads, visualization (Gbrowser), de novo assembly, quantification

RNA-sequencing

RNA-seq vs. microarray

• RNA-seq can be used to characterize novel transcripts and splicing variants as well as to profile the expression levels of known transcripts (but hybridization-based techniques are limited to detect transcripts corresponding to known genomic sequences)

• Detects a large dynamic range of expression levels (9,000-fold) compared with microarrays (one hundred- to a few hundred-fold)

• RNA-seq has higher resolution than whole genome tiling array analysis

• In principle, RNA-seq can achieve single-base resolution, whereas the resolution of a tiling array depends on the density of probes

• High levels of reproducibility, for both technical and biological replicates

• RNA-seq can apply the same experimental protocol to various purposes, whereas specialized arrays need to be designed in these cases

• Detecting SNPs (needs SNP array otherwise)

• Mapping exon junctions (needs junction array otherwise)

• Detecting gene fusions (needs gene fusion array otherwise)

RNA-seq and microarray agree fairly well only for genes with medium levels of expression

Saccharomyces cerevisiae cells grown in nutrient-rich media: correlation is very low for genes with either low or high expression levels.

Advantages of RNA-Seq compared with other transcriptomics methods

RNA Seq helps to look at

Alternatively spliced gene transcripts

Post-transcriptional modifications

Gene fusion

Mutations/SNPs

Changes in gene expression

• Used to determine exon/intron boundaries

• Verify or amend previously annotated 5’ and 3’ gene boundaries.

• Also includes miRNA, tRNA, and rRNA profiling

Alternative splicing

Library construction

RNA fragmentation (RNA hydrolysis or nebulization) & cDNA fragmentation (DNase I treatment or sonication)

Bioinformatic challenges

Development of efficient methods to store, retrieve and process large amounts of data; these are needed to reduce errors in image analysis and base-calling and to remove low-quality reads.

Challenges for RNA-Seq

bias at depleted 5′ and 3′ ends

bias at 3′ ends

Applications of RNA Seq

• First draft of Arabidopsis thaliana genome sequence (2000); its annotation continues to be improved

• Large amounts of Sanger sequencing-generated EST data provided the initial basis for gene identification and expression profiling

EST data are expensive, time consuming to generate, inherently biased against low-abundance transcripts & typically enriched in transcript termini

• RNA-seq circumvents these limitations and provides accurate resolution of splice junctions and alternative splicing events

• Arabidopsis transcriptome survey using Illumina shows

- At least 42% of intron-containing genes are alternatively spliced (Filichkin et al., 2010)

- 61% when only multi-exonic genes are sampled

- ~48% of rice genes (Lu et al.,2010)

1. IMPROVING GENOME ANNOTATION WITH TRANSCRIPTOMIC DATA

Contd…

• Mining RNA seq data in search of TSS variation is improving gene structure annotation and alternative TSSs have been detected in ∼10,000 loci in Arabidopsis and rice

(Tanaka et al., 2009).

• An ideal genome annotation would identify

Genes that show invariant transcript sequences

Those that exhibit alternative splicing and

Link these events to specific spatial, temporal, developmental, and/or environmental cues.

• Abiotic stress in Arabidopsis can increase or decrease the proportions of apparently unproductive isoforms for some key regulatory genes, supporting the view that alternative splicing is an important mechanism in the regulation of gene function

(Filichkin et al., 2010)

Contd…

• Polymorphism between different A. thaliana accessions occurs at one SNP per ∼200 bp.

• Complete re-sequencing of the transcriptomes and annotation of different accessions helps to interpret the functional consequences of polymorphism

• Utilizing genomic and transcriptomic data for in silico gene prediction results in a more reliable annotated genome, with information on SNPs, indels, splice variants and expression variation

Generating genomic and enabling proteomic resources for “non-model” species

• Published plant genome sequences represent a very small fraction of plant taxonomic diversity

• Study of “non-model” species is challenging

• De novo sequencing of the transcriptome to generate genetic resources:

1. Eucalyptus (Mizrachi et al., 2010)

2. Garlic (Sun et al., 2012)

3. Pea (Franssen et al., 2011)

4. Chestnut (Barakat et al., 2009)

5. Chickpea (Garg et al., 2011)

6. Olive (Alagna et al., 2009)

7. Safflower (Lulin et al., 2012)

8. Japanese knotweed (Hao et al., 2011)

Gene annotation relies on identifying homologs, & ideally orthologs, in species with an annotated genome (if no appropriate EST databases are available)

If not, the A. thaliana genome sequence is used (gold standard)

Further confirmation; interrogating additional plant databases

Annotation with pre-existing EST database Eg: melon (Dai et al., 2011)

Same function

different function

• De novo RNA-seq to identify genetic polymorphisms (molecular breeding), wherein multiple cultivars or closely related species with variations in traits of interest are sequenced and genetic variation is identified.

Allows generation of molecular markers to facilitate progeny selection and molecular genetics research

Ex: 12,000 SSRs in a single RNA-seq analysis of sesame (earlier only 80 SSRs), on average 1 genic-SSR per ∼8 kb

(Zhang et al., 2012)

5,234 SNPs in transcriptomes of five winter rye inbred lines. Used in a high-throughput SNP genotyping array

(Haseneyer et al., 2011)

• Comparative sequence analysis of radish RNA-seq data and the Brassica rapa genome sequence led to the discovery of 14,641 SSRs

Contd…

(Wang et al., 2012)

RNA Seq application to advance the field of proteomics.

• Effective proteome profiling is generally considered to depend heavily on the availability of a high-quality DNA reference database

• High-throughput mass spectrometry-based protein identification relies on the availability of an extensive DNA sequence database in order to match experimentally determined peptide masses with the theoretical proteome generated by computationally translating transcripts

• RNA-seq-based transcriptome profiling can provide an effective data set for proteomic analysis of non-model organisms

“RNA-Seq facilitates the matching of peptide mass spectra with cognate gene sequence”

• To test this, a quantitative analysis was made of the proteomes of pollen from domesticated tomato (Solanum lycopersicum) and two wild relatives

• RNA-Seq (454 pyrosequencing); >1200 proteins were identified

No major qualitative or quantitative differences were observed in the characterized proteomes when using either a highly curated community database of tomato sequences or the RNA-Seq database

Characterizing temporal, spatial, regulatory, and evolutionary transcriptome landscapes

Temporal transcriptome

• RNA-seq is increasingly being adopted to examine transcriptional dynamics

• An analysis of the transcriptome of grape berries during three stages of development identified >6,500 genes that were expressed in a stage-specific manner (Zenoni et al., 2010)

• In radish, >21,000 genes were differentially expressed at two developmental stages of roots, including genes strongly linking root development with starch and sucrose metabolism and with phenylpropanoid biosynthesis.

(Wang et al., 2012)

Objective: To understand the molecular mechanisms underlying tuberous root formation and development.

• Material: radish (R. sativus) cultivar ‘Weixianqing’.

• Samples: hypocotyl (1 cm, 7 DAS) & true root (1 cm, 20 DAS; RLSS, the stage of cortex splitting); 10 seedlings of each were pooled together

• Illumina paired-end sequencing technology, GAII platform (BGI; Shenzhen, China)

• Gene annotation: comparative genome analysis between radish and Brassica rapa. Unigenes were aligned with sequences in the NCBI non-redundant protein (Nr) database, Swiss-Prot protein database, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database & Cluster of Orthologous Groups (COG) database using BLASTx; annotation was done with the Blast2GO program

• Sequence similarity search was conducted against the NCBI Nr (85.51 %), Nt (90.18%) and Swiss-Prot protein databases (54%) using the BLASTx algorithm

• 21,109 unigenes were assigned GO terms.

Functional annotation of all non-redundant unigenes

Gene Ontology classification of assembled unigenes

(9,271; 43.92 %)

Transcript differences between RESS and RLSS

13,453

8,389

To understand the functions of the DEGs, all DEGs were mapped to terms in the KEGG database; 29 pathways were significantly enriched.

These cover the metabolism of carbohydrates, energy, lipids, amino acids, other amino acids, and terpenoids and polyketides, and the biosynthesis of other secondary metabolites.

Starch and sucrose metabolism (303) and phenylpropanoid biosynthesis (177) were the two predominant groups.

20 (starch and sucrose metabolism) and 25 (phenylpropanoid biosynthesis) unigenes were significantly up-regulated and play critical roles in regulating radish tuberous root formation. This also confirms the finding that radish root is rich in carbohydrates and phenolic compounds.

EST-derived SSR detection
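As an illustration of what SSR detection in assembled transcripts involves, the following is a minimal Python sketch that scans a sequence for perfect repeats. The function name `find_ssrs`, the motif length range (2-6 bases) and the `min_repeats` threshold are assumptions made for this example; they are not the criteria or the software used in the radish study.

```python
import re

def find_ssrs(sequence, min_repeats=4):
    """Find perfect SSRs: motifs of 2-6 bases repeated at least min_repeats times.

    Returns a list of (motif, repeat_count, start_index) tuples.
    The thresholds are illustrative assumptions, not published criteria.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Lazy motif match so that the shortest repeating unit is tried first
    # (AG is reported rather than AGAG).
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((motif, repeats, match.start()))
    return hits

# Hypothetical toy sequence containing an (AG)5 repeat.
example_hits = find_ssrs("TTAGAGAGAGAGCC")
```

Real SSR pipelines usually apply different minimum repeat numbers per motif length and also handle compound repeats; this sketch only covers the simplest case.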

• Previous gene expression studies used EST sequencing, spotted microarrays & Affymetrix GeneChip technology (all based on prior sequence knowledge)

• These provide only a fragmented picture of transcript accumulation patterns.

• RNA-Seq was applied to 7 tissues (leaf, flower, pod, two stages of pod-shell, root, nodule) & 7 seed development stages of a BC5F5 G. max plant

• Transcript reads were compared with the recent genome sequence (assembly Glyma1.01)

• Potential model for future RNA-Seq atlases

Mapping of short-read sequences:

• Illumina Genome Analyzer II: produced 5.8-8.9 million 36-bp reads for 7 non-seed tissues & 2.7-9.6 million 36-bp reads for seed tissues

• Alignment program GSNAP was used to map the reads to two reference genomes: G. max and Bradyrhizobium japonicum.

• Digital gene expression analysis: 46,430 genes identified as “high confidence” (correlation to full length cDNAs, ESTs, homology, & ab initio methods)

• Of which 41,975 (90.4%) genes were transcriptionally active

Expression and gene structure

Coding regions of transcriptionally inactive genes were smaller and had a lower GC content

Tissue-specific analysis of the soybean transcriptome

Hierarchical clustering of transcriptional profiles in 14 tissues.

Cluster labels: early seed development stages; late seed development stages; aerial tissues; underground tissues

Relative expression levels based on Z-score analysis (Z-scores of 3.4-3.6: more tissue specific)

Z = (X-μ)/sd

Figure label: tissue specific
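The Z-score formula above can be sketched in Python for a single gene measured across several tissues. This is a minimal illustration under stated assumptions: the function name `tissue_z_scores`, the example values and the use of the population standard deviation are choices made here, since the slide does not say which standard deviation was used.

```python
from statistics import mean, pstdev

def tissue_z_scores(expression):
    """Return Z = (X - mu) / sd per tissue for one gene.

    expression: dict mapping tissue name -> expression value.
    The population standard deviation is an assumption of this sketch.
    """
    values = list(expression.values())
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # Identical expression everywhere: no tissue stands out.
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (x - mu) / sd for tissue, x in expression.items()}

# Hypothetical gene that is expressed mainly in seed.
example = {"leaf": 2.0, "root": 2.0, "nodule": 2.0, "seed": 10.0}
z = tissue_z_scores(example)
```

A gene expressed in only one of many tissues gets a large positive Z-score in that tissue and small negative scores elsewhere, which is why high Z-scores are read as tissue specificity.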

Heatmap of the legume-specific genes

Heatmap of the top 500 highest expressed genes

Some legume-specific genes have tissue-specific transcription

Glyma06g08290

Glyma04g08220 (Oleosin)

Glyma02g01590 (lectin precursor 1)

General trends in expression profiles for all genes: tissue-by-tissue comparison (Fisher's exact test)

Higher transcriptional level
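A tissue-by-tissue comparison with Fisher's exact test can be sketched as below. This is an illustration only: it assumes a 2x2 table of read counts (reads for one gene versus all other reads, in two tissues), which the slide does not spell out, and the function name `fisher_exact_two_sided` is hypothetical. It is written for small counts; real read counts would call for a statistics library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout for this sketch: a, b = reads for the gene and for all
    other genes in tissue 1; c, d = the same counts in tissue 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: 12 of 100 reads in tissue 1 versus 3 of 100 in tissue 2.
p_value = fisher_exact_two_sided(12, 88, 3, 97)
```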

Importance: understand gene functions and the molecular processes occurring between two stages. GO Slim analysis: between seed 25 DAF & seed 28 DAF, and between seed 35 & 42 DAF, nutrient reservoir activity & urease activity were stably expressed; these are important activities in seed development.

Gene structure and tissue-specific gene expression

• Underground tissues have a larger first exon; aerial tissues have a higher number of exons.

• Significant difference in total transcription length among tissues, due to varying intron length.

• No significant relationship between GC content and tissue specificity

Boxplot and dendrogram of preferentially expressed genes in seed development

RPKM-normalized, log2-transformed gene expression profiles
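RPKM normalization followed by a log2 transform can be sketched in Python as follows. The RPKM definition (reads per kilobase of transcript per million mapped reads) is standard; the pseudocount of 1 added before the log and the function names are assumptions of this example, since the slide does not state how zero counts were handled.

```python
import math

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def log2_rpkm(read_count, gene_length_bp, total_mapped_reads, pseudocount=1.0):
    """log2-transformed RPKM; the pseudocount (assumed here) avoids log2(0)."""
    return math.log2(rpkm(read_count, gene_length_bp, total_mapped_reads) + pseudocount)

# Hypothetical gene: 500 reads on a 2,000-bp transcript in a library of 5 million mapped reads.
value = rpkm(500, 2000, 5_000_000)
```

Dividing by gene length and library size makes values comparable between long and short genes and between tissues sequenced to different depths.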

Summary

• RNA Seq-Atlas provides

• A record of high-resolution gene expression in a set of 14 diverse tissues

• Hierarchical clustering of transcriptional profiles for 14 tissues

• Relationship between gene structure and gene expression

• Tissue-specific gene expression of both the most highly-expressed genes and the genes specific to legumes in seed development and nodule tissues

• A means of evaluating existing gene model annotations for the Glycine max genome

Spatial transcriptome

• Most RNA-seq analyses target whole organs, or sets of organs, which inherently prevents the identification of cell- or tissue-type-specific transcripts, and thus of spatially coordinated structural and regulatory gene networks.

• RNA-seq analysis of discrete tissues or cell types provides spatial information and increases the depth of sequence coverage

Ex: >1000 genes are specifically or preferentially expressed in Arabidopsis male meiocytes

• Acquiring tissue or cell-specific samples with any degree of precision and minimal contamination is often technically difficult

Methods of isolation of single cells

Contd…

• Matas et al. (2011): LCM + RNA-seq (454 pyrosequencing) transcriptomes of 5 principal tissues of the developing tomato fruit pericarp.

~21,000 unigenes were identified & more than half (57%) showed ubiquitous expression, while others showed cell type-specific expression

• Takacs et al. (2012): LCM + RNA-seq (Illumina-based NGS) study of the ontogeny of maize shoot apical meristem

59% of genes expressed ubiquitously

• A number of mammalian tissues have also been shown to have a high proportion of ubiquitously expressed transcripts

“These studies may indicate that this is a common feature of eukaryotes” (Ramsköld et al., 2009).

To study plant responses and adaptations to abiotic and biotic stresses

Aim: Elucidate genes and gene networks that contribute to sorghum’s tolerance to water-limiting environments, with a long-term aim of developing strategies to improve plant productivity under drought.

Discovered >50 previously unknown drought-responsive genes.

Transcript Analysis in Response to ABA and Osmotic Stress

Treatment   Up       Down

ABA         ~2,300   ~2,600

PEG         ~1,650   ~700

Method (figure labels): 20 μM; 8th day; 57.1 μM; 20% PEG-8000

ABA in response to plant stress, and its central role in other pathways (dormancy in leaf & seed)

Figure labels: LEA protein; WSI18 protein; dehydrin; sugar; substrate transporter; peroxidase 6

Networks of hormone pathways in ABA-treated plants

A, brassinosteroid biosynthesis

B, cytokinins degradation

C, cytokinins glucoside biosynthesis

D, ent-kaurene biosynthesis

E, ethylene biosynthesis from methionine

F, gibberellin biosynthesis

G, gibberellin inactivation

H, IAA conjugate biosynthesis

I, jasmonic acid

Panels: shoots; roots

Box = hormone-related

Circle = non-hormone-related

Colours: down; up

DE genes: dark blue solid lines = ≥10; blue long-dashed lines = 6-9; light blue short-dashed lines = ≤5

Only the brassinosteroid and JA biosynthesis pathways, and cytokinin glucoside and IAA conjugate biosynthesis pathways are directly connected via DE genes.

Indirect ‘cross-talk’ between the various hormones in response to osmotic stress and ABA

Determining the genes of unknown function that respond to drought or ABA treatment across species

Decision tree used to determine which genes and their orthologs were regulated by drought/ABA across different species

Overlap of drought-responsive sorghum genes of unknown function that had drought-responsive orthologs of unknown function in other species

(51) (82)

(183)

• RNA-seq was used to reveal massive changes in metabolism and cellular physiology of the green alga Chlamydomonas reinhardtii when the cells become deprived of sulfur

• Studies of plant responses to pathogens

Ex: sorghum and Bipolaris sorghicola (Mizuno et al., 2012)

• Complexities of the metabolic pathways associated with plant defense mechanisms

Study plant evolution and polyploidy.

• A comparison of the leaf transcriptome of an allopolyploid relative of soybean with two species that contributed to its homoeologous genome allowed the determination of the contribution of the different genomes to the transcriptome (Ilut et al., 2012)

• Maize endosperm transcriptome analysis discovered 179 imprinted genes and 38 imprinted long ncRNAs (Zhang et al., 2011)

• Transcriptomes of 9 distinct tissues of three species of the Poaceae family (Brachypodium, sorghum & rice) were used to determine whether orthologous genes from these three species exhibit the same expression patterns (Davidson et al., 2012)

Only a fraction of orthologous genes exhibit conserved expression patterns

Orthologs in syntenic genomic blocks are more likely to share correlated expression patterns compared with non-syntenic orthologs.

These findings are important for crop improvement (seq transfer)

Hierarchical clustering of 27 tissues (9 tissues x 3 species) based on correlations of log2 FPKM mapped expression values of 3-taxa single-copy (3 x 3) genes

Classification of Brachypodium, rice, and sorghum genes into orthologous groups

Clustering of corresponding tissues

Extensive expression divergence within 3 x 3 genes

Red: single copy (2 x 2 & 3 x 3)

Black: multicopy (2 x N & 3 x N)

OrthoMCL

Genes within each k-means co-expression cluster were categorized based on OrthoMCL category assignments or as lineage-specific single-copy (1 x 1) genes.

Co-expression analyses identify conservation of expression among orthologs and paralogs

Proportions of genes with at least one corresponding paralog or ortholog in the same cluster

A portion of Poaceae orthologs and paralogs share the same expression patterns across reproductive tissues

Some genes exhibited different expression phenotypes

Similar expression pattern in Poaceae

Which biological processes were over-represented in each ortholog/paralog category?

Ortholog/paralog category: gene ontology (GO) annotation

2 x N genes: stress-related functions (‘response to biotic stimulus’, ‘defense response’, ‘apoptosis’), lipid transport, secretion (‘exocytosis’), and general oxidation–reduction reactions.

3 x N genes (higher substitution rates): core metabolic functions (‘translation’, ‘ATP biosynthesis’, ‘nucleosome assembly’, ‘biosynthetic process’) & oxidation–reduction, response to wounding, sexual reproduction.

3 x 3 genes: essential functions: ‘regulation of transcription’ (>1000 genes), ‘protein folding’ (253 genes), ‘intracellular protein transport’ (123 genes), and ‘glycolysis’ (91 genes)

2 x 2 genes: ‘protein amino acid phosphorylation’, ‘regulation of transcription’ & ‘response to oxidative stress’

Relationship between synteny and expression patterns of orthologs

Syntenic gene pairs within collinear blocks of at least five genes were identified for all pairwise combinations of three Poaceae species

Distributions of Pearson’s correlation coefficients (PCC)

Synteny plays a significant role in evolution of gene expression, especially in the case of duplicate and multicopy genes
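The Pearson's correlation coefficient used to compare expression patterns of an ortholog pair can be sketched in Python as below. This is a minimal illustration: the function name `pearson` and the example profiles are hypothetical, and each list is assumed to hold one gene's expression values across the same ordered set of tissues.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)

# Hypothetical log2 expression values of an ortholog pair across four tissues.
rice_gene = [1.0, 3.5, 2.0, 6.0]
sorghum_gene = [0.8, 3.0, 2.5, 5.5]
pcc = pearson(rice_gene, sorghum_gene)
```

Computing this value for every syntenic and every non-syntenic ortholog pair gives the two distributions that the slide compares.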

Identifying and characterizing novel non-coding RNAs

• In silico analysis provides a rapid way to identify putative sRNA genes

• RNA-seq technology represents an excellent means for sRNA discovery and validation

• Characterization of miRNA regulatory functions is facilitated by determining tissue-specific expression patterns

• RNA-seq was used to identify sRNAs from five Arabidopsis root tissues.

Some sRNAs were expressed in all 5 tissues while others were tissue and developmental zone specific

• The frequency of alternative splicing at miRNA binding sites is significantly higher than that at other regions, suggesting that alternative splicing is a significant regulatory mechanism.

• sRNAs have been recently characterized in the context of association with epigenome modifications, including cytosine methylation of genomic DNA

From co-expression networks to integrative data analysis

• Sequencing whole transcriptomes provides a high degree of detail, but deriving useful biological information from a long list of expressed genes is typically not trivial

• One approach is to construct networks of co-expressed genes and to use gene ontology (GO) information to help highlight important gene candidates as critical components of functional networks

• Gene ontology enrichment analysis of RNA-seq data often illustrates the complexity of interacting pathways

Robust functional networks: transcriptome (RNA-seq), proteomics, metabolomics

No correlation, ex: soybean protein

Correlation, ex: oil palm mesocarp fatty acid

Bulked Segregant RNA-Seq

SNP 2 being closely related to the mutation to map

Linkage disequilibrium between markers and the causal gene is determined by quantifying the allelic frequencies between two samples

Advantages:

(i) Having a reference genome is not a prerequisite

(ii) Markers can be generated from the experimental data

(iii) Differential expression profiles

(iv) Info on effects of mutant on global patterns of gene expression

(v) Provides map position of a gene

Liu et al. (2012)

BSA requires polymorphic markers

>64,000 SNPs

Two alleles of a given SNP site should be detected in approximately equal numbers of RNA-Seq reads when considering both pools of RNA-Seq data.

Only one allele of a SNP that is completely linked to the causal gene should be present among the RNA-Seq reads from the mutant pool.

In practice, as a consequence of allele-specific expression and sampling bias in genes expressed at low levels, a single allele of many SNPs is detected in the mutant pool.

An empirical Bayesian approach is used to estimate the linkage probability, i.e. the probability of a SNP exhibiting complete linkage disequilibrium with the causal gene.
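The expectations above can be turned into a naive first-pass filter, sketched here in Python. This is not the empirical Bayesian method of Liu et al. (2012); it only flags SNPs for which the mutant pool shows a single allele while both alleles are seen across the two pools. The function name, the tuple layout and the `min_depth` threshold are assumptions made for this example.

```python
def candidate_linked_snps(snps, min_depth=10):
    """Flag SNPs whose mutant-pool reads all carry one allele.

    snps: dict mapping SNP name -> (mut_ref, mut_alt, wt_ref, wt_alt) read counts.
    min_depth: minimum mutant-pool read depth (an illustrative assumption) to
    reduce false positives caused by sampling bias in lowly expressed genes.
    """
    candidates = []
    for name, (mut_ref, mut_alt, wt_ref, wt_alt) in snps.items():
        if mut_ref + mut_alt < min_depth:
            continue  # too few reads to trust a single-allele observation
        single_allele_in_mutant = (mut_ref == 0) != (mut_alt == 0)
        both_alleles_overall = (mut_ref + wt_ref) > 0 and (mut_alt + wt_alt) > 0
        if single_allele_in_mutant and both_alleles_overall:
            candidates.append(name)
    return candidates

# Hypothetical read counts for three SNPs.
example_snps = {
    "snp1": (0, 20, 9, 11),   # only the alternative allele in the mutant pool
    "snp2": (10, 12, 8, 9),   # both alleles in the mutant pool: unlinked
    "snp3": (3, 0, 5, 4),     # single allele, but depth too low to trust
}
linked = candidate_linked_snps(example_snps)
```

A hard filter like this discards the uncertainty that a probabilistic estimate keeps, which is one reason a linkage probability per SNP is more useful for mapping.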

gl3-ref allele in a non-B73/B73

Reference genome

The top 10 windows with the highest median linkage probability were located at physical position ~183.5–185.2 Mb.
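Ranking windows by median linkage probability can be sketched in Python as follows. The window definition (a fixed number of consecutive SNPs, sliding by one SNP), the function name `top_windows` and the toy data are assumptions of this example; the slide does not give the window size used in the study.

```python
from statistics import median

def top_windows(snps, window_size, top_n):
    """Rank sliding windows of consecutive SNPs by median linkage probability.

    snps: list of (position_bp, linkage_probability), sorted by position.
    Returns up to top_n tuples of (median_probability, start_bp, end_bp).
    """
    windows = []
    for i in range(len(snps) - window_size + 1):
        chunk = snps[i:i + window_size]
        med = median(prob for _, prob in chunk)
        windows.append((med, chunk[0][0], chunk[-1][0]))
    # Highest median first; ties keep their order along the chromosome.
    windows.sort(key=lambda w: w[0], reverse=True)
    return windows[:top_n]

# Hypothetical SNP positions and linkage probabilities.
toy_snps = [(100, 0.1), (200, 0.2), (300, 0.9), (400, 0.8), (500, 0.95), (600, 0.1)]
best = top_windows(toy_snps, window_size=3, top_n=1)
```

Using the median rather than the mean keeps a single spurious high-probability SNP from pulling a window to the top of the ranking.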

Fine mapping of gene.

1. Mutant gene expression will often be down-regulated compared to the WT pool.

2. Collections of SNPs tightly linked to the mutant gene

3. SNPs linked to the mutated gene can be used for gene cloning via chromosome walking.

• Not necessary to use tissue with mutant gene expression for BSR-Seq.

• However, if we collect tissue with expression we can also get additional expression data.

Resolution of mapping depends on

1. Number of individuals included in the bulks

2. Sequencing depth

3. Density of polymorphisms in the mapping population

• International multi-disciplinary consortium; transcriptome data for 1,000 plant species

• It is a PPP project: 75% of funding from the Govt. of Alberta and 25% from Musea Ventures. BGI-Shenzhen provides sequencing at reduced cost & the iPlant Collaborative provides computational informatics.

• Objectives:

1. Resolve many of the lingering uncertainties in species relationships, especially in the early lineages of streptophyte green algae and land plants

2. To identify gene changes associated with the major innovations in Viridiplantae evolution, such as multi-cellularity, transitions from marine to freshwater or terrestrial environments, maternal retention of zygotes and embryos, complex life history involving haploid and diploid phases, vascular systems, seeds and flowers

• Species selection: representatives of all major lineages across the Viridiplantae (green plants), representing ~1 billion years of evolution, including flowering plants, conifers, ferns, mosses and streptophyte green algae.

Resources available

1. Access to raw and processed data:

Content; transcriptome assemblies, putative coding sequences, orthogroups and gene and species trees with related sequence alignments.

2. High performance computing and cloud-based services:

iPlant discovery environment (DE) web interface (tutorials and teaching materials available)

Phenylpropanoid synthesis pathway for Colchicum autumnale. Labelled rectangles are proteins. Small circles are metabolites. Black lines show the KEGG pathway. Red lines show the BioGRID interactions emanating from protein (K12355), which was interactively selected. A right-click on the protein will display the inferred function and a link to the sequence(s).

Interactions & pathways

Conclusion

• RNA-sequencing is now well-established as a versatile platform with applications in an ever growing number of fields of plant biology research

• Ongoing developments in sequencing technologies, such as increased read lengths and greater numbers of reads per run, together with advanced computational tools to facilitate sequence assembly, analysis, and integration with orthogonal data sets, will further accelerate the breadth and frequency of its adoption by plant scientists
