Manjappa Ph. D. Scholar
Dept. of Genetics & Plant Breeding
UAS, GKVK, Bengaluru, India
Catalyzing plant science research with RNA-seq
1
Central dogma of molecular biology
Transcriptome
(mRNA, rRNA, tRNA, and other non-coding RNA)
2
Why study the transcriptome?
It reflects the genes that are being actively expressed at any given time (expression profiling)
Expression levels of mRNAs in a given cell population vary
It shows how an organism adapts to developmental cues and environmental fluctuations.
3
Quantify the changing expression levels of each transcript during development and under different conditions
Catalogue all species of transcript (mRNAs, non-coding RNAs & small RNAs)
Determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications
Aim of transcriptomics
4
Microarray Technology
5
Limitations of the microarray technique
Reliance upon existing knowledge of the genome sequence
High background levels owing to cross-hybridization
Limited dynamic range of detection owing to background and saturation of signals
Comparing expression levels across different experiments is often difficult and can require complicated normalization methods
Sanger sequencing of cDNA or EST libraries:
- Relatively low throughput, expensive and generally not quantitative
Tag-based methods (SAGE, CAGE and MPSS):
- High throughput and precise, ‘digital’ gene expression levels
- Most are based on expensive Sanger sequencing technology, and a significant portion of the short tags cannot be uniquely mapped to the reference genome
- Only a portion of the transcript is analyzed, and isoforms are generally indistinguishable from each other
6
(Wang et al., RNA-Seq: a revolutionary tool for transcriptomics, Nat. Rev. Genet. 10, 57-63, 2009)
Next generation sequencing (NGS)
Sample preparation
Data analysis:
- Mapping reads
- Visualization (Gbrowser)
- De novo assembly
- Quantification
RNA-sequencing
7
RNA-seq vs. microarray
• RNA-seq can be used to characterize novel transcripts and splicing variants as well as to profile the expression levels of known transcripts, whereas hybridization-based techniques are limited to detecting transcripts that correspond to known genomic sequences
• Detects a larger dynamic range of expression levels (9,000-fold) than microarrays (one hundred to a few hundred fold)
• RNA-seq has higher resolution than whole-genome tiling array analysis
• In principle, RNA-seq can achieve single-base resolution, whereas the resolution of a tiling array depends on the density of probes
• High levels of reproducibility, for both technical and biological replicates
• RNA-seq can apply the same experimental protocol to various purposes, whereas specialized arrays need to be designed in these cases:
- Detecting SNPs (needs a SNP array otherwise)
- Mapping exon junctions (needs a junction array otherwise)
- Detecting gene fusions (needs a gene fusion array otherwise)
8
RNA-seq and microarray agree fairly well only for genes with medium levels of expression
Saccharomyces cerevisiae cells grown in nutrient-rich media. Correlation is very low for genes with either low or high expression levels.
9
Advantages of RNA-Seq compared with other transcriptomics methods
10
RNA-Seq helps to look at
Alternatively spliced gene transcripts
Post-transcriptional modifications
Gene fusion
Mutations/SNPs
Changes in gene expression
• Used to determine exon/intron boundaries
• Verify or amend previously annotated 5’ and 3’ gene boundaries.
• Also includes miRNA, tRNA, and rRNA profiling
11
Alternative splicing
12
Library construction
RNA fragmentation (RNA hydrolysis or nebulization) & cDNA fragmentation (DNase I treatment or sonication)
Bioinformatic challenges
Development of efficient methods to store, retrieve and process large amounts of data, in order to reduce errors in image analysis and base-calling and to remove low-quality reads.
13
Challenges for RNA-Seq
RNA fragmentation: coverage is depleted at the 5′ and 3′ ends
cDNA fragmentation: coverage is biased towards the 3′ ends
14
Applications of RNA Seq
• First draft of Arabidopsis thaliana genome sequence (2000); its annotation continues to be improved
• Large amounts of Sanger sequencing-generated EST data provided the initial basis for gene identification and expression profiling
- ESTs are expensive and time consuming to generate, inherently biased against low-abundance transcripts, and typically enriched in transcript termini
• RNA-seq circumvents these limitations and provides accurate resolution of splice junctions and alternative splicing events
• Arabidopsis transcriptome survey using Illumina shows
- At least 42% of intron-containing genes are alternatively spliced (Filichkin et al., 2010)
- 61% when only multi-exonic genes are sampled
- ~48% of rice genes (Lu et al.,2010)
1. IMPROVING GENOME ANNOTATION WITH TRANSCRIPTOMIC DATA
15
Contd…
• Mining RNA-seq data in search of transcription start site (TSS) variation is improving gene structure annotation; alternative TSSs have been detected in ∼10,000 loci in Arabidopsis and rice
(Tanaka et al., 2009).
• An ideal genome annotation would identify
Genes that show invariant transcript sequences
Those that exhibit alternative splicing and
Link these events to specific spatial, temporal, developmental, and/or environmental cues.
• Abiotic stress in Arabidopsis can increase or decrease the proportions of apparently unproductive isoforms for some key regulatory genes, which supports the view that alternative splicing is an important mechanism in the regulation of gene function
(Filichkin et al., 2010)
16
Contd…
• Polymorphism between different A. thaliana accessions amounts to about one SNP per ∼200 bp.
• Complete re-sequencing of the transcriptomes and annotation of different accessions helps to interpret the functional consequences of polymorphism
• Utilizing genomic and transcriptomic data for in silico gene prediction results in a more reliably annotated genome, with information on SNPs, indels, splice variants and expression variation
17
Generating genomic and enabling proteomic resources for “non-model” species
• Published plant genome sequences represent a very small fraction of plant taxonomic diversity
• Study of “non-model” species is challenging
• De novo sequencing of the transcriptome to generate genetic resources:
1. Eucalyptus (Mizrachi et al., 2010)
2. Garlic (Sun et al., 2012)
3. Pea (Franssen et al., 2011)
4. Chestnut (Barakat et al., 2009)
5. Chickpea (Garg et al., 2011)
6. Olive (Alagna et al., 2009)
7. Safflower (Lulin et al., 2012)
8. Japanese knotweed (Hao et al., 2011)
Gene annotation relies on identifying homologs, and ideally orthologs, in species with an annotated genome (if no appropriate EST databases are available)
If not, the A. thaliana genome sequence (gold standard)
Further confirmation: interrogating additional plant databases
Annotation with a pre-existing EST database, e.g. melon (Dai et al., 2011)
Same function
different function
18
Contd…
• De novo RNA-seq to identify genetic polymorphisms (molecular breeding): multiple cultivars or closely related species that vary in traits of interest are sequenced, and the genetic variation between them is identified.
Allows generation of molecular markers to facilitate progeny selection and molecular genetics research
Ex: 12,000 SSRs in a single RNA-seq analysis of sesame (earlier only 80 SSRs), on average 1 genic SSR per ∼8 kb
(Zhang et al., 2012)
5,234 SNPs in transcriptomes of five winter rye inbred lines, used in a high-throughput SNP genotyping array
(Haseneyer et al., 2011)
• Comparative sequence analysis of radish RNA-seq data and the Brassica rapa genome sequence led to the discovery of 14,641 SSRs
(Wang et al., 2012)
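A minimal sketch of how SSRs (simple sequence repeats) like those counted above could be located in assembled transcript sequences. The function name find_ssrs and the thresholds in MIN_REPEATS are illustrative assumptions, not the settings of the cited studies, and only perfect di-, tri- and tetranucleotide repeats are considered.

```python
import re

# Assumed thresholds (hypothetical): minimum number of repeat units per motif length.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)
```

Running find_ssrs over every unigene and counting the hits per kilobase would give a density figure comparable in spirit to the "1 genic SSR per ∼8 kb" quoted for sesame, although the exact number depends on the thresholds chosen.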
19
RNA Seq application to advance the field of proteomics.
• Effective proteome profiling is generally considered to depend heavily on the availability of a high-quality DNA reference database
• High-throughput mass spectrometry-based protein identification relies on the availability of an extensive DNA sequence database in order to match experimentally determined peptide masses with the theoretical proteome generated by computationally translating transcripts
• RNA-seq-based transcriptome profiling can provide an effective data set for proteomic analysis of non-model organisms
20
“RNA-Seq facilitates the matching of peptide mass spectra with the cognate gene sequence”
• To test this: quantitative analysis of the proteomes of pollen from domesticated tomato (Solanum lycopersicum) and two wild relatives
• RNA-Seq (454 pyrosequencing); >1200 proteins were identified
No major qualitative or quantitative differences were observed in the characterized proteomes when using either a highly curated community database of tomato sequences or the RNA-Seq database
21
Characterizing temporal, spatial, regulatory, and evolutionary transcriptome landscapes
Temporal transcriptome
• RNA-seq is increasingly being adopted to examine transcriptional dynamics
• An analysis of the transcriptome of grape berries during three stages of development identified >6,500 genes that were expressed in a stage-specific manner (Zenoni et al., 2010)
• In radish, >21,000 genes were differentially expressed between two developmental stages of roots, including genes that strongly link root development with starch and sucrose metabolism and with phenylpropanoid biosynthesis.
(Wang et al., 2012)
22
Objective: To understand the molecular mechanisms underlying tuberous root formation and development.
• Radish (R. sativus) cultivar ‘Weixianqing’.
• Samples: hypocotyl (1 cm, 7 DAS) and true root (1 cm, 20 DAS; RLSS, the stage of cortex splitting); 10 seedlings of each were pooled together
• Illumina paired-end sequencing technology, GAII platform (BGI; Shenzhen, China)
• Gene annotation: comparative genome analysis between radish and Brassica rapa. Unigenes were aligned with sequences in the NCBI non-redundant protein (Nr) database, Swiss-Prot protein database, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database and Cluster of Orthologous Groups (COG) database using BLASTx; annotation with the Blast2GO program
23
• Sequence similarity search was conducted against the NCBI Nr (85.51 %), Nt (90.18%) and Swiss-Prot protein databases (54%) using the BLASTx algorithm
• 21,109 unigenes were assigned GO terms.
Functional annotation of all non-redundant unigenes
Gene Ontology classification of assembled unigenes
(9,271; 43.92 %)
24
Transcript differences between RESS and RLSS
13,453
8,389
To understand the functions of the DEGs, all DEGs were mapped to terms in the KEGG database; 29 pathways were significantly enriched.
Enriched categories: carbohydrate, energy, lipid, amino acid, other amino acids, terpenoids and polyketides metabolism, and biosynthesis of other secondary metabolites
Starch and sucrose metabolism (303) and phenylpropanoid biosynthesis (177) were the two predominant groups
20 (starch and sucrose metabolism) and 25 (phenylpropanoid biosynthesis) unigenes were significantly up-regulated and play critical roles in regulating radish tuberous root formation. This also confirms the finding that radish root is rich in carbohydrates and phenolic compounds.
25
EST-derived SSR detection
26
• Previous gene expression studies used EST sequencing, spotted microarrays and Affymetrix GeneChip technology (based on prior sequence knowledge)
• These provide only a fragmented picture of transcript accumulation patterns.
• RNA-Seq applied to 7 tissues (leaf, flower, pod, two stages of pod-shell, root, nodule) and 7 seed development stages of a BC5F5 G. max plant
• Transcript reads compared with the recent genome sequence (assembly Glyma1.01)
• Potential model for future RNA-Seq atlases
27
Mapping of short-read sequences:
• Illumina Genome Analyzer II: produced 5.8-8.9 million 36-bp reads for the 7 non-seed tissues and 2.7-9.6 million 36-bp reads for the seed tissues
• Alignment program GSNAP was used to map the reads to two reference genomes: G. max and Bradyrhizobium japonicum.
• Digital gene expression analysis: 46,430 genes identified as “high confidence” (correlation to full length cDNAs, ESTs, homology, & ab initio methods)
• Of which 41,975 (90.4%) genes were transcriptionally active
Expression and gene structure
Coding regions of transcriptionally inactive genes were smaller and had a lower GC content
28
Tissue-specific analysis of the soybean transcriptome
Hierarchical clustering of transcriptional profiles in 14 tissues.
Clusters: early seed development stages, late seed development stages, aerial tissues, underground tissues
Relative expression levels based on Z-score analysis (Z-scores of 3.4-3.6: more tissue specific)
Z = (X-μ)/sd, where X is the expression of a gene in one tissue, μ is its mean expression across all tissues and sd is the standard deviation
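As a small worked illustration of the Z-score formula on this slide, the sketch below computes Z = (X-μ)/sd for one gene across tissues. The function name tissue_z_scores and the use of the sample standard deviation are assumptions; the slide does not say which sd estimator was used.

```python
from statistics import mean, stdev


def tissue_z_scores(expression):
    """Z-score per tissue for one gene.

    expression: dict mapping tissue name -> expression value (e.g. RPKM).
    Uses the sample standard deviation (an assumption; see lead-in).
    """
    values = list(expression.values())
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # Identical expression everywhere: no tissue stands out.
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (x - mu) / sd for tissue, x in expression.items()}
```

A gene expressed almost exclusively in one tissue receives a large positive Z-score in that tissue and small negative scores elsewhere, which is what makes the score usable as a tissue-specificity measure.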
29
Heatmap of the legume-specific genes
Heatmap of the top 500 highest expressed genes
Some legume-specific genes have tissue-specific transcription
Glyma06g08290
Glyma04g08220 (Oleosin)
Glyma02g01590 (lectin precursor 1)
30
General trends in expression profiles for all genes: tissue-by-tissue comparison (Fisher's exact test)
Higher transcriptional level
Importance: understand gene functions and the molecular processes that occur between two stages.
GOslim analysis: between seed 25 DAF and seed 28 DAF, and between seed 35 DAF and seed 42 DAF, stably expressed: nutrient reservoir activity and urease activity, which are important activities in seed development.
31
Gene structure and tissue-specific gene expression
• Genes expressed in underground tissues have a larger first exon; those expressed in aerial tissues have a higher number of exons.
• Significant differences in total transcribed length among tissues, due to varying intron length.
• No significant relationship between GC content and tissue specificity
32
Boxplot and dendrogram of genes preferentially expressed in seed development
RPKM-normalized, log2-transformed gene expression profiles
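The RPKM normalization and log2 transformation named in this caption can be sketched as follows. RPKM is reads per kilobase of transcript per million mapped reads; the pseudocount added before taking log2 is an assumption introduced here to avoid log2(0), and the function names are illustrative.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_rpkm(read_count, gene_length_bp, total_mapped_reads, pseudocount=1.0):
    """log2-transformed RPKM; the pseudocount (assumed) keeps zero counts finite."""
    return math.log2(rpkm(read_count, gene_length_bp, total_mapped_reads) + pseudocount)
```

For example, 500 reads on a 2,000-bp gene in a library of 5 million mapped reads gives an RPKM of 50.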
33
Summary
• RNA Seq-Atlas provides
• A record of high-resolution gene expression in a set of 14 diverse tissues
• Hierarchical clustering of transcriptional profiles for 14 tissues
• Relationship between gene structure and gene expression
• Tissue-specific gene expression of both the most highly-expressed genes and the genes specific to legumes in seed development and nodule tissues
• A means of evaluating existing gene model annotations for the Glycine max genome
34
Spatial transcriptome
• Most RNA-seq analyses target whole organs, or sets of organs, which inherently prevents the identification of cell or tissue type transcripts, and thus spatially coordinated structural and regulatory gene networks.
• RNA-seq analysis of discrete tissues or cell types provides spatial information and increases the depth of sequence coverage
Ex: >1000 genes are specifically or preferentially expressed in Arabidopsis male meiocytes
• Acquiring tissue or cell-specific samples with any degree of precision and minimal contamination is often technically difficult
35
Methods of isolation of single cells
36
Contd…
• Matas et al. (2011): LCM + RNA-seq (454 pyrosequencing) of the transcriptomes of 5 principal tissues of the developing tomato fruit pericarp.
~21,000 unigenes were identified; more than half (57%) showed ubiquitous expression, while the others showed cell type-specific expression
• Takacs et al. (2012): LCM + RNA-seq (Illumina-based NGS) study of the ontogeny of the maize shoot apical meristem
59% of genes were expressed ubiquitously
• A number of mammalian tissues have also shown a high proportion of ubiquitously expressed transcripts
“These studies may indicate that this is a common feature of eukaryotes” (Ramsköld et al., 2009).
37
To study plant responses and adaptations to abiotic and biotic stresses
Aim: Elucidate genes and gene networks that contribute to sorghum’s tolerance to water-limiting environments, with a long-term aim of developing strategies to improve plant productivity under drought.
Discovered >50 previously unknown drought-responsive genes.
38
Transcript Analysis in Response to ABA and Osmotic Stress
Method: 20 μM; 8th day; 57.1 μM; 20% PEG-8000
Treatment   Up       Down
ABA         ~2,300   ~2,600
PEG         ~1,650   ~700
ABA acts in the plant stress response and has a central role in other pathways (dormancy in leaf and seed)
Figure labels: LEA protein, WSI18 protein, dehydrin, sugar substrate transporter, peroxidase 6
39
Networks of hormone pathways in ABA-treated plants (shoots and roots)
A, brassinosteroid biosynthesis; B, cytokinins degradation; C, cytokinins glucoside biosynthesis; D, ent-kaurene biosynthesis; E, ethylene biosynthesis from methionine; F, gibberellin biosynthesis; G, gibberellin inactivation; H, IAA conjugate biosynthesis; I, jasmonic acid
Legend: box = hormone-related; circle = non-hormone-related; down- and up-regulated DE genes; dark blue solid lines = ≥10; blue long-dashed lines = 6-9; light blue short-dashed lines = ≤5
Only the brassinosteroid and JA biosynthesis pathways, and the cytokinin glucoside and IAA conjugate biosynthesis pathways, are directly connected via DE genes.
Indirect ‘cross-talk’ between the various hormones in response to osmotic stress and ABA
40
Determining the genes of unknown function that respond to drought or ABA treatment across species
Decision tree used to determine which genes and their orthologs were regulated by drought/ABA across different species
Overlap of drought-responsive sorghum genes of unknown function that had drought-responsive orthologs of unknown function in other species
41
(51) (82)
(183)
• RNA-seq was used to reveal massive changes in the metabolism and cellular physiology of the green alga Chlamydomonas reinhardtii when the cells become deprived of sulfur
• Studies of plant responses to pathogens
Ex: sorghum and Bipolaris sorghicola (Mizuno et al., 2012)
• Complexities of the metabolic pathways associated with plant defense mechanisms
42
Study plant evolution and polyploidy.
• A comparison of the leaf transcriptome of an allopolyploid relative of soybean with those of the two species that contributed its homoeologous genomes allowed the contribution of the different genomes to the transcriptome to be determined (Ilut et al., 2012)
• Maize endosperm transcriptome analysis discovered 179 imprinted genes and 38 imprinted long ncRNAs (Zhang et al., 2011)
• Transcriptomes of 9 distinct tissues of three species of the Poaceae family (Brachypodium, sorghum and rice) were compared to determine whether orthologous genes from these three species exhibit the same expression patterns (Davidson et al., 2012)
Only a fraction of orthologous genes exhibit conserved expression patterns
Orthologs in syntenic genomic blocks are more likely to share correlated expression patterns than non-syntenic orthologs.
These findings are important for crop improvement (seq transfer)
43
Classification of Brachypodium, rice, and sorghum genes into orthologous groups (OrthoMCL)
Red: single copy (2 x 2 & 3 x 3)
Black: multicopy (2 x N & 3 x N)
Hierarchical clustering of 27 tissues (9 tissues x 3 species) based on correlations of log2 FPKM mapped expression values of 3-taxa single-copy (3 x 3) genes
Clustering of corresponding tissues
Extensive expression divergence within 3 x 3 genes
44
Co-expression analyses identify conservation of expression among orthologs and paralogs
Genes within each k-means co-expression cluster were categorized based on OrthoMCL category assignments or as lineage-specific single-copy (1 x 1) genes.
Proportions of genes with at least one corresponding paralog or ortholog in the same cluster
A portion of Poaceae orthologs and paralogs share the same expression patterns across reproductive tissues
Some genes exhibited different expression phenotypes
45
Similar expression pattern in Poaceae
Which biological processes were over-represented in each ortholog/paralog category?
Ortholog/paralog category: Gene Ontology (GO) annotation
2 x N genes: stress-related functions (‘response to biotic stimulus’, ‘defense response’, ‘apoptosis’), lipid transport, secretion (‘exocytosis’), and general oxidation–reduction reactions
3 x N genes (higher substitution rates): core metabolic functions (‘translation’, ‘ATP biosynthesis’, ‘nucleosome assembly’, ‘biosynthetic process’), ‘oxidation–reduction’, ‘response to wounding’, ‘sexual reproduction’
3 x 3 genes: essential functions: ‘regulation of transcription’ (>1000 genes), ‘protein folding’ (253 genes), ‘intracellular protein transport’ (123 genes), and ‘glycolysis’ (91 genes)
2 x 2 genes: ‘protein amino acid phosphorylation’, ‘regulation of transcription’ and ‘response to oxidative stress’
46
Relationship between synteny and expression patterns of orthologs
Syntenic gene pairs within collinear blocks of at least five genes were identified for all pairwise combinations of the three Poaceae species
Distributions of Pearson’s correlation coefficients (PCC)
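For reference, a Pearson correlation coefficient between the expression profiles of two orthologs (one value per tissue) can be computed as in the sketch below. The function name pearson is illustrative, and the profiles are assumed to be already on a comparable scale (for example log2 FPKM).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

Computing this value for every ortholog pair, separately for syntenic and non-syntenic pairs, yields the kind of PCC distributions compared on this slide.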
Synteny plays a significant role in the evolution of gene expression, especially in the case of duplicate and multicopy genes
47
Identifying and characterizing novel non-coding RNAs
• In silico analysis provides a rapid way to identify putative sRNA genes
• RNA-seq technology represents an excellent means for sRNA discovery and validation
• Characterization of the regulatory functions of miRNAs is facilitated by determining their tissue-specific expression patterns
• RNA-seq was used to identify sRNAs from five Arabidopsis root tissues.
Some sRNAs were expressed in all 5 tissues, while others were tissue and developmental zone specific
• The frequency of alternative splicing at miRNA binding sites is significantly higher than that at other regions, suggesting that alternative splicing is a significant regulatory mechanism.
• sRNAs have been recently characterized in the context of association with epigenome modifications, including cytosine methylation of genomic DNA
48
From co-expression networks to integrative data analysis
• Sequencing whole transcriptomes provides a high degree of detail, but deriving useful biological information from a long list of expressed genes is typically not trivial
• One approach is to construct networks of co-expressed genes and to use gene ontology (GO) information to help highlight important gene candidates as critical components of functional networks
• Gene ontology enrichment analysis of RNA-seq data often illustrates the complexity of interacting pathways
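A GO enrichment test of the kind mentioned above is often framed as a one-sided hypergeometric test: given a background of N genes of which K carry a GO term, how likely is it to see at least k annotated genes in a list of n genes of interest? The sketch below is one common formulation, not necessarily the test used in the studies discussed here; the function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(background_size, annotated_in_background, list_size, annotated_in_list):
    """Upper-tail hypergeometric probability P(X >= annotated_in_list).

    background_size: all genes considered (N)
    annotated_in_background: genes carrying the GO term (K)
    list_size: genes of interest, e.g. a co-expression cluster (n)
    annotated_in_list: genes of interest carrying the GO term (k)
    """
    upper = min(annotated_in_background, list_size)
    favourable = sum(
        comb(annotated_in_background, i)
        * comb(background_size - annotated_in_background, list_size - i)
        for i in range(annotated_in_list, upper + 1)
    )
    return favourable / comb(background_size, list_size)
```

In practice many GO terms are tested at once, so the resulting p-values would need a multiple-testing correction before any term is called enriched.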
Robust functional networks: transcriptome (RNA-seq), proteomics, metabolomics
No correlation, ex: soybean protein
Correlation, ex: oil palm mesocarp fatty acid
49
Bulked Segregant RNA-Seq
SNP 2 is closely linked to the mutation to be mapped
Linkage disequilibrium between markers and the causal gene is determined by quantifying the allelic frequencies between two samples
Advantages:
(i) Having a reference genome is not a prerequisite
(ii) Markers can be generated from the experimental data
(iii) Differential expression profiles
(iv) Information on the effects of the mutant on global patterns of gene expression
(v) Provides the map position of a gene
Liu et al. (2012)
BSA requires polymorphic markers
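The allele-frequency comparison between the two bulks can be illustrated with a very simple sketch. It only computes, for one SNP, the alternative-allele frequency in each pool and the difference between pools; it is not the linkage-probability model of Liu et al. (2012), and the function names and the (ref_reads, alt_reads) input layout are assumptions.

```python
def allele_frequency(ref_reads, alt_reads):
    """Fraction of reads carrying the alternative allele, or None without coverage."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


def pool_frequency_difference(mutant_counts, wildtype_counts):
    """Absolute difference in alternative-allele frequency between the two bulks.

    Each argument is a (ref_reads, alt_reads) pair for a single SNP.
    Returns None if either pool has no reads at the SNP.
    """
    f_mutant = allele_frequency(*mutant_counts)
    f_wildtype = allele_frequency(*wildtype_counts)
    if f_mutant is None or f_wildtype is None:
        return None
    return abs(f_mutant - f_wildtype)
```

A SNP close to the causal gene is expected to show a large difference between pools, whereas an unlinked SNP should show similar frequencies in both.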
50
>64,000 SNPs
Two alleles of a given SNP site should be detected in approximately equal numbers of RNA-Seq reads when considering both pools of RNA-Seq data.
Only one allele of a SNP that is completely linked to the causal gene should be present among the RNA-Seq reads from the mutant pool.
In practice, as a consequence of allele-specific expression and of sampling bias for genes expressed at low levels, a single allele of many SNPs is detected in the mutant pool.
An empirical Bayesian approach was used to estimate the linkage probability, i.e. the probability of a SNP exhibiting complete linkage disequilibrium with the causal gene.
51
gl3-ref allele in a non-B73/B73
53
Reference genome
The top 10 windows with the highest median linkage probability were located at physical position ~183.5–185.2 Mb.
Fine mapping of the gene:
1. Mutant gene expression will often be down-regulated compared to the WT pool.
2. Collections of SNPs tightly linked to the mutant gene
3. SNPs linked to the mutated gene can be used for gene cloning via chromosome walking.
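The window ranking mentioned on this slide can be sketched as below: SNPs are grouped into windows, each window is summarized by the median linkage probability of its SNPs, and the highest-scoring windows are reported. The window definition (non-overlapping runs of a fixed number of consecutive SNPs), the default window size and the function name are assumptions made for illustration, not the published procedure.

```python
from statistics import median


def top_windows(snps, window_size=50, top_n=10):
    """Rank windows of consecutive SNPs by median linkage probability.

    snps: iterable of (position_bp, linkage_probability) on one chromosome.
    Returns up to top_n tuples (median_probability, start_bp, end_bp),
    highest median first. Windows are non-overlapping (an assumption).
    """
    snps = sorted(snps)
    windows = []
    for i in range(0, len(snps) - window_size + 1, window_size):
        chunk = snps[i:i + window_size]
        window_median = median(prob for _, prob in chunk)
        windows.append((window_median, chunk[0][0], chunk[-1][0]))
    windows.sort(key=lambda w: w[0], reverse=True)
    return windows[:top_n]
```

The span covered by the top-ranked windows then gives a candidate interval for the causal gene.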
54
• Not necessary to use tissue with mutant gene expression for BSR-Seq.
• However, if we collect tissue with expression we can also get additional expression data.
Resolution of mapping depends on:
1. Number of individuals included in the bulks
2. Sequencing depth
3. Density of polymorphisms in the mapping population
55
• International multi-disciplinary consortium; transcriptome data for 1,000 plant species
• It is a public-private partnership (PPP) project: 75% of the funding comes from the Government of Alberta and 25% from Musea Ventures. BGI-Shenzhen provides sequencing at reduced cost, and the iPlant Collaborative provides computational informatics.
• Objectives:
1. To resolve many of the lingering uncertainties in species relationships, especially in the early lineages of streptophyte green algae and land plants
2. To identify gene changes associated with the major innovations in Viridiplantae evolution, such as multi-cellularity, transitions from marine to freshwater or terrestrial environments, maternal retention of zygotes and embryos, complex life history involving haploid and diploid phases, vascular systems, seeds and flowers
• Species selection: representatives of all major lineages across the Viridiplantae (green plants), representing ~1 billion years of evolution, including flowering plants, conifers, ferns, mosses and streptophyte green algae.
56
Resources available
1. Access to raw and processed data:
Content; transcriptome assemblies, putative coding sequences, orthogroups and gene and species trees with related sequence alignments.
2. High performance computing and cloud-based services:
iPlant discovery environment (DE) web interface (tutorials and teaching materials available)
57
Interactions & pathways
Phenylpropanoid synthesis pathway for Colchicum autumnale. Labelled rectangles are proteins. Small circles are metabolites. Black lines show the KEGG pathway. Red lines show the BioGRID interactions emanating from protein (K12355), which was interactively selected. A right-click on the protein will display the inferred function and a link to the sequence(s)
58
Conclusion
• RNA-sequencing is now well established as a versatile platform with applications in an ever-growing number of fields of plant biology research
• Ongoing developments in sequencing technologies, such as increased read lengths and greater numbers of reads per run, together with advanced computational tools to facilitate sequence assembly, analysis, and integration with orthogonal data sets, will further accelerate the breadth and frequency of its adoption by plant scientists
59