An Introduction to RNA-Seq Transcriptome Profiling with iPlant
Before we start: Align sequence reads to the reference genome
The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome.
Overview: This training module is designed to demonstrate a workflow in the iPlant Discovery Environment using RNA-Seq for transcriptome profiling.
Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?
RNA-seq in the Discovery Environment
Scientific Objective
LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).
Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response.
We will use RNA-seq to compare WT and hy5 to identify HY5-regulated genes.
Source: http://www.gla.ac.uk/media/media_73736_en.jpg
Samples
• Experimental data downloaded from the NCBI Short Read Archive (GEO:GSM613465 and GEO:GSM613466)
• Two replicate RNA-seq runs each for wild-type and hy5 mutant seedlings.
RNA-Seq Conceptual Overview
Image source: http://www.bgisequence.com
RNA-seq Sample Read Statistics
• Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).
• Reads retained by TopHat are shown below
Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3
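As a quick sanity check on the table, dividing the sequenced bases by the number of reads should give the read length. The sketch below assumes that "Mbase" means 10^6 bases; the function name is only illustrative.

```python
# Read counts and sequenced megabases per run, copied from the table above.
runs = {
    "WT-1": (10_866_702, 445.5),
    "WT-2": (10_276_268, 421.3),
    "hy5-1": (13_410_011, 549.8),
    "hy5-2": (12_471_462, 511.3),
}


def mean_read_length(reads, mbase):
    """Average bases per read (assumes 1 Mbase = 1,000,000 bases)."""
    return mbase * 1_000_000 / reads


for name, (reads, mbase) in runs.items():
    print(f"{name}: {mean_read_length(reads, mbase):.1f} bases per read")
```

All four runs come out at about 41 bases per read, which agrees with the length=41 field in the FASTQ headers of this data set.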
RNA-Seq Data
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/[…]
[…] HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:[…]
[…] HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
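Each FASTQ record has four fields: a header starting with "@", the sequence, a "+" separator, and one quality character per base. Here is a minimal sketch of reading such a record in Python. It assumes Sanger encoding (Phred score = ASCII code minus 33) and uses the first read shown above. The function name read_fastq is only illustrative, and a real pipeline would normally use an established parser.

```python
from io import StringIO

# First record from the data above, written one field per line.
EXAMPLE = """@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
"""


def read_fastq(handle):
    """Yield (header, sequence, phred_scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Assumption: Sanger encoding, so Phred score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


for header, sequence, scores in read_fastq(StringIO(EXAMPLE)):
    low = sum(1 for s in scores if s <= 2)
    print(header.split()[0], len(sequence), "bases,", low, "with Phred <= 2")
```

The trailing run of "#" characters decodes to Phred 2, the lowest score in this read, so the 3' end of the read carries little reliable base information.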
…Now What?
Bioinformatician
The Tuxedo Protocol
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,\
./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
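The command list follows a regular pattern: one tophat and one cufflinks run per sample, then a single cuffmerge and cuffdiff over all of them. As a rough sketch, the lines can be generated rather than typed, which avoids copy-and-paste slips in the file names. The helper names below are hypothetical. The script only prints text and does not run any tool. The contents of assemblies.txt are not shown on the slide, so the listing here assumes one Cufflinks transcripts.gtf path per sample.

```python
conditions = ["C1", "C2"]
replicates = ["R1", "R2", "R3"]


def tophat_command(cond, rep, threads=8):
    """TopHat line for one paired-end sample, using that sample's own mate files."""
    sample = f"{cond}_{rep}"
    return (f"tophat -p {threads} -G genes.gtf -o {sample}_thout genome "
            f"{sample}_1.fq {sample}_2.fq")


def cufflinks_command(cond, rep, threads=8):
    """Cufflinks line that assembles transcripts from one TopHat alignment."""
    sample = f"{cond}_{rep}"
    return (f"cufflinks -p {threads} -o {sample}_clout "
            f"{sample}_thout/accepted_hits.bam")


def assemblies_lines():
    """Assumed assemblies.txt content: one transcripts.gtf path per sample."""
    return [f"./{c}_{r}_clout/transcripts.gtf"
            for c in conditions for r in replicates]


def cuffdiff_command(threads=8):
    """Cuffdiff line: replicates joined by commas, conditions separated by spaces."""
    groups = [",".join(f"./{c}_{r}_thout/accepted_hits.bam" for r in replicates)
              for c in conditions]
    return (f"cuffdiff -o diff_out -b genome.fa -p {threads} "
            f"-L {','.join(conditions)} -u merged_asm/merged.gtf "
            + " ".join(groups))


for c in conditions:
    for r in replicates:
        print(tophat_command(c, r))
        print(cufflinks_command(c, r))
print("\n".join(assemblies_lines()))
print(cuffdiff_command())
```

Note how cuffdiff groups its inputs: BAM files joined by commas are replicates of one condition, and a space starts the next condition, in the same order as the labels given to -L.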
Your RNA-Seq Data
Your transformed RNA-Seq Data
RNA-Seq Analysis Workflow
TopHat (Bowtie)
Cufflinks
Cuffmerge
Cuffdiff
CummeRbund
Your Data
iPlant Data Store
FASTQ
Discovery Environment
Atmosphere
The iPlant Discovery Environment
Import SRA data from NCBI SRA
Extract FASTQ files from the downloaded SRA archives
Getting the RNA-Seq Data
Staged Data
Examining Data Quality with fastQC
TopHat
TopHat in the Discovery Environment
Align the four FASTQ files to the Arabidopsis genome using TopHat
Align Reads to the Genome
TopHat
• TopHat is one of many applications for aligning short sequence reads to a reference genome.
• It uses the Bowtie aligner internally.
• Alternatives include BWA, MAQ, OLego, Stampy and Novoalign.
ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare to the gene on the right, which lacks differential expression.
Assembling the Transcripts
Cufflinks in the Discovery Environment
Cufflinks
Merging the Transcriptomes
Cuffmerge in the Discovery Environment
Cuffmerge
Comparing wild-type to hy5 transcriptomes
Cuffdiff in the Discovery Environment
Cuffdiff
Cuffdiff Results
Differentially expressed genes
Example filtered CuffDiff results generated with Filter_CuffDiff_Results to:
1) Select genes with a minimum two-fold expression difference
2) Select genes with significant differential expression (q <= 0.05)
3) Add gene descriptions
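The first two filter criteria can be sketched in a few lines of Python. This is not the Filter_CuffDiff_Results code itself, only an illustration of the same thresholds applied to a table in the column layout of Cuffdiff's gene_exp.diff output. The gene names and numbers are invented, the function name is hypothetical, and step 3 (adding gene descriptions) is left out.

```python
import csv
from io import StringIO

HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]

# Hypothetical rows for illustration only; not real results.
ROWS = [
    ["geneA", "geneA", "-", "1:100-900", "WT", "hy5", "OK",
     "80.0", "8.0", "-3.32193", "-6.1", "0.0001", "0.002", "yes"],
    ["geneB", "geneB", "-", "1:2000-2900", "WT", "hy5", "OK",
     "10.0", "12.0", "0.263034", "0.4", "0.6", "0.8", "no"],
    ["geneC", "geneC", "-", "2:100-900", "WT", "hy5", "OK",
     "5.0", "11.0", "1.13750", "2.0", "0.04", "0.2", "no"],
    ["geneD", "geneD", "-", "3:100-900", "WT", "hy5", "NOTEST",
     "0.0", "0.1", "inf", "0", "1", "1", "no"],
]
TEXT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def filter_diff(handle, min_abs_log2fc=1.0, max_q=0.05):
    """Keep successfully tested genes with >= two-fold change and q <= max_q."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip genes that were not tested successfully
        log2fc = float(row["log2(fold_change)"])
        q = float(row["q_value"])
        # A two-fold difference in either direction is |log2 fold change| >= 1.
        if abs(log2fc) >= min_abs_log2fc and q <= max_q:
            kept.append((row["gene_id"], log2fc, q))
    return kept


for gene_id, log2fc, q in filter_diff(StringIO(TEXT)):
    print(gene_id, log2fc, q)
```

In this toy table only geneA passes: geneB changes too little, geneC changes more than two-fold but has q > 0.05, and geneD was not tested.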
Density Plot
Scatter Plot
Volcano Plot
Expression Plots
Cloud Computing with iPlant Atmosphere
Launch a Virtual Server (in the Cloud!)
You now have your very own virtual Linux server
Expression Plots: Open a terminal and launch R
Expression Plots: Demonstration