KOGO 2013 RNA-seq analysis
RNA-SEQ ANALYSIS
고준수, 송상훈, 김현민
Theragen Bio Institute (테라젠 바이오 연구소), 2012. 2. 5
CONTENTS
• NGS
• RNA-seq
• File Format
• Workflow
• Preparation
• Filtering & QC
• Mapping
• PCR Duplication
• Expression
• DEG
• Report
TODAY’S KEYWORDS
NGS: Illumina, Paired-End
RNA-seq: mRNA, Reference-based
Mapping: TopHat
Expression: Cufflinks, Cuffmerge
DEG: Cuffdiff, DESeq
Design: Replicates
File Format: Fastq, BAM
NEXT-GENERATION SEQUENCING
SEQUENCING
Sanger (1st Generation)
NEXT-GENERATION SEQUENCING
Nat Rev Genet. 2010 Jan;11(1):31-46. doi: 10.1038/nrg2626. Epub 2009 Dec 8. Sequencing technologies - the next generation. Metzker ML.
2nd Generation
3rd Generation
NGS WEAKNESS AND OVERCOMING
Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341.
Sanger 0.001%
Nature Biotechnology 26, 1135 - 1145 (2008), Next-generation DNA sequencing, Shendure J. and Ji H.
NGS
http://users.ugent.be/~avierstr/nextgen/nextgen.html
Library Construction Sequencing
RawReads
GENERAL NGS ANALYSIS PROCESS
Shearer AE, Hildebrand MS, Sloan CM, Smith RJ. Deafness in the genomics era. Hear Res. 2011 Dec;282(1-2):1-9. doi: 10.1016/j.heares.2011.10.001. Epub 2011 Oct 8.
Mapping1 WGS
Low depth < NT < High depth
3
Depth(Coverage)
2
Coverage
Speed
MAPPING TOOLS
• Mapper Type
• DNA
• RNA
• miRNA
• bisulphite
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.
BWA for WGS
TopHat for RNA
PCR DUPLICATION
http://www.clcbio.com/clc-plugin/duplicate-reads-removal-plugin/
remove
ILLUMINA PAIRED-END
Quinlan AR, Boland MJ, Leibowitz ML, Shumilina S, Pehrson SM, Baldwin KK, Hall IM. Genome sequencing of mouse induced pluripotent stem cells reveals retroelement stability and infrequent DNA rearrangement during reprogramming. Cell Stem Cell. 2011 Oct 4;9(4):366-73. doi: 10.1016/j.stem.2011.07.018.
Haas BJ, Zody MC.Advancing RNA-Seq analysis.Nat Biotechnol. 2010 May;28(5):421-3. doi: 10.1038/nbt0510-421.
mate-pair inner distance
http://vallandingham.me/RNA_seq_differential_expression.html
http://users.ugent.be/~avierstr/nextgen/nextgen.html
fastq_1
fastq_2
SUMMARY
• NGS platform : Short Reads, Depth, Coverage
• Sequencing Protocol
• Analysis Protocol
• Mapping
• PCR duplication
• Illumina Paired-end
TRANSCRIPTOME RNA-SEQ
TRANSCRIPTOME
• The complete set of transcripts in a cell, and their quantity
• The key aims of transcriptomics are:
• to catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs
• to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications
• to quantify the changing expression levels of each transcript during development and under different conditions.
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
ADVANTAGES OF RNA-SEQ
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
RNA-SEQ & MICROARRAY
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
RNA-SEQ
• Gene expression level
• Relative expression level in sample
• Differentially expressed gene
• Identification of alternative spliced transcripts
• Prediction of novel transcripts
• Gene Fusion
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.
RNA-SEQ VS. DNA-SEQ
Item | RNA-seq | DNA-seq
Methods | Reference-based, de novo assembly | WES, WGS re-sequencing, WGS de novo
Goal | Expression, differentially expressed genes, novel transcripts, alternative splicing forms, gene fusion | SNPs, indels, SV
Measure | Mapped read count | Base accuracy
OVERVIEW OF A TYPICAL RNA-SEQ
RNA MAPPING
Trapnell C, Salzberg SL., How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.
Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11(12):220.
MAPPER
Mapper Data Seq.Plat. Input Output Cit. Cit/years Reference
MapSplice RNA I FASTA/Q SAM, BED 50 28.17 Wang et al. (2010)
MicroRazerS miRNA N FASTA/Q SAM, TSV 7 2.75 Emde et al. (2010)
mrFAST miRNA I FASTA/Q SAM 158 58.34 Alkan et al. (2009)
mrsFAST miRNA I,So FASTA/Q SAM 32 18.03 Hach et al. (2010)
Passion RNA I,4,Sa,P FASTA/Q BED - - Zhang et al. (2012)
PatMaN miRNA N FASTA TSV 38 9.36 Prufer et al. (2008)
QPALMA RNA I,4 Specific TSV 75 21.11 De Bona et al. (2008)
RNA-Mate RNA So CFASTA BED, Counts 28 10.04 Cloonan et al. (2009)
RUM RNA I,4 FASTA/Q SAM,TSV,BED 2 2.36 Grant et al. (2011)
SOAPSplice RNA I,4 FASTA/Q TSV 3 3.54 Huang et al. (2011)
SpliceMap RNA I FASTA/Q SAM, BED 63 29.80 Au et al. (2010)
Supersplat RNA N FASTA TSV 21 9.93 Bryant Jr et al. (2010)
TopHat RNA I FASTA/Q, GFF BAM 389 121.04 Trapnell et al. (2009)
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.
The number of citations (Cit.) was obtained from Google Scholar on April 14, 2012
ANALYSIS STRATEGIES
Item | Reference-based | De novo
Method | Uses a reference genome; the transcriptome assembly is built upon it | Does not use a reference genome
Adv. | Contamination or sequencing artefacts are not a major concern; very sensitive and can assemble transcripts of low abundance; can discover novel transcripts that are not present in the current annotation | Does not depend on a reference genome; does not depend on the correct alignment of reads to known splice sites or the prediction of novel splice sites; trans-spliced transcripts can be assembled
Disadv. | Depends on the quality of the reference genome being used | Requires more computing resources; sensitive to sequencing errors
Depth | ~10x | > 30x
Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.
REFERENCE-BASED
Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.
REFERENCE-BASED
Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.
SUMMARY
• Transcriptome
• RNA-seq advantages
• Process
• Analysis strategies
• Reference-based method
NGS FILE FORMAT
FILE FORMAT
• NGS
• Fastq
• SAM/BAM
• VCF
• Reference
• Fasta
• GTF / GFF
S01_1.fq
S01_2.fq
FASTQ FORMAT
• De facto standard file format for raw reads
• Extensions: fq, fastq, fq.gz, fastq.gz
• Four lines per read
1: @title (identifier and description)
2: Sequence
3: + (optional description)
4: Quality values
Paired-end
Sequencer
Fastq
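The four-line record layout described for FASTQ can be read with a few lines of Python. This is a minimal sketch, not the parser used by any tool in this deck; the read names and sequences in the example are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    title: str      # line 1 without the leading "@"
    sequence: str   # line 2
    quality: str    # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block."""
    while True:
        title = handle.readline()
        if not title:
            return  # end of file
        if not title.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + title.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + title.strip())
        yield FastqRecord(title[1:].strip(), sequence, quality)


# Hypothetical two-read example (not taken from S01_1.fq)
example = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n@read2/1\nGGCC\n+\n!!II\n")
records = list(read_fastq(example))
print(len(records), records[0].title, records[0].sequence)
```

For compressed input such as `S01_1.fq.gz`, the same function should work on a handle opened with `gzip.open(path, "rt")`.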
QUALITY SCORE
• The base-calling error probabilities.
• Types
• Phred33 / Illumina 1.8+ : score 0 ~ 93, ASCII 33 ~ 126
• Solexa / Illumina 1.0 : score -5 ~ 62, ASCII 59 ~ 126
• Phred64 / Illumina 1.3 ~ 1.5 : score 0 ~ 62, ASCII 64 ~ 126
http://www.asciitable.com
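A quality character is decoded by subtracting the encoding offset (33 or 64) from its ASCII code, and a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below illustrates this; `guess_encoding` is a hypothetical helper based only on the ASCII ranges listed above, and it cannot separate encodings when a file contains only high-quality characters. Solexa scores use the same offset as Phred64 but a slightly different probability formula, which is not handled here.

```python
from typing import Iterable, List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score: float) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def guess_encoding(quality_strings: Iterable[str]) -> str:
    """Heuristic guess from the lowest ASCII code observed (an assumption, not a standard)."""
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    if lowest < 59:
        return "phred33"   # only Phred33 uses ASCII 33-58
    if lowest < 64:
        return "solexa"    # ASCII 59-63 implies negative Solexa scores
    return "phred64"       # ambiguous in principle; Phred33 data could also look like this


print(phred_scores("I#!"))
print(error_probability(20))
print(guess_encoding(["IIII#", "II5I"]))
```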
SAM / BAM FORMAT
• SAM stands for Sequence Alignment/Map format.
• TAB-delimited text format
• 11 mandatory fields
[Figure: Sequencer → Fastq → Mapper → SAM/BAM; a SAM record annotated with read name, flag, reference, position, quality, position of mate, and length]
SAM / BAM
Flag
CIGAR
SAM
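The FLAG and CIGAR fields are the two SAM columns that usually need decoding. The sketch below uses the bit values and operation letters from the SAM specification; the function names and the example values are illustrative only.

```python
import re
from typing import List, Tuple

# Bit meanings as defined in the SAM specification
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered; N is a skipped region such as an intron."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("50M1000N50M"), reference_span("50M1000N50M"))
```

A spliced RNA-seq read typically shows up as an `N` operation in the CIGAR string, as in the second example.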
TOOLS FOR SAM/BAM
• Samtools
• index
• view
• sort
• faidx
• flagstat
• tview
• mpileup
• Picard
• SortSam
• MarkDuplicates
• ......
GTF (ENSEMBL)
protein_coding, mtRNA, miRNA, lincRNA, pseudogene......
Gene ID Transcript ID
SUMMARY
• Fastq format
• de facto standard
• Quality Score
• Phred33/Illumina 1.8+, Illumina 1.0, Phred64/Illumina 1.3~1.5
• SAM/BAM format
• GTF
WORKFLOW
REFERENCE
REFERENCE WORKFLOW
[Figure: Sample1 and Sample2 reads → TopHat (mapped reads) → Cufflinks (assembled transcripts) → Cuffmerge (final transcriptome assembly) → Cuffdiff (differential expression results) → CummeRbund (expression plots)]
Tools by layer: Picard, Samtools; RSeQC, FastQC; cummeRbund, GO; Cuffdiff, DEGseq, DESeq; Cufflinks, HTseq-count, Cuffmerge; TopHat, RUM, BWA; Bowtie2; TBI-toolkit
OUR WORKFLOW
[Figure: Samples, Reference, Geneset → Filtering → Read Mapping → Duplication → Gene Structure → Expression Level → DEG analysis → Report; Annotation: UniProt, GO, KEGG]
PREPARATION
DIRECTORY
/KOGO/RNA-seq
  ref : chr.fa, ens.gtf, mask.gtf
  inputs : S01.fq.gz, S02.fq.gz
  outputs
    S01 : accepted_hits.bam, transcripts.gtf
    ...... : accepted_hits.bam, transcripts.gtf
    merged_asm : merged.gtf, transcripts.gtf
    Diff-S01-S02 : gene_exp.diff, isoforms_exp.diff
  scripts
  Tools
SAMPLES
Horse | Before exercise | After exercise
Horse 1 | S01 | S02
Horse 2 | S03 | S04
TOOLS
Category | Programs | Version | Homepage
QC | FastQC | 0.10.1 | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Mapper | Bowtie2 | 2.0.5 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Mapper | TopHat | 2.0.7 | http://tophat.cbcb.umd.edu
Abundance | Cufflinks | 2.0.2 | http://cufflinks.cbcb.umd.edu
Abundance | HTseq-count | - | http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Abundance | DESeq | 1.10.1 | http://bioconductor.org/packages/release/bioc/html/DESeq.html
Annotation | goseq | 1.10.0 | http://www.bioconductor.org/packages/2.11/bioc/html/goseq.html
Tools | samtools | 0.1.18 | http://samtools.sourceforge.net
Tools | picard | 1.83 | http://picard.sourceforge.net
Tools | TBI-toolkit | 0.1 | http://dev.totalomics.kr
Tools | R | 2.15.0 | http://www.r-project.org
Tools | Gnuplot | - | http://www.gnuplot.info
TBI-TOOLKIT
• TBI NGS Toolkit
• http://dev.totalomics.kr
• Application
• TBI-toolkit-qscore
• TBI-toolkit-fq_filter
• TBI-toolkit-gtf_selector
• TBI-toolkit-fa_spliter
• TBI-toolkit-make_matrix
REFERENCE
• Reference-based strategy
Name | FileType | Description
Reference | fasta | Genome sequence
Geneset | GTF2.2/GFF3 | Reference geneset
Optional
Name | Source | Description
Mask Geneset | Geneset | Geneset that has ncRNA information (rRNA, tRNA, and other ncRNA)
Bowtie2 Index | Reference | Index files for running bowtie2
GO information | GO | Gene ontology information for GO enrichment
REFERENCE SOURCE
• Ensembl (http://www.ensembl.org)
• General file format for all species
• Geneset (GTF format)
• Constant Database schema for all species
• Comprehensive Annotation (GO, InterPro, Pfam, Prosite, Smart, ...... )
• Automated Update
• UCSC (http://genome.ucsc.edu)
• Semi general file format for all species
• Semi constant Database schema for all species
• Gene table dump (BED format compatible)
• Annotation (Pfam, Kegg)
• Comparative Analysis
• NCBI
• Raw data bank
• GFF type geneset file
ENSEMBL
ensembl.org plants.ensembl.org fungi.ensembl.org
metazoa.ensembl.org protists.ensembl.org bacteria.ensembl.org
ENSEMBL
• Homo sapiens ( ftp://ftp.ensembl.org/pub/release-69 )
• fasta/homo_sapiens/
• dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz
• dna/Homo_sapiens.GRCh37.69.dna.chromosome.1.fa.gz
• cdna/Homo_sapiens.GRCh37.69.cdna.all.fa.gz
• gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz
• mysql/homo_sapiens_core_69_37/
• Arabidopsis thaliana ( ftp://ftp.ensemblgenomes.org/pub/release-16/plants )
• fasta/arabidopsis_thaliana
• dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz
• cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz
• gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz
• mysql/arabidopsis_thaliana_core_16_69_10/
chr.fa
ens.gtf
PRE-PROCESSING
• Check quality score type of input file
• Reference files
• Reference index
• Mask geneset
SAMPLE QUALITY SCORE
Usage)
$ TBI-toolkit-qscore [FASTQ]
Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-qscore S01_1.fq.gz
Sanger(Phred33) or Illumina 1.8+
0 to 93 using ASCII 33 to 126
0:!, 1:", 2:#, 3:$, 4:%, 5:&, ......
REFERENCE INDEX
Index for bowtie2 mapper
Usage)
$ bowtie2-build [options] <reference_in> <bt2_base>
Run)
$ cd /KOGO/RNA-seq/ref
$ bowtie2-build chr.fa chr.fa
$ ls
chr.fa.1.bt2 chr.fa.2.bt2 ......
Fasta index
Usage)
$ samtools faidx <ref.fasta>
Run)
$ cd /KOGO/RNA-seq/ref
$ samtools faidx chr.fa
$ ls
chr.fa.fai
MASK GENESET
Usage)
$ TBI-toolkit-gtf_selector [IN GTF] [OUT GTF] [Source 1] [Source 2] ......
Run)
$ cd /KOGO/RNA-seq/ref
$ TBI-toolkit-gtf_selector ens.gtf mask.gtf tRNA rRNA Mt_tRNA Mt_rRNA
"...... We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates."
Cufflinks manual (http://cufflinks.cbcb.umd.edu/manual.html)
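The internals of TBI-toolkit-gtf_selector are not shown in these slides. Assuming it keeps the GTF records whose second column (source) matches one of the names given on the command line, an equivalent filter could be sketched as follows; the function name and the example records are hypothetical.

```python
from typing import Iterable, Iterator, List


def select_gtf_by_source(lines: Iterable[str], sources: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose source column (column 2) is in `sources`."""
    wanted = set(sources)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header or comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[1] in wanted:
            yield line


# Hypothetical records in the nine-column GTF layout
gtf_lines: List[str] = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\trRNA\texon\t500\t600\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
    'MT\tMt_tRNA\texon\t10\t80\t.\t+\t.\tgene_id "G3"; transcript_id "T3";\n',
]
mask = list(select_gtf_by_source(gtf_lines, ["tRNA", "rRNA", "Mt_tRNA", "Mt_rRNA"]))
print(len(mask))
```

The resulting lines would be written to `mask.gtf` and passed to Cufflinks with `-M`.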
SUMMARY
• Directory
• /KOGO/RNA-seq
• Tools
• Reference
• Pre-processing
FILTERING & QC
FILTERING & QC
• Improving assembly accuracy
• Removing artifacts
• Sequencing adaptor
• Low quality reads
• Near-identical reads
• PCR amplification
• rRNA and other RNA
• Applications
• Filtering - TBI-toolkit, fastx-toolkit
• QC - FastQC, SolexaQC, RSeQC
QUALITY CONTROL
• FastQC ( v0.10.1 )
• A quality control tool for high throughput sequence data.
• Java
• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• RSeQC
• RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data
• http://code.google.com/p/rseqc/
FASTQC
Usage)
$ fastqc seqfile1 seqfile2 .. seqfileN
$ fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN
Arguments
-f format : bam, sam, bam_mapped, sam_mapped and fastq
-t threads
Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_1.fq.gz S01_2.fq.gz
Output)
$ firefox S01_1.fq_fastqc/fastqc_report.html
$ firefox S01_2.fq_fastqc/fastqc_report.html
FASTQC
Per Base Sequence Quality, Per Sequence Quality Scores, Per Base Sequence Content, Per Base GC Content,
Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Duplicate Sequences
RSEQC
READ FILTERING (CUTOFF)
Item | RNA-seq | DNA-seq
Low quality | N > 10%; average QV < Q20; NT (<Q20) > 40% | N > 10%; average QV < Q20; NT (<Q20) > 5%
Trimming | No trimming or trimming | Trimming
FILTERING
Usage)
$ TBI-toolkit-fq_filter [option*] seqfile_1 seqfile_2 output_1 output_2
Option)
-n N_ratio : ratio of N bases in a read
-a integer : average QV of a read
-m NT_ratio : ratio of bases whose QV is below the cutoff
Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-fq_filter -n 0.1 -m 0.4 -a 20 S01_1.fq.gz S01_2.fq.gz S01_Q20_1.fq.gz S01_Q20_2.fq.gz
$ ls
S01_Q20_1.fq.gz S01_Q20_2.fq.gz S01_Q20.log S01_Q20.err
$ cat S01_Q20.log
$ less S01_Q20.err
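The RNA-seq cutoffs used above (N ratio 0.1, average QV 20, low-quality base ratio 0.4) can be expressed as a small predicate. This is a sketch of the criteria as stated on the slide, assuming Phred33 qualities; it is not the TBI-toolkit implementation, and dropping a pair when either mate fails is an assumption.

```python
def passes_filter(sequence: str, quality: str, max_n_ratio: float = 0.1,
                  min_avg_qv: float = 20, max_low_ratio: float = 0.4,
                  qv_cutoff: int = 20, offset: int = 33) -> bool:
    """Return True if a read survives the three low-quality cutoffs."""
    length = len(sequence)
    if length == 0 or length != len(quality):
        return False
    scores = [ord(char) - offset for char in quality]
    n_ratio = sequence.upper().count("N") / length             # N > 10% fails
    average_qv = sum(scores) / length                           # average QV < Q20 fails
    low_ratio = sum(1 for q in scores if q < qv_cutoff) / length  # NT(<Q20) > 40% fails
    return (n_ratio <= max_n_ratio
            and average_qv >= min_avg_qv
            and low_ratio <= max_low_ratio)


def pair_passes(read1, read2, **cutoffs) -> bool:
    """Assumed paired-end rule: keep the pair only if both mates pass."""
    return passes_filter(*read1, **cutoffs) and passes_filter(*read2, **cutoffs)


print(passes_filter("ACGTACGTAC", "IIIIIIIIII"))   # all bases at Q40
print(passes_filter("ANNTACGTAC", "IIIIIIIIII"))   # 20% N
print(passes_filter("ACGTACGTAC", "#####IIIII"))   # half the bases at Q2
```

For the DNA-seq column of the cutoff table, `max_low_ratio=0.05` would be passed instead.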
FASTQC
Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_Q20_1.fq.gz S01_Q20_2.fq.gz
SUMMARY
• Read Quality
• FastQC
• RSeQC
• Filter
MAPPING READS(TOPHAT)
TOPHAT
• TopHat is a fast splice junction mapper for RNA-Seq reads.
• It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.
USAGE
Usage)
$ tophat [options] <bowtie_index_base> <reads1_1> <reads1_2>
Option | Value | Description
-o/--output-dir | string | The default is "./tophat_out".
-p/--num-threads | int | Use this many threads to align reads. The default is 1.
-r/--mate-inner-dist | int | This is the expected (mean) inner distance between mate pairs. The default is 50bp.
--mate-std-dev | int | The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
--library-type | fr-unstranded, fr-firststrand, fr-secondstrand | fr-unstranded : Standard Illumina; fr-firststrand : dUTP, NSR, NNSR; fr-secondstrand : Ligation, Standard SOLiD
--solexa-quals | - | Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals | - | Phred64/Illumina 1.3~1.5
-G/--GTF | Geneset | Geneset (GTF 2.2 or GFF3 formatted file)
--rg-id | string | Read group ID
--rg-sample | string | Sample ID
RUN
$ cd /KOGO/RNA-seq/outputs
$ tophat -o S01 -p 1 -r 170 --library-type fr-unstranded -G ../ref/ens.gtf --rg-id S01_Q20 --rg-sample S01_Q20 ../ref/chr.fa ../inputs/S01_Q20_1.fq.gz ../inputs/S01_Q20_2.fq.gz
Category | Option | Value
Output | -o/--output-dir | /KOGO/RNA-seq/outputs/S01
Thread | -p/--num-threads | 1
Inner Distance Mean | -r/--mate-inner-dist | 170
Inner Distance SD | --mate-std-dev | 20 (default)
Library Type | --library-type | fr-unstranded (Standard Illumina)
Quality Score | | Phred33 (default)
Geneset | -G/--GTF | /KOGO/RNA-seq/ref/ens.gtf
Read Group | --rg-id, --rg-sample | S01_Q20
check
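The value given to -r is the gap between the two mates, so it is the fragment length minus both read lengths. The TopHat manual gives the example of 300 bp fragments sequenced with 50 bp reads at each end, for which -r should be 200. The slides do not state the fragment size or read length behind the 170 used here, so the helper below is only a sketch of the arithmetic.

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Expected inner distance for paired-end reads: fragment minus both mates."""
    return fragment_length - 2 * read_length


# Example from the TopHat manual: 300 bp fragments, 50 bp reads at each end
print(mate_inner_distance(300, 50))
```

The fragment length here excludes adaptors; a negative result would indicate that the two mates overlap.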
ALGORITHM
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.
TOPHAT
• Two step method
• Extracting the transcript sequences and using Bowtie to align reads to this virtual transcriptome first.
• Only the reads that do not fully map to the transcriptome will then be mapped on the genome.
• Optimized for reads >= 75bp
• The values in the first column of the provided GTF/GFF file must match the name of the reference sequence in the Bowtie index you are using with TopHat.
OUTPUT
Filename Types Description
accepted_hits.bam BAM A list of read alignments in BAM format, coordinate-sorted.
unmapped.bam BAM A list of unmapped reads in BAM format.
junctions.bed UCSC BED A track of junctions reported by TopHat.
insertions.bed UCSC BED chromLeft refers to the last genomic base before the insertion.
deletions.bed UCSC BED chromLeft refers to the first genomic base of the deletion.
SIMPLE ALIGNMENT VIEW
Usage)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ samtools tview accepted_hits.bam ../../ref/chr.fa
Key Desc
? This window
Arrows Small scroll movement
H,J,K,L Large scroll movement
space Scroll one screen
backspace Scroll back one screen
g Go to specific location
m Color for mapping qual
n Color for nucleotide
b Color for base quality
. Toggle on/off dot view
q Exit
25:413751
MAPPING STATISTICS
Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools flagstat accepted_hits.bam
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
45338688 + 0 paired in sequencing
22757885 + 0 read1
22580803 + 0 read2
39796048 + 0 properly paired (87.78%:-nan%)
42308960 + 0 with itself and mate mapped
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)
Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ bam_stat.py -i accepted_hits.bam
Total Reads (Records): 45338688
QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits 1861695
Unmapped reads: 0
Multiple mapped reads: 586067
Uniquely mapped: 42890926
Read-1: 21527100
Read-2: 21363826
Reads map to '+': 21457407
Reads map to '-': 21433519
Non-splice reads: 32872272
Splice reads: 10018654
Reads mapped in proper pairs: 38402964
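When many samples are processed, the flagstat text is easier to compare after parsing it into numbers. The sketch below assumes the samtools 0.1.x line layout shown on this slide ("N + M description"); the function name is illustrative, and the example text reuses four lines of the S01 output.

```python
import re
from typing import Dict

LINE = re.compile(r"(\d+) \+ (\d+) (.+)")


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat description to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        # Drop a trailing percentage such as " (87.78%:-nan%)" from the label
        label = re.sub(r" \([^)]*%\)$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


flagstat_text = """45338688 + 0 in total (QC-passed reads + QC-failed reads)
39796048 + 0 properly paired (87.78%:-nan%)
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
"""
counts = parse_flagstat(flagstat_text)
total = next(value for label, value in counts.items() if label.startswith("in total"))
properly_paired_pct = 100 * counts["properly paired"] / total
singleton_pct = 100 * counts["singletons"] / total
print(round(properly_paired_pct, 2), round(singleton_pct, 2))
```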
SUMMARY
• TopHat
• Splice junction
• Geneset
• Two step method
• accepted_hits.bam
PCR DUPLICATES(OPTIONAL)
PCR DUPLICATION
• Removing reads that have same mapping coordinates.
• Tools
• samtools - rmdup
• Picard - MarkDuplicates
Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ samtools rmdup accepted_hits.bam accepted_hits.rmdup.bam
Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ java -jar /KOGO/RNA-seq/Tools/Picard/MarkDuplicates.jar INPUT=accepted_hits.bam OUTPUT=accepted_hits.mark_dup.bam ASSUME_SORTED=true REMOVE_DUPLICATES=true METRICS_FILE=accepted_hits.metric
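The idea of removing reads that share the same mapping coordinates can be sketched in a few lines. This is a simplified illustration, not the algorithm of samtools rmdup or Picard MarkDuplicates, which apply their own keys and tie-breaking rules; the record fields and the `score` used to pick a survivor are assumptions.

```python
from typing import Dict, List, Tuple


def remove_duplicates(alignments: List[dict]) -> List[dict]:
    """Keep one alignment per (chrom, pos, strand, mate_pos), preferring the highest score."""
    best: Dict[Tuple[str, int, str, int], dict] = {}
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"], aln["mate_pos"])
        if key not in best or aln["score"] > best[key]["score"]:
            best[key] = aln
    return list(best.values())


# Hypothetical alignments: r1 and r2 share coordinates, so only one survives
reads = [
    {"name": "r1", "chrom": "25", "pos": 413751, "strand": "+", "mate_pos": 413921, "score": 30},
    {"name": "r2", "chrom": "25", "pos": 413751, "strand": "+", "mate_pos": 413921, "score": 50},
    {"name": "r3", "chrom": "25", "pos": 413800, "strand": "-", "mate_pos": 413630, "score": 50},
]
kept = remove_duplicates(reads)
print([aln["name"] for aln in kept])
```

In RNA-seq, highly expressed genes naturally produce many reads with identical coordinates, which is one reason this step is treated as optional in this workflow.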
PCR DUPLICATION
Metric | accepted_hits.bam | samtools | Picard (Mark) | Picard (Remove)
in total | 45338688 | 29259330 | 45338688 | 27621444
duplicates | 0 | 0 | 17717244 | 0
mapped | 45338688 | 29259330 | 45338688 | 27621444
paired | 45338688 | 29259330 | 45338688 | 27621444
read1 | 22757885 | 14717809 | 22757885 | 13820471
read2 | 22580803 | 14541521 | 22580803 | 13800973
properly paired | 39796048 (87.78%) | 24471885 (83.64%) | 39796048 (87.78%) | 24945306 (90.31%)
with itself and mate mapped | 42308960 | 26229602 | 42308960 | 26660814
singletons | 3029728 (6.68%) | 3029728 (10.35%) | 3029728 (6.68%) | 960630 (3.48%)
with mate mapped to a different chr | 705846 | 705846 | 705846 | 655922
with mate mapped to a different chr (mapQ>=5) | 92166 | 92166 | 92166 | 52614
EXPRESSION(CUFFLINKS)
EXPRESSION & MODELING
Adam Roberts et al., Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 2011, 27:2325–2329
NORMALIZATION
• Read counts need to be properly normalized to extract meaningful expression estimates
• First, RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample
• Second, the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples
Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 Jun;8(6):469-77.
RPKM
The reads per kilobase of transcript per million mapped reads
RPKM = 10^9 × C / (N × L)
• C : the number of mappable reads that fell onto the gene's exons
• N : the total number of mappable reads in the experiment
• L : the sum of the exon lengths in base pairs
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5(7):621-628.
Relative Expression Level in Sample
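With C, N and L as defined for RPKM, the value is 10^9 × C / (N × L): dividing by L/1000 normalizes for transcript length and dividing by N/10^6 normalizes for sequencing depth. A small worked example, with made-up counts, shows why a longer gene with more reads can have the same expression level:

```python
def rpkm(c: int, n: int, l: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    c: reads mapped to the gene's exons
    n: total mapped reads in the experiment
    l: summed exon length in base pairs
    """
    return 1e9 * c / (n * l)


# Hypothetical numbers: 20 million mapped reads in the sample
short_gene = rpkm(c=500, n=20_000_000, l=2_000)
long_gene = rpkm(c=1_000, n=20_000_000, l=4_000)
print(short_gene, long_gene)
```

The long gene has twice the reads but also twice the length, so both genes come out at the same RPKM.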
CUFFLINKS
• Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples
• Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment
http://cufflinks.cbcb.umd.edu/index.html
CUFFLINKS PACKAGE
• cufflinks
• assembles transcripts
• estimates their abundances
• cuffmerge
• a script called cuffmerge that you can use to merge together several Cufflinks assemblies.
• cuffdiff
• tests for differential expression
USAGE
$ cufflinks [options] <aligned_reads.(sam/bam)>
Option | Value | Description
-o/--output-dir | string | Sets the name of the directory in which Cufflinks will write all of its output. The default is "./".
-p/--num-threads | int | Use this many threads to align reads. The default is 1.
-G/--GTF | geneset | (Quantification) Use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts.
-g/--GTF-guide | geneset | (Novel isoforms) Use the supplied reference annotation (GFF) to guide RABT assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
-M/--mask-file | mask geneset | (Improving accuracy) Ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file.
--library-type | fr-unstranded, fr-firststrand, fr-secondstrand | fr-unstranded : Standard Illumina; fr-firststrand : dUTP, NSR, NNSR; fr-secondstrand : Ligation, Standard SOLiD
RUN
$ cd /KOGO/RNA-seq/outputs
$ cufflinks -o S01 -p 1 --library-type fr-unstranded -g ../ref/ens.gtf -M ../ref/mask.gtf S01/accepted_hits.bam
Category Option Value
Output -o/--output-dir /KOGO/RNA-seq/outputs/S01
Thread -p/--num-threads 1
Guide Geneset -g/--GTF-guide /KOGO/RNA-seq/ref/ens.gtf
Mask Geneset -M/--mask-file /KOGO/RNA-seq/ref/mask.gtf
Library Type --library-type fr-unstranded
ALGORITHM
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.
CUFFLINKS EXPRESSION
• FPKM
• Fragments Per Kilobase of exon per Million fragments mapped
• analogous to single-read “RPKM”
• Isoform expression estimation
• maximum likelihood estimation
• Normalization
• by total number of mapped reads
• by upper quartile method
OUTPUT
File Description
transcripts.gtf The GTF file contains Cufflinks' assembled isoforms
isoforms.fpkm_tracking The estimated isoform-level expression values in the generic FPKM Tracking Format.
genes.fpkm_tracking The estimated gene-level expression values in the generic FPKM Tracking Format.
TRANSCRIPTS.GTF
Col. Name Example Description
1 seqname chrX Chromosome or contig name
2 source Cufflinks The name of the program that generated this file (always 'Cufflinks')
3 feature exon The type of record (always either "transcript" or "exon")
4 start 77696957 The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)
5 end 77712009 The rightmost coordinate of this record, inclusive.
6 score 1000 The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM)
7 strand + Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
8 frame . Cufflinks does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used.
9 attributes ... See below.
TRANSCRIPTS.GTF
Attribute Example Description
gene_id CUFF.1 Cufflinks gene id
transcript_id CUFF.1.1 Cufflinks transcript id
FPKM 101.267 Isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments
frac 0.7647 Reserved. Please ignore, as this attribute may be deprecated in the future
conf_lo 0.07 Lower bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo)
conf_hi 0.1102 Upper bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_hi)
cov 100.765 Estimate for the absolute depth of read coverage across the whole transcript
full_read_support yes When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.
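The attributes column of transcripts.gtf is a list of `key "value";` pairs, so it can be split with a regular expression. The sketch below parses a hypothetical attribute string built from the example values in the table; it is an illustration, not part of Cufflinks.

```python
import re
from typing import Dict

ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(column9: str) -> Dict[str, str]:
    """Turn the GTF attributes column into a dictionary of strings."""
    return dict(ATTRIBUTE.findall(column9))


# Hypothetical attribute string using the example values listed above
column9 = ('gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; '
           'frac "0.7647"; conf_lo "0.07"; conf_hi "0.1102"; cov "100.765"; '
           'full_read_support "yes";')
attributes = parse_attributes(column9)
fpkm = float(attributes["FPKM"])
print(attributes["transcript_id"], fpkm)
```

Numeric attributes arrive as strings and need an explicit `float()` conversion before any filtering or sorting.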
FPKM TRACKING FILES
Col. name Example Description
1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2 class_code = The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3 nearest_ref_id NM_008866.1 The reference transcript to which the class code refers, if any
4 gene_id NM_008866 The gene_id(s) associated with the object
5 gene_short_name Lypla1 The gene_short_name(s) associated with the object
6 tss_id TSS1 The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the object
8 length 2447 The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
9 coverage 43.4279 Estimate for the absolute depth of read coverage across the object
10 FPKM 8.01089 FPKM of the object in sample
11 FPKM_lo 7.03583 The lower bound of the 95% confidence interval on the FPKM of the object in sample
12 FPKM_hi 8.98595 The upper bound of the 95% confidence interval on the FPKM of the object in sample
13 status OK OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
SIMPLE STATISTICS
$ cd /KOGO/RNA-seq/outputs/S01
# Check highest expressed genes
$ sort -r -g -k 10 genes.fpkm_tracking | head -n 30
# Select FPKM S
$ cut -f 1,10 genes.fpkm_tracking > gene_fpkm_s
$ R
> data <- read.table("gene_fpkm_s", header=TRUE)
> fpkm_s <- as.numeric(data[,2])
> mean(fpkm_s)
> sd(fpkm_s)
> fpkm_s.log10 <- log(fpkm_s+1,10)
> bin_seq = seq(min(fpkm_s.log10-0.1),max(fpkm_s.log10+0.1),by=0.1)
> hist(fpkm_s.log10, breaks=bin_seq, xlab='log10(x+1)', ylab='Number of genes', axes=TRUE)
> boxplot(fpkm_s.log10)
SUMMARY
• Expression Level
• Normalization
• RPKM (FPKM)
• Length Bias
• Cufflinks
• Isoforms
• maximum likelihood estimation
CUFFMERGE
CUFFMERGE
• Use to merge together several Cufflinks assemblies
• Automatically filters a number of transfrags that are probably artifacts
• The main purpose of this script is to make it easier to make an assembly GTF file suitable for use with Cuffdiff
Trapnell C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016.
USAGE
$ cuffmerge [options] <assembly_GTF_list.txt>
Option Value Description
-o <outprefix> Write the summary stats into the text output file <outprefix>(instead of stdout)
-g/--ref-gtf geneset An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
-p/--num-threads <int> Use this many threads to align reads. The default is 1.
-s/--ref-sequence <seq_dir>/<seq_fasta>
This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present. The merge script will pass this option to cuffcompare, which will use the sequences to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.
RUN
$ cd /KOGO/RNA-seq/outputs
$ find ./ -iname transcripts.gtf > gtf_list.txt
$ cuffmerge -p 1 -g ../ref/ens.gtf -s ../ref/chr.fa gtf_list.txt
Category Option Value
Outputprefix -o /KOGO/RNA-seq/outputs
Geneset -g/--ref-gtf /KOGO/RNA-seq/ref/ens.gtf
Thread -p/--num-threads 1
Reference -s/--ref-sequence /KOGO/RNA-seq/ref/chr.fa
RUN
$ cd /KOGO/RNA-seq/outputs/merged_asm
$ less transcripts.gtf
$ less merged.gtf
$ gffread -g /KOGO/RNA-seq/ref/chr.fa -w transcripts.fa transcripts.gtf
$ head transcripts.fa
>CUFF.11.1 gene=CUFF.11
GTGCATGTAACCCAAGAAGGGTTTGGCTGGGGGCTGTGGCAGCGCCAGAGTTCTGTTCGAATCCCAATTGGGTTCTGGTCACAGATTTGGCATGGAGCAGAAGAGAGATACAGCATGGTTGAAAAGCAGTTATTGGCTAC
$ grep '>' transcripts.fa | head -n 30
>CUFF.2.1 gene=CUFF.2
>CUFF.11.1 gene=CUFF.11
>ENSGALT00000015891 gene=CUFF.11
>CUFF.12.1 gene=CUFF.12
DEG ANALYSIS
DIFFERENTIALLY EXPRESSED GENE
• Abundance of transcripts between different conditions
Zhang et al., Mol Cancer Res June 2006 4; 401
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data.Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25. Epub 2010 Mar 2.
LENGTH BIAS
Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009 Apr 16;4:14.
BIAS
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
REPLICATES
Item | Technical Replicates | Biological Replicates
Source | Same sample | Different samples
Purpose | Assess the reproducibility of the results | Measure a quantity from different sources under the same conditions
Issue | The differences are based only on technical issues in the measurement | Shows what is similar in your replicates and how they differ from a different set of conditions
Taylor S, Wakem M, Dijkman G, Alsarraj M, Nguyen M. A practical approach to RT-qPCR-Publishing data that conform to the MIQE guidelines. Methods. 2010 Apr;50(4):S1-5. doi: 10.1016/j.ymeth.2010.01.005.
http://wiki.answers.com/Q/What_is_defference_between_Biological_replicates_and_technical_replicates
More variance, More useful
DEG METHODS
Item | Cuffdiff | DEGseq | DESeq
Distribution | - | Poisson | Negative binomial
Level | Isoform | Gene | Gene
Input | Gene set, BAM files | Raw read count | Raw read count
Replicates | Technical replicates | Technical replicates | Biological replicates
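The difference between the Poisson and negative binomial models can be sketched numerically. Under a Poisson model the variance equals the mean, which suits technical replicates; the negative binomial adds a term that grows with the square of the mean, which leaves room for variation between biological replicates. The dispersion value 0.1 below is a hypothetical choice for illustration, not a value estimated by DESeq.

```python
def poisson_variance(mean: float) -> float:
    """Variance of a Poisson count with the given mean."""
    return mean


def negative_binomial_variance(mean: float, dispersion: float) -> float:
    """Variance of a negative binomial count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical dispersion; real values are estimated from replicates.
dispersion = 0.1
for mean in (10.0, 100.0, 1000.0):
    print(
        f"mean={mean:7.1f}  "
        f"Poisson var={poisson_variance(mean):9.1f}  "
        f"NB var={negative_binomial_variance(mean, dispersion):11.1f}"
    )
```

With a dispersion of zero the two models coincide; the gap widens quickly for highly expressed genes.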
CUFFDIFF
• Use to find significant changes in transcript expression, splicing, and promoter use.
Usage)
$ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>
Option | Value | Description
-o / --output-dir | <string> | Sets the name of the directory in which Cuffdiff will write all of its output. The default is "./".
-L / --labels | <label1,label2,...,labelN> | Specify a label for each sample, which will be included in various output files produced by Cuffdiff.
-p / --num-threads | <int> | Use this many threads to align reads. The default is 1.
RUN
$ cd /KOGO/RNA-seq/outputs
$ cuffdiff -o Diff-S01-S02 -L S01,S02 -p 1 merged_asm/merged.gtf S01/accepted_hits.bam S02/accepted_hits.bam
Category | Option | Value
Output | -o / --output-dir | /KOGO/RNA-seq/outputs/Diff-S01-S02
Label | -L / --labels | S01,S02
Thread | -p / --num-threads | 1
OUTPUT
Type | Files | Description
Genes | genes.fpkm_tracking, genes.count_tracking, genes.read_group_tracking | Gene [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each gene_id
Isoforms | isoforms.fpkm_tracking, isoforms.count_tracking, isoforms.read_group_tracking | Transcript [FPKMs, counts, read group tracking]
CDS | cds.fpkm_tracking, cds.count_tracking, cds.read_group_tracking | Coding sequence [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each p_id, independent of tss_id
Primary Transcripts | tss_groups.fpkm_tracking, tss_groups.count_tracking, tss_groups.read_group_tracking | Primary transcript [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each tss_id
FPKM TRACKING FILES
Col. | Column name | Example | Description
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript)
3 | nearest_ref_id | NM_008866.1 | The reference transcript to which the class code refers, if any
4 | gene_id | NM_008866 | The gene_id(s) associated with the object
5 | gene_short_name | Lypla1 | The gene_short_name(s) associated with the object
9 | coverage | 43.4279 | Estimate for the absolute depth of read coverage across the object
10 | q0_FPKM | 8.01089 | FPKM of the object in sample 0
13 | q0_status | OK | OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
14 | q1_FPKM | 8.55155 | FPKM of the object in sample 1
17 | q1_status | OK | OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,3,4,5,9,10,13,14,17 genes.fpkm_tracking | head
OUTPUT
Type | Files | Description
Genes | gene_exp.diff | Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id
Isoforms | isoform_exp.diff | Transcript differential FPKM.
CDS | cds_exp.diff | Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id
Primary Transcripts | tss_group_exp.diff | Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id
Splicing | splicing.diff | How much differential splicing exists between isoforms processed from a single primary transcript
CDS | cds.diff | The amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples
Promoter | promoter.diff | The amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples
GENE_EXP.DIFF
Col. | Name | Example | Description
1 | Tested id | XLOC_000001 | A unique identifier
2 | gene | Lypla1 | The gene_name(s) or gene_id(s) being tested
6 | Test status | NOTEST | OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
7 | FPKMx | 8.01089 | FPKM of the gene in sample x
8 | FPKMy | 8.551545 | FPKM of the gene in sample y
9 | log2(FPKMy/FPKMx) | 0.06531 | The (base 2) log of the fold change y/x
10 | test stat | 0.860902 | The value of the test statistic used to compute significance of the observed change in FPKM
11 | p value | 0.389292 | The uncorrected p-value of the test statistic
12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic
13 | significant | no | Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,2,7,8,9,10,11,12,13,14 gene_exp.diff | head
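How a q value relates to a p value can be sketched with the Benjamini-Hochberg procedure. This is a generic textbook implementation, not Cuffdiff's own code; the p-values and the 0.05 FDR threshold are assumptions chosen for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)

fdr = 0.05  # assumed threshold for this sketch
for p, q in zip(p_values, q_values):
    print(f"p={p:.3f}  q={q:.4f}  significant={'yes' if q <= fdr else 'no'}")
```

Three of the four genes have p below 0.05, but only one remains below the threshold after correction, which is why the "significant" column should be read from the q value rather than from p.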
SIMPLE STATISTICS
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis -1
gnuplot> set xlabel 'log(FPKMs of S01)'
gnuplot> set ylabel 'log(FPKMs of S02)'
gnuplot> pl 'genes.fpkm_tracking' u (log($10)):(log($14)) w points notitle, x notitle
gnuplot> exit
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ grep yes gene_exp.diff > gene_exp.diff.yes
$ less gene_exp.diff.yes
$ grep no gene_exp.diff > gene_exp.diff.no
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis lt 2
gnuplot> set xlabel 'log2foldchange'
gnuplot> set ylabel '-log(p-value)'
gnuplot> pl 'gene_exp.diff.no' u 10:(-log($12)) lt 0 notitle,\
         'gene_exp.diff.yes' u 10:(-log($12)) lt 1 pt 6 ps 2 t 'DE'
gnuplot> exit
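The same split into significant and non-significant genes can be sketched by comparing the "significant" column exactly, which avoids matching the letters "yes" or "no" elsewhere in a line (for example in a gene name). The column names below are assumed to match the header line of gene_exp.diff, and the rows other than Lypla1 are hypothetical, not real results. The sketch uses -log10 rather than the natural log of the gnuplot example.

```python
import csv
import io
import math

# Column names assumed to match the header line of gene_exp.diff.
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2",
          "status", "value_1", "value_2", "log2(fold_change)", "test_stat",
          "p_value", "q_value", "significant"]

# Illustrative rows only; GeneA and Nono values are made up.
ROWS = [
    ["XLOC_000001", "XLOC_000001", "Lypla1", "chr1:100-200", "S01", "S02",
     "NOTEST", "8.01089", "8.551545", "0.06531", "0.860902", "0.389292",
     "0.985216", "no"],
    ["XLOC_000002", "XLOC_000002", "GeneA", "chr1:300-900", "S01", "S02",
     "OK", "5.0", "40.0", "3.0", "4.2", "0.0001", "0.004", "yes"],
    ["XLOC_000003", "XLOC_000003", "Nono", "chr2:50-750", "S01", "S02",
     "OK", "30.0", "7.5", "-2.0", "-3.9", "0.001", "0.02", "yes"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def split_by_significance(handle):
    """Split rows of a gene_exp.diff-like table on the 'significant' column."""
    significant, not_significant = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["significant"] == "yes":
            significant.append(row)
        else:
            not_significant.append(row)
    return significant, not_significant


def volcano_point(row):
    """Return (log2 fold change, -log10 p-value); assumes p_value > 0."""
    return float(row["log2(fold_change)"]), -math.log10(float(row["p_value"]))


significant, not_significant = split_by_significance(io.StringIO(SAMPLE))
for row in significant:
    x, y = volcano_point(row)
    print(row["gene"], round(x, 3), round(y, 3))
```

To run it on real output, replace `io.StringIO(SAMPLE)` with an open file handle for gene_exp.diff.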
DESEQ
• Differential gene expression analysis based on the negative binomial distribution
• R
• raw count
• biological replicates
• http://bioconductor.org/packages/release/bioc/html/DESeq.html
http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf
HTSEQ-COUNT
• Counts how many reads map to each feature
• Reads may be left uncounted for any feature for various reasons, namely:
• no_feature: reads which could not be assigned to any feature
• ambiguous: reads which could have been assigned to more than one feature and hence were not counted for any of these
• too_low_aQual: reads which were not counted due to the -a option
• not_aligned: reads in the SAM file without alignment
• alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag.
• If you have paired-end data, you have to sort the SAM file by read name first
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
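The counting rule can be sketched for the "union" mode: collect every feature that overlaps any base of the read, then count the read only if exactly one feature is found. This is a simplified illustration, not HTSeq's implementation. It ignores strand, alignment quality, multiple alignments and spliced reads, and the gene names and coordinates are hypothetical.

```python
from collections import Counter


def assign_read_union(read_start, read_end, features):
    """Assign one read to a feature in 'union' style.

    features: list of (feature_id, start, end) with half-open [start, end)
    intervals, e.g. one entry per exon carrying the gene_id.
    """
    overlapping = {
        feature_id
        for feature_id, start, end in features
        if start < read_end and read_start < end
    }
    if not overlapping:
        return "no_feature"
    if len(overlapping) > 1:
        return "ambiguous"
    return next(iter(overlapping))


# Hypothetical annotation: geneA has two exons, geneB overlaps the second.
features = [("geneA", 100, 200), ("geneA", 300, 400), ("geneB", 350, 500)]
# Hypothetical read intervals.
reads = [(120, 170), (360, 390), (200, 300), (420, 480)]

counts = Counter(assign_read_union(start, end, features) for start, end in reads)
for name, value in sorted(counts.items()):
    print(name, value)
```

The read at 360-390 overlaps both genes and is therefore reported as ambiguous rather than counted for either.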
HTSEQ-COUNT
If you have paired-end data, you have to sort the SAM file by read name first
Usage)
$ htseq-count [options] <sam_file> <gff_file> (e.g. an Ensembl GTF)
Options)
-m <union, intersection-strict, or intersection-nonempty>
-s, --stranded=<yes, no, or reverse>: whether the data is from a strand-specific assay (default: yes)
Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools sort -n accepted_hits.bam accepted_hits.nameSorted
$ samtools view accepted_hits.nameSorted.bam | htseq-count -m union -s no - ../merged_asm/merged.gtf > accepted_hits.count
$ less accepted_hits.count
# ..... for (S02, S03, S04)
RUN
Run)
$ cd /KOGO/RNA-seq/outputs
$ mkdir DESeq
$ TBI-toolkit-make_matrix S01/hits.count 2 S02/hits.count 2 S03/hits.count 2 S04/hits.count 2 > DESeq/hits.mtx
$ cd DESeq
$ less hits.mtx
$ cp /KOGO/RNA-seq/scripts/DESeq.4samples.R .
$ R CMD BATCH DESeq.R
DESeq.2samples for 2 samples
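Building a count matrix from per-sample htseq-count outputs can be sketched as below. This is not the TBI-toolkit-make_matrix tool, whose behavior and arguments are not described here; it is a hypothetical stand-in that assumes each count file has two tab-separated columns (feature, count) and that the special counters listed for htseq-count should be dropped. Sample names and counts are made up.

```python
import io

# Special counters written by htseq-count (newer versions prefix them with "__").
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}


def read_counts(handle):
    """Read 'feature<TAB>count' lines, skipping the special counters."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.lstrip("_") in SPECIAL:
            continue
        counts[feature] = int(value)
    return counts


def build_matrix(samples):
    """samples: dict of sample name -> counts dict. Returns rows of strings."""
    features = sorted(set().union(*samples.values()))
    rows = [["gene"] + list(samples)]
    for feature in features:
        # A feature missing from one sample is treated as a zero count.
        rows.append([feature] + [str(samples[name].get(feature, 0)) for name in samples])
    return rows


# Hypothetical count files for two samples.
s01 = io.StringIO("XLOC_000001\t10\nXLOC_000002\t0\nno_feature\t500\n")
s02 = io.StringIO("XLOC_000001\t7\nXLOC_000003\t3\n__ambiguous\t20\n")

matrix = build_matrix({"S01": read_counts(s01), "S02": read_counts(s02)})
for row in matrix:
    print("\t".join(row))
```

The resulting table has one row per gene and one raw-count column per sample, which is the kind of input a count-based method such as DESeq expects.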
DEG METHODS
• Cuffdiff, baySeq, DESeq, edgeR and NOISeq generated consistent results
• edgeR identified more differentially expressed genes than the other methods at the same cut-off, which might imply weaker control of type I error with this method
Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012 Nov 1;40(20):10084-97.
SUMMARY
• DEG
• Replicate
• Technical replicates
• Biological replicates
• Cuffdiff
• HTSeq-count
• DESeq
REPORT(CUMMERBUND)
CUMMERBUND
• An R package that is designed to aid and simplify the task of analyzing Cufflinks RNA-Seq output.
• R
• using SQLite
• cuffData.db
CUMMERBUND DB SCHEMA
RUN
Run)
$ cd /KOGO/outputs/Diff-S01-S02
$ R
> library(cummeRbund)
> cuff <- readCufflinks()
> cuff
# Global statistics and Quality Control
> disp<-dispersionPlot(genes(cuff))
> disp
# Density
> dens<-csDensity(genes(cuff))
> dens
# Boxplot
> b<-csBoxplot(genes(cuff))
> b
# Volcano
> v<-csVolcanoMatrix(genes(cuff))
> v
> v<-csVolcano(genes(cuff),"S01","S02")
# Pairwise Scatterplots
> s<-csScatter(genes(cuff),"S01","S02",smooth=T)
> s
# Geneset level plots
> data(sampleData)
> myGeneIds <- sampleIDs
> myGenes <- getGenes(cuff,myGeneIds)
> h<-csHeatmap(myGenes,cluster='both')
> h
# Barplot
> b <- expressionBarplot(myGenes)
> b
# Cluster
> ic <-csCluster(myGenes,k=4)
> icp <- csClusterPlot(ic)
> icp
ADDITIONAL ANALYSIS
VIEWER
• IGV
• Integrative Genomics Viewer
• http://www.broadinstitute.org/igv/
Run) Generate BAM index
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ ls
accepted_hits.bai
GO ENRICHMENT
• GO annotation
• Using SwissProt
• Blastx
• Blast2Go
• InterProScan
• GO Enrichment
• GOseq
• Fisher's exact test
• DAVID
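The Fisher's exact test used for GO enrichment can be sketched as a one-sided hypergeometric tail: given how many background genes carry a GO term, how likely is it to see at least the observed number among the DEGs by chance? This is a generic sketch, not the procedure of GOseq or DAVID (GOseq additionally corrects for length bias); the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(deg_in_term, deg_total, term_total, background_total):
    """One-sided Fisher's exact test: P(at least deg_in_term DEGs in the term).

    deg_in_term:      DEGs annotated with the GO term
    deg_total:        all DEGs
    term_total:       background genes annotated with the GO term
    background_total: all background genes
    """
    upper = min(deg_total, term_total)
    all_draws = comb(background_total, deg_total)
    tail = sum(
        comb(term_total, i) * comb(background_total - term_total, deg_total - i)
        for i in range(deg_in_term, upper + 1)
    )
    return tail / all_draws


# Hypothetical counts: 20 background genes, 5 in the GO term,
# 4 DEGs of which 3 carry the term.
p = enrichment_p_value(deg_in_term=3, deg_total=4, term_total=5, background_total=20)
print(f"enrichment p-value = {p:.4f}")
```

In practice one p-value is computed per GO term, so the results still need a multiple-testing correction before terms are reported as enriched.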
CONCLUSION
• (m)RNA-seq analysis
• Reference-based method
• NGS data analysis
• RNA-seq vs. DNA-seq
• Filtering
• Low Quality
• PCR Duplication
• Mapping
• RNA mapper
• Gene Expression
• Normalization
• DEG Analysis
• RPKM
• Replicates
END