kogo 2013 rna-seq analysis

RNA-SEQ ANALYSIS

고준수, 송상훈, 김현민

Theragen Bio Institute (테라젠 바이오 연구소)

2012. 2. 5

CONTENTS

• NGS

• RNA-seq

• File Format

• Workflow

• Preparation

• Filtering & QC

• Mapping

• PCR Duplication

• Expression

• DEG

• Report

TODAY’S KEYWORDS

NGS: Illumina, Paired-End

RNA-seq: mRNA, Reference-based

Mapping: TopHat

Expression: Cufflinks, Cuffmerge

DEG: Cuffdiff, DESeq

Design: Replicates

File Format: Fastq, BAM

NEXT-GENERATION SEQUENCING

SEQUENCING

Sanger (1st Generation)

NEXT-GENERATION SEQUENCING

Nat Rev Genet. 2010 Jan;11(1):31-46. doi: 10.1038/nrg2626. Epub 2009 Dec 8. Sequencing technologies - the next generation. Metzker ML.

2nd Generation

3rd Generation

NGS WEAKNESS AND OVERCOMING

Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341.

Sanger 0.001%

Nature Biotechnology 26, 1135 - 1145 (2008), Next-generation DNA sequencing, Shendure J. and Ji H.

NGS

http://users.ugent.be/~avierstr/nextgen/nextgen.html

Library Construction Sequencing

RawReads



GENERAL NGS ANALYSIS PROCESS

Shearer AE, Hildebrand MS, Sloan CM, Smith RJ. Deafness in the genomics era. Hear Res. 2011 Dec;282(1-2):1-9. doi: 10.1016/j.heares.2011.10.001. Epub 2011 Oct 8.

Mapping1 WGS

Low depth < NT < High depth

3

Depth(Coverage)

2

Coverage

Speed

MAPPING TOOLS

• Mapper Type

• DNA

• RNA

• miRNA

• bisulphite

Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.

BWA for WGS

TopHat for RNA

PCR DUPLICATION

[Figure: removal of duplicate reads]

http://www.clcbio.com/clc-plugin/duplicate-reads-removal-plugin/




ILLUMINA PAIRED-END

Quinlan AR, Boland MJ, Leibowitz ML, Shumilina S, Pehrson SM, Baldwin KK, Hall IM. Genome sequencing of mouse induced pluripotent stem cells reveals retroelement stability and infrequent DNA rearrangement during reprogramming. Cell Stem Cell. 2011 Oct 4;9(4):366-73. doi: 10.1016/j.stem.2011.07.018.

Haas BJ, Zody MC. Advancing RNA-Seq analysis. Nat Biotechnol. 2010 May;28(5):421-3. doi: 10.1038/nbt0510-421.

mate-pair inner distance



fastq_1

fastq_2

http://vallandingham.me/RNA_seq_differential_expression.html











SUMMARY

• NGS platform : Short Reads, Depth, Coverage

• Sequencing Protocol

• Analysis Protocol

• Mapping

• PCR duplication

• Illumina Paired-end

TRANSCRIPTOME / RNA-SEQ

TRANSCRIPTOME

• The complete set of transcripts in a cell, and their quantity

• The key aims of transcriptomics are:

• to catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs

• to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications

• to quantify the changing expression levels of each transcript during development and under different conditions.

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.

ADVANTAGES OF RNA-SEQ


RNA-SEQ & MICROARRAY


RNA-SEQ

• Gene expression level

• Relative expression level in sample

• Differentially expressed gene

• Identification of alternative spliced transcripts

• Prediction of novel transcripts

• Gene Fusion


RNA-SEQ VS. DNA-SEQ

Methods
    RNA-seq: Reference-based, de novo assembly
    DNA-seq: WES, WGS re-sequencing, WGS de novo

Goal
    RNA-seq: Expression, Differentially Expressed Genes, Novel transcript, Alternative splicing form, Gene fusion
    DNA-seq: SNPs, Indels, SV

Measure
    RNA-seq: Mapped Read Count
    DNA-seq: Base accuracy

OVERVIEW OF A TYPICAL RNA-SEQ

RNA MAPPING

Trapnell C, Salzberg SL., How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.

Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11(12):220.

MAPPER

Mapper Data Seq.Plat. Input Output Cit. Cit./year Reference

MapSplice RNA I FASTA/Q SAM, BED 50 28.17 Wang et al. (2010)

MicroRazerS miRNA N FASTA/Q SAM, TSV 7 2.75 Emde et al. (2010)

mrFAST miRNA I FASTA/Q SAM 158 58.34 Alkan et al. (2009)

mrsFAST miRNA I,So FASTA/Q SAM 32 18.03 Hach et al. (2010)

Passion RNA I,4,Sa,P FASTA/Q BED - - Zhang et al. (2012)

PatMaN miRNA N FASTA TSV 38 9.36 Prufer et al. (2008)

QPALMA RNA I,4 Specific TSV 75 21.11 De Bona et al. (2008)

RNA-Mate RNA So CFASTA BED, Counts 28 10.04 Cloonan et al. (2009)

RUM RNA I,4 FASTA/Q SAM,TSV,BED 2 2.36 Grant et al. (2011)

SOAPSplice RNA I,4 FASTA/Q TSV 3 3.54 Huang et al. (2011)

SpliceMap RNA I FASTA/Q SAM, BED 63 29.80 Au et al. (2010)

Supersplat RNA N FASTA TSV 21 9.93 Bryant Jr et al. (2010)

TopHat RNA I FASTA/Q, GFF BAM 389 121.04 Trapnell et al. (2009)

Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.

The number of citations (Cit.) was obtained from Google Scholar on April 14, 2012

ANALYSIS STRATEGIES

Reference-based

• Method: uses a reference genome; the transcriptome assembly is built on it

• Adv.: contamination or sequencing artefacts are not a major concern; very sensitive and can assemble transcripts of low abundance; can discover novel transcripts that are not present in the current annotation

• Disadv.: depends on the quality of the reference genome being used

• Depth: ~10x

de novo

• Method: does not use a reference genome

• Adv.: does not depend on a reference genome; does not depend on the correct alignment of reads to known splice sites or the prediction of novel splice sites; trans-spliced transcripts can be assembled

• Disadv.: computing resources; sensitive to sequencing errors

• Depth: >30x

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.

REFERENCE-BASED

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.

SUMMARY

• Transcriptome

• RNA-seq advantages

• Process

• Analysis strategies

• Reference-based method

NGS FILE FORMAT

FILE FORMAT

• NGS

• Fastq

• SAM/BAM

• VCF

• Reference

• Fasta

• GTF / GFF

S01_1.fq

S01_2.fq

FASTQ FORMAT

• de facto standard file format for raw reads

• fq, fastq, fq.gz, fastq.gz

1: @title identifier description

2: Sequence

3: + description

4: Quality values

[Figure: Paired-end Sequencer → Fastq]
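The four-line record layout above can be read with a few lines of Python. This is a minimal sketch, not part of the original workshop material: the names FastqRecord, read_fastq and open_fastq are hypothetical, and it assumes each read occupies exactly four lines (no wrapped sequence lines).

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    title: str     # line 1 without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {title!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {title!r}")
        yield FastqRecord(title[1:], sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed file (fq, fastq, fq.gz, fastq.gz)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Small in-memory example instead of a real S01_1.fq file.
example = io.StringIO("@read1/1 example\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records[0])
```

For paired-end data the two files (for example S01_1.fq and S01_2.fq) would be read in parallel, since mates are expected to appear in the same order in both files.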

QUALITY SCORE

• The base-calling error probabilities.

• Types

• Phred33 / Illumina 1.8+: score 0~93, ASCII 33~126

• Solexa / Illumina 1.0: score -5~62, ASCII 59~126

• Phred64 / Illumina 1.3~1.5: score 0~62, ASCII 64~126
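The relation between a quality character, its score and the error probability can be sketched as follows. The function names are hypothetical; the conversion assumes Phred-scaled scores (Q = -10 log10 P), so it applies to the Phred33 and Phred64 encodings but not directly to the older Solexa scale.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer scores.

    offset is 33 for Phred33 (Illumina 1.8+) and 64 for Phred64
    (Illumina 1.3 ~ 1.5).
    """
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Base-calling error probability for a Phred score."""
    return 10 ** (-score / 10)


scores = decode_quality("!5I")           # Phred33
print(scores)                            # '!' is ASCII 33, so its score is 0
print(error_probability(20))             # Q20 means a 1% error probability
print(decode_quality("h", offset=64))    # Phred64
```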

http://www.asciitable.com



SAM / BAM FORMAT

• SAM stands for Sequence Alignment/Map format.

• TAB-delimited text format

• 11 mandatory fields

[Figure: Sequencer → Fastq → Mapper → SAM/BAM]

[Figure: annotated SAM record showing Read Name, Flag, Reference, Position, Quality, CIGAR, Pos. of Mate, Length]
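As an illustration of the 11 mandatory fields and the Flag column, the sketch below splits one alignment line and decodes the flag bits. Field names follow the SAM specification; the helper names and the example read are made up for this illustration.

```python
from typing import Dict, List

# The 11 mandatory SAM fields, in column order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Meaning of the individual FLAG bits.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one TAB-delimited alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record: Dict[str, object] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, if any
    return record


def decode_flag(flag: int) -> List[str]:
    """List the properties encoded in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "read1\t99\tchr1\t100\t50\t5M\t=\t250\t155\tACGTN\tIIII#\tNH:i:1"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

In practice `samtools view` prints BAM records in this text form, so the same parsing applies to its output.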

TOOLS FOR SAM/BAM

• Samtools

• index

• view

• sort

• faidx

• flagstat

• tview

• mpileup

• Picard

• SortSam

• MarkDuplicates

• ......

GTF (ENSEMBL)

protein_coding, mtRNA, miRNA, lincRNA, pseudogene......

Gene ID Transcript ID

SUMMARY

• Fastq format

• de facto standard

• Quality Score

• Phred33/Illumina 1.8+, Illumina 1.0, Phred64/Illumina 1.3~1.5

• SAM/BAM format

• GTF

WORKFLOW

REFERENCE

REFERENCE WORKFLOW

[Figure: TopHat (Sample1, Sample2 → mapped reads) → Cufflinks (assembled transcripts) → Cuffmerge (final transcriptome assembly) → Cuffdiff (differential expression results) → CummeRbund (expression plots)]

[Figure: tools used in the workflow: Picard, Samtools; RSeQC, FastQC; cummeRbund, GO; Cuffdiff, DEGseq, DESeq; Cufflinks, HTseq-count, Cuffmerge; TopHat, RUM, BWA, Bowtie2; TBI-toolkit]

OUR WORKFLOW

[Figure: pipeline stages: Filtering, Read Mapping, Duplication, Gene Structure, Expression Level, DEG analysis, Report. Inputs: Samples, Reference, Geneset. Annotation: UniProt, GO, KEGG]

PREPARATION

DIRECTORY

/KOGO/RNA-seq
    ref: chr.fa, ens.gtf, mask.gtf
    inputs: S01.fq.gz, S02.fq.gz
    outputs
        S01: accepted_hits.bam, transcripts.gtf
        ......: accepted_hits.bam, transcripts.gtf
        merged_asm: merged.gtf, transcripts.gtf
        Diff-S01-S02: gene_exp.diff, isoform_exp.diff
    scripts
    Tools

SAMPLES

           Before exercise    After exercise
Horse 1    S01                S02
Horse 2    S03                S04

TOOLS

Category    Programs     Version  Homepage
QC          FastQC       0.10.1   http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Mapper      Bowtie2      2.0.5    http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Mapper      TopHat       2.0.7    http://tophat.cbcb.umd.edu
Abundance   Cufflinks    2.0.2    http://cufflinks.cbcb.umd.edu
Abundance   HTseq-count  -        http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Abundance   DESeq        1.10.1   http://bioconductor.org/packages/release/bioc/html/DESeq.html
Annotation  goseq        1.10.0   http://www.bioconductor.org/packages/2.11/bioc/html/goseq.html
Tools       samtools     0.1.18   http://samtools.sourceforge.net
Tools       picard       1.83     http://picard.sourceforge.net
Tools       TBI-toolkit  0.1      http://dev.totalomics.kr
Tools       R            2.15.0   http://www.r-project.org
Tools       Gnuplot      -        http://www.gnuplot.info


TBI-TOOLKIT

• TBI NGS Toolkit

• http://dev.totalomics.kr

• Application

• TBI-toolkit-qscore

• TBI-toolkit-fq_filter

• TBI-toolkit-gtf_selector

• TBI-toolkit-fa_spliter

• TBI-toolkit-make_matrix



REFERENCE

• Reference-based strategy

Name       FileType     Description
Reference  fasta        Genome Sequence
Geneset    GTF2.2/GFF3  Reference Geneset

Optional

Name            Source     Description
Mask Geneset    Geneset    Geneset that has ncRNA information (rRNA, tRNA, and other ncRNA)
Bowtie2 Index   Reference  Index files for running bowtie2
GO information  GO         Gene ontology information for GO enrichment

REFERENCE SOURCE

• Ensembl (http://www.ensembl.org)

• General file format for all species

• Geneset (GTF format)

• Constant Database schema for all species

• Comprehensive Annotation (GO, InterPro, Pfam, Prosite Smart, ...... )

• Automated Update

• UCSC (http://genome.ucsc.edu)

• Semi general file format for all species

• Semi constant Database schema for all species

• Gene table dump (BED format compatible)

• Annotation (Pfam, Kegg)

• Comparative Analysis

• NCBI

• Raw data bank

• GFF type geneset file


ENSEMBL

ensembl.org, plants.ensembl.org, fungi.ensembl.org, metazoa.ensembl.org, protists.ensembl.org, bacteria.ensembl.org

ENSEMBL

• Homo Sapiens ( ftp://ftp.ensembl.org/pub/release-69 )

• fasta/homo_sapiens/

• dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz

• dna/Homo_sapiens.GRCh37.69.dna.chromosome.1.fa.gz

• cdna/Homo_sapiens.GRCh37.69.cdna.all.fa.gz

• gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz

• mysql/homo_sapiens_core_69_37/

• Arabidopsis thaliana ( ftp://ftp.ensemblgenomes.org/pub/release-16/plants )

• fasta/arabidopsis_thaliana

• dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz

• cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz

• gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz

• mysql/arabidopsis_thaliana_core_16_69_10/

chr.fa

ens.gtf

PRE-PROCESSING

• Check quality score type of input file

• Reference files

• Reference index

• Mask geneset

SAMPLE QUALITY SCORE

Usage)
$ TBI-toolkit-qscore [FASTQ]

Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-qscore S01_1.fq.gz
Sanger(Phred33) or Illumina 1.8+
0 to 93 using ASCII 33 to 126
0:!, 1:", 2:#, 3:$, 4:%, 5:&, ......

REFERENCE INDEX

Index for bowtie2 mapper

Usage)
$ bowtie2-build [options] <reference_in> <bt2_base>

Run)
$ cd /KOGO/RNA-seq/ref
$ bowtie2-build chr.fa chr.fa
$ ls
chr.fa.1.bt2 chr.fa.2.bt2 ......

Fasta index

Usage)
$ samtools faidx <ref.fasta>

Run)
$ cd /KOGO/RNA-seq/ref
$ samtools faidx chr.fa
$ ls
chr.fa.fai

MASK GENESET

Usage)
$ TBI-toolkit-gtf_selector [IN GTF] [OUT GTF] [Source 1] [Source 2] ......

Run)
$ cd /KOGO/RNA-seq/ref
$ TBI-toolkit-gtf_selector ens.gtf mask.gtf tRNA rRNA Mt_tRNA Mt_rRNA

...... We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.

Cufflinks manual (http://cufflinks.cbcb.umd.edu/manual.html)
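For readers without access to TBI-toolkit, the selection step can be approximated in a few lines of Python. This is only a sketch of what the usage line suggests, not the tool's actual implementation: it assumes the "Source" arguments are matched against column 2 of the GTF (where Ensembl GTF files of this release store the biotype), and the function name select_by_source is hypothetical.

```python
from typing import Iterable, Iterator


def select_by_source(gtf_lines: Iterable[str],
                     sources: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose source column (column 2) is in `sources`."""
    wanted = set(sources)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        columns = line.split("\t")
        if len(columns) >= 9 and columns[1] in wanted:
            yield line


# Tiny made-up GTF fragment for illustration.
gtf = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    '1\trRNA\texon\t300\t400\t.\t-\t.\tgene_id "G2";\n',
    '#!comment line\n',
    'MT\tMt_tRNA\texon\t10\t80\t.\t+\t.\tgene_id "G3";\n',
]
mask = list(select_by_source(gtf, ["tRNA", "rRNA", "Mt_tRNA", "Mt_rRNA"]))
print("".join(mask), end="")
```

Writing the selected lines to mask.gtf would give a file that can be passed to the Cufflinks mask option described later in the deck.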


SUMMARY

• Directory

• /KOGO/RNA-seq

• Tools

• Reference

• Pre-processing

FILTERING & QC


FILTERING & QC

• Improving assembly accuracy

• Removing artifacts

• Sequencing adaptor

• Low quality reads

• Near-identical reads

• PCR amplification

• rRNA and other RNA

• Applications

• Filtering - TBI-toolkit, fastx-toolkit

• QC - FastQC, SolexaQC, RSeQC


QUALITY CONTROL

• FastQC ( v0.10.1 )

• A quality control tool for high throughput sequence data.

• Java

• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• RSeQC

• RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data

• http://code.google.com/p/rseqc/






FASTQC

Usages)
$ fastqc seqfile1 seqfile2 .. seqfileN
$ fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN

Arguments
-f format : bam, sam, bam_mapped, sam_mapped and fastq
-t threads

Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_1.fq.gz S01_2.fq.gz

Output)
$ firefox S01_1.fq_fastqc/fastqc_report.html
$ firefox S01_2.fq_fastqc/fastqc_report.html

FASTQC

[Figure: FastQC report panels: Per Base Sequence Quality, Per Sequence Quality Scores, Per Base Sequence Content, Per Base GC Content, Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Duplicate Sequences]

READ FILTERING (CUTOFF)

Low Quality
    RNA-seq: N > 10%, Average QV < Q20, NT (<Q20) > 40%
    DNA-seq: N > 10%, Average QV < Q20, NT (<Q20) > 5%

Trimming
    RNA-seq: No trimming or Trimming
    DNA-seq: Trimming

FILTERING

Usages)
$ TBI-toolkit-fq_filter [option*] seqfile_1 seqfile_2 output_1 output_2

Option)
-n N_ratio
-a integer : Average QV of read
-m NT_ratio < QV

Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-fq_filter -n 0.1 -m 0.4 -a 20 S01_1.fq.gz S01_2.fq.gz S01_Q20_1.fq.gz S01_Q20_2.fq.gz
$ ls
S01_Q20_1.fq.gz S01_Q20_2.fq.gz S01_Q20.log S01_Q20.err
$ cat S01_Q20.log
$ less S01_Q20.err
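The three cutoffs used above (-n 0.1, -a 20, -m 0.4) can be expressed as a simple per-read test. The sketch below is an interpretation of the option descriptions, not the TBI-toolkit source: it assumes Phred33 qualities, that -m counts bases below the same QV threshold given by -a, and that a pair is dropped when either mate fails. All function names are hypothetical.

```python
def passes_filter(sequence: str, quality: str,
                  max_n_ratio: float = 0.1,
                  min_average_qv: int = 20,
                  max_low_qv_ratio: float = 0.4,
                  offset: int = 33) -> bool:
    """Return True if a read survives the three low-quality cutoffs."""
    if not sequence or len(sequence) != len(quality):
        return False
    scores = [ord(char) - offset for char in quality]
    n_ratio = sequence.upper().count("N") / len(sequence)
    average_qv = sum(scores) / len(scores)
    low_qv_ratio = sum(1 for s in scores if s < min_average_qv) / len(scores)
    return (n_ratio <= max_n_ratio
            and average_qv >= min_average_qv
            and low_qv_ratio <= max_low_qv_ratio)


def passes_pair(read1: tuple, read2: tuple) -> bool:
    """Keep a pair only when both mates pass (an assumption of this sketch)."""
    return passes_filter(*read1) and passes_filter(*read2)


print(passes_filter("ACGTACGTAC", "I" * 10))      # high quality read
print(passes_filter("NNACGTACGT", "I" * 10))      # 20% N
print(passes_filter("ACGTACGTAC", "IIIII#####"))  # 50% of bases below Q20
```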

FASTQC

Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_Q20_1.fq.gz S01_Q20_2.fq.gz

SUMMARY

• Read Quality

• FastQC

• RSeQC

• Filter

MAPPING READS(TOPHAT)


TOPHAT

• TopHat is a fast splice junction mapper for RNA-Seq reads.

• It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.


Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.

USAGE

Usage
$ tophat [options] <bowtie_index_base> <reads1_1> <reads1_2>

Option Value Description

-o/--output-dir string The default is "./tophat_out".

-p/--num-threads int Use this many threads to align reads. The default is 1.

-r/--mate-inner-dist int This is the expected (mean) inner distance between mate pairs. The default is 50bp.

--mate-std-dev int The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.

--library-type fr-unstranded / fr-firststrand / fr-secondstrand. fr-unstranded : Standard Illumina. fr-firststrand : dUTP, NSR, NNSR. fr-secondstrand : Ligation, Standard SOLiD.

--solexa-quals - Use the Solexa scale for quality values in FASTQ files.

--solexa1.3-quals - Phred64/Illumina 1.3~1.5

-G/--GTF Geneset Geneset (GTF 2.2 or GFF3 formatted file)

--rg-id string Read group ID

--rg-sample string Sample ID

RUN

$ cd /KOGO/RNA-seq/outputs
$ tophat -o S01 -p 1 -r 170 --library-type fr-unstranded -G ../ref/ens.gtf --rg-id S01_Q20 --rg-sample S01_Q20 ../ref/chr.fa ../inputs/S01_Q20_1.fq.gz ../inputs/S01_Q20_2.fq.gz

Category Option Value

Output -o/--output-dir /KOGO/RNA-seq/outputs/S01

Thread -p/--num-threads 1

Inner Distance Mean -r/--mate-inner-dist 170

Inner distance SD. --mate-std-dev 20 (default)

Library Type --library-type fr-unstranded (Standard Illumina)

Quality Score Phred33 (default)

Geneset -G/--GTF /KOGO/RNA-seq/ref/ens.gtf

Read Group --rg-id, --rg-sample S01_Q20
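The value given to -r is the gap between the two mates, not the fragment size: inner distance = fragment length - 2 x read length. The sketch below illustrates the arithmetic with made-up numbers; the fragment and read lengths are assumptions for the example and are not taken from the workshop data.

```python
import statistics
from typing import Sequence, Tuple


def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Inner distance between mates: fragment minus both read lengths."""
    return fragment_length - 2 * read_length


def summarize_inner_distance(fragment_lengths: Sequence[int],
                             read_length: int) -> Tuple[float, float]:
    """Mean and standard deviation, usable for -r and --mate-std-dev."""
    inner = [mate_inner_distance(f, read_length) for f in fragment_lengths]
    return statistics.mean(inner), statistics.stdev(inner)


# A 300 bp fragment sequenced with 2 x 50 bp reads leaves a 200 bp gap.
print(mate_inner_distance(300, 50))

# Hypothetical library: fragments of 350-390 bp, 2 x 100 bp reads.
mean_inner, sd_inner = summarize_inner_distance([350, 370, 390], 100)
print(mean_inner, sd_inner)
```

A negative result simply means the two mates overlap.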

ALGORITHM

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.

TOPHAT

• Two step method

• Extracting the transcript sequences and using Bowtie to align reads to this virtual transcriptome first.

• Only the reads that do not fully map to the transcriptome will then be mapped on the genome.

• Optimized for reads >= 75bp

• The values in the first column of the provided GTF/GFF file must match the name of the reference sequence in the Bowtie index you are using with TopHat.

OUTPUT

Filename Types Description

accepted_hits.bam BAM A list of read alignments in SAM format. Coordinate-sorted.

unmapped.bam BAM A list of unmapped reads in SAM format.

junctions.bed UCSC BED A track of junctions reported by TopHat.

insertions.bed UCSC BED chromLeft refers to the last genomic base before the insertion.

deletions.bed UCSC BED chromLeft refers to the first genomic base of the deletion.

SIMPLE ALIGNMENT VIEW

Usage
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ samtools tview accepted_hits.bam ../../ref/chr.fa

Key Desc

? This window

Arrows Small scroll movement

H,J,K,L Large scroll movement

space Scroll one screen

backspace Scroll back one screen

g Go to specific location

m Color for mapping qual

n Color for nucleotide

b Color for base quality

. Toggle on/off dot view

q Exit

25:413751

MAPPING STATISTICS

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools flagstat accepted_hits.bam
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
45338688 + 0 paired in sequencing
22757885 + 0 read1
22580803 + 0 read2
39796048 + 0 properly paired (87.78%:-nan%)
42308960 + 0 with itself and mate mapped
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ bam_stat.py -i accepted_hits.bam
Total Reads (Records): 45338688
QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits 1861695
Unmapped reads: 0
Multiple mapped reads: 586067
Uniquely mapped: 42890926
Read-1: 21527100
Read-2: 21363826
Reads map to '+': 21457407
Reads map to '-': 21433519
Non-splice reads: 32872272
Splice reads: 10018654
Reads mapped in proper pairs: 38402964
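When many samples are processed, it can help to turn the flagstat text into numbers. The following is a minimal sketch that parses the samtools 0.1.x style output shown above and recomputes the properly-paired percentage; the function name parse_flagstat is hypothetical and newer samtools versions may print additional lines.

```python
import re
from typing import Dict

# "<passed> + <failed> <label>" with an optional trailing "(xx%:yy%)".
FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.*?)(?: \([\w.-]+%:.*\))?$")


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat label to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            counts[match.group(3)] = int(match.group(1))
    return counts


flagstat_text = """\
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
39796048 + 0 properly paired (87.78%:-nan%)
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)
"""
counts = parse_flagstat(flagstat_text)
total = counts["in total (QC-passed reads + QC-failed reads)"]
percent_proper = 100 * counts["properly paired"] / total
print(f"properly paired: {percent_proper:.2f}%")
```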

SUMMARY

• TopHat

• Splice junction

• Geneset

• Two step method

• accepted_hits.bam

PCR DUPLICATES(OPTIONAL)


PCR DUPLICATION

• Removing reads that have same mapping coordinates.

• Tools

• samtools - rmdup

• Picard - MarkDuplicates

Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ samtools rmdup accepted_hits.bam accepted_hits.rmdup.bam

Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ java -jar /KOGO/RNA-seq/Tools/Picard/MarkDuplicates.jar INPUT=accepted_hits.bam OUTPUT=accepted_hits.mark_dup.bam ASSUME_SORTED=true REMOVE_DUPLICATES=true METRICS_FILE=accepted_hits.metric
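The idea of "same mapping coordinates" can be illustrated with a toy example. This is a simplified sketch, not the algorithm used by samtools rmdup or Picard MarkDuplicates: the Alignment type is hypothetical, the key (chromosome, position, strand, mate position) is an assumption, and the read with the highest mapping quality is kept for each key.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    pos: int
    reverse: bool
    mate_pos: int
    mapq: int


def remove_duplicates(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep one alignment per coordinate key, preferring higher MAPQ."""
    best: Dict[Tuple[str, int, bool, int], Alignment] = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.reverse, aln.mate_pos)
        kept = best.get(key)
        if kept is None or aln.mapq > kept.mapq:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a.chrom, a.pos))


reads = [
    Alignment("r1", "chr1", 100, False, 250, 50),
    Alignment("r2", "chr1", 100, False, 250, 30),  # same coordinates as r1
    Alignment("r3", "chr1", 100, True, 250, 50),   # other strand, kept
    Alignment("r4", "chr2", 5, False, 90, 10),
]
unique = remove_duplicates(reads)
print([aln.name for aln in unique])
```

In RNA-seq, highly expressed genes can produce identical coordinates from genuinely different molecules, which is one reason this step is marked optional in the deck.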

PCR DUPLICATION

[Table: samtools flagstat results compared across accepted_hits.bam, samtools, Picard (Mark) and Picard (Remove); only the accepted_hits.bam column is recoverable]

accepted_hits.bam
45338688 + 0 in total
0 + 0 duplicates
45338688 + 0 mapped
45338688 + 0 paired
22757885 + 0 read1
22580803 + 0 read2
39796048 + 0 properly paired (87.78%:-nan%)
42308960 + 0 with itself and mate mapped
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)




EXPRESSION(CUFFLINKS)


EXPRESSION & MODELING

Adam Roberts et al., Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 2011, 27:2325–2329

NORMALIZATION

• Read counts need to be properly normalized to extract meaningful expression estimates

• First, RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample

• Second, the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 Jun;8(6):469-77.

RPKM

• C : the number of mappable reads that fell onto the gene’s exons

• N : the total number of mappable reads in the experiment

• L : the sum of the exons in base pairs

the reads per kilobase of transcript per million mapped reads

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):621-628.
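With C, N and L defined as above, the RPKM of Mortazavi et al. (2008) is RPKM = 10^9 x C / (N x L). A small sketch with made-up numbers (the function name and example values are for illustration only):

```python
def rpkm(exon_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    exon_reads          C: mappable reads that fell onto the gene's exons
    total_mapped_reads  N: total mappable reads in the experiment
    exon_length_bp      L: sum of the exons in base pairs
    """
    return 1e9 * exon_reads / (total_mapped_reads * exon_length_bp)


# 1,000 reads on a 2 kb gene in a 10 million read experiment.
print(rpkm(1000, 10_000_000, 2000))

# Twice the length with the same read count halves the value,
# which is the length correction described on the previous slide.
print(rpkm(1000, 10_000_000, 4000))
```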

Relative Expression Level in

Sample

CUFFLINKS

• Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples

• Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment


http://cufflinks.cbcb.umd.edu/index.html



CUFFLINKS PACKAGE

• cufflinks

• assembles transcripts

• estimates their abundances

• cuffmerge

• a script called cuffmerge that you can use to merge together several Cufflinks assemblies.

• cuffdiff

• tests for differential expression

USAGE

$ cufflinks [options] <aligned_reads.(sam/bam)>

-o/--output-dir String Sets the name of the directory in which Cufflinks will write all of its output. The default is "./".

-p/--num-threads int Use this many threads to align reads. The default is 1.

-G/--GTF geneset (Quantification) Use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts.

-g/--GTF-guide geneset (Novel Isoforms) Use the supplied reference annotation (GFF) to guide RABT assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.

-M/--mask-file mask geneset (Improving accuracy) Ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file.

--library-type fr-unstranded / fr-firststrand / fr-secondstrand. fr-unstranded : Standard Illumina. fr-firststrand : dUTP, NSR, NNSR. fr-secondstrand : Ligation, Standard SOLiD.

RUN

$ cd /KOGO/RNA-seq/outputs
$ cufflinks -o S01 -p 1 --library-type fr-unstranded -g ../ref/ens.gtf -M ../ref/mask.gtf S01/accepted_hits.bam


Output -o/--output-dir /KOGO/RNA-seq/outputs/S01


Guide Geneset -g/--GTF-guide /KOGO/RNA-seq/ref/ens.gtf

Mask Geneset -M/--mask-file /KOGO/RNA-seq/ref/mask.gtf

Library Type --library-type fr-unstranded

ALGORITHM

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.

CUFFLINKS EXPRESSION

• FPKM

• Fragments Per Kilobase of exon per Million fragments mapped

• analogous to single-read “RPKM”

• Isoform expression estimation

• maximum likelihood estimation

• Normalization

• by total number of mapped reads

• by upper quartile method

OUTPUT

File Description

transcripts.gtf The GTF file contains Cufflinks' assembled isoforms

isoforms.fpkm_tracking The estimated isoform-level expression values in the generic FPKM Tracking Format.

genes.fpkm_tracking The estimated gene-level expression values in the generic FPKM Tracking Format.

TRANSCRIPTS.GTF

Col. Name Example Description

1 seqname chrX Chromosome or contig name

2 source Cufflinks The name of the program that generated this file (always 'Cufflinks')

3 feature exon The type of record (always either "transcript" or "exon")

4 start 77696957 The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)

5 end 77712009 The rightmost coordinate of this record, inclusive.

6 score 1000 The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM)

7 strand + Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."

8 frame . Cufflinks does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used.

9 attributes ... See below.

TRANSCRIPTS.GTF

Attribute Example Description

gene_id CUFF.1 Cufflinks gene id

transcript_id CUFF.1.1 Cufflinks transcript id

FPKM 101.267 Isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments

frac 0.7647 Reserved. Please ignore, as this attribute may be deprecated in the future

conf_lo 0.07 Lower bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo)

conf_hi 0.1102 Upper bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_hi)

cov 100.765 Estimate for the absolute depth of read coverage across the whole transcript

full_read_support yes When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.
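The attribute column can be parsed into a dictionary, after which the confidence interval follows from the two formulas in the table. The sketch below uses the example values from the table and assumes the fractional interpretation of conf_lo and conf_hi given there; it is worth checking the output of the Cufflinks version in use, since the bounds may be written as absolute FPKM values instead. Function names are hypothetical.

```python
import re
from typing import Dict, Tuple

# GTF attributes look like: key "value"; key "value"; ...
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the ninth GTF column into a {key: value} dictionary."""
    return dict(ATTRIBUTE.findall(field))


def confidence_interval(attributes: Dict[str, str]) -> Tuple[float, float]:
    """FPKM bounds, assuming conf_lo/conf_hi are fractions of the FPKM."""
    fpkm = float(attributes["FPKM"])
    lower = fpkm * (1.0 - float(attributes["conf_lo"]))
    upper = fpkm * (1.0 + float(attributes["conf_hi"]))
    return lower, upper


field = ('gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; '
         'frac "0.7647"; conf_lo "0.07"; conf_hi "0.1102"; cov "100.765"; '
         'full_read_support "yes";')
attributes = parse_attributes(field)
lower, upper = confidence_interval(attributes)
print(attributes["transcript_id"], round(lower, 3), round(upper, 3))
```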

FPKM TRACKING FILES

Col. name Example Description

1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)

2 class_code = The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present

3 nearest_ref_id NM_008866.1 The reference transcript to which the class code refers, if any

4 gene_id NM_008866 The gene_id(s) associated with the object

5 gene_short_name Lypla1 The gene_short_name(s) associated with the object

6 tss_id TSS1 The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present

7 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the object

8 length 2447 The number of base pairs in the transcript, or '-' if not a transcript/primary transcript

9 coverage 43.4279 Estimate for the absolute depth of read coverage across the object

10 FPKM 8.01089 FPKM of the object in sample

11 FPKM_lo 7.03583 The lower bound of the 95% confidence interval on the FPKM of the object in sample

12 FPKM_hi 8.98595 The upper bound of the 95% confidence interval on the FPKM of the object in sample

13 status OK OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL

SIMPLE STATISTICS

$ cd /KOGO/RNA-seq/outputs/S01
# Check highest expressed genes
$ sort -r -g -k 10 genes.fpkm_tracking | head -n 30
# Select FPKM
$ cut -f 1,10 genes.fpkm_tracking > gene_fpkm_s
$ R
> data <- read.table("gene_fpkm_s", header=TRUE)
> fpkm_s <- as.numeric(data[,2])
>
> mean(fpkm_s)
> sd(fpkm_s)
>
> fpkm_s.log10 <- log(fpkm_s+1, 10)
> bin_seq <- seq(min(fpkm_s.log10)-0.1, max(fpkm_s.log10)+0.1, by=0.1)
> hist(fpkm_s.log10, breaks=bin_seq, xlab='log10(x+1)', ylab='Number of genes', axes=TRUE)
>
> boxplot(fpkm_s.log10)

SUMMARY

• Expression Level

• Normalization

• RPKM (FPKM)

• Length Bias

• Cufflinks

• Isoforms

• maximum likelihood estimation

CUFFMERGE


CUFFMERGE

• Use to merge together several Cufflinks assemblies

• Automatically filters a number of transfrags that are probably artifacts

• The main purpose of this script is to make it easier to make an assembly GTF file suitable for use with Cuffdiff


Trapnell C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016.

USAGE

$ cuffmerge [options] <assembly_GTF_list.txt>


-o <outprefix> Write the summary stats into the text output file <outprefix>(instead of stdout)

-g/--ref-gtf geneset An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.

-p/--num-threads <int> Use this many threads to align reads. The default is 1.

-s/--ref-sequence <seq_dir>/<seq_fasta>

This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present. The merge script will pass this option to cuffcompare, which will use the sequences to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.

RUN

$ cd /KOGO/RNA-seq/outputs
$ find ./ -iname transcripts.gtf > gtf_list.txt
$ cuffmerge -p 1 -g ../ref/ens.gtf -s ../ref/chr.fa gtf_list.txt


Outputprefix -o /KOGO/RNA-seq/outputs

Geneset -g/--ref-gtf /KOGO/RNA-seq/ref/ens.gtf


Reference -s/--ref-sequence /KOGO/RNA-seq/ref/chr.fa

RUN

$ cd /KOGO/RNA-seq/outputs/merged_asm
$ less transcripts.gtf
$ less merged.gtf
$ gffread -g /KOGO/RNA-seq/ref/chr.fa -w transcripts.fa transcripts.gtf
$ head transcripts.fa
>CUFF.11.1 gene=CUFF.11
GTGCATGTAACCCAAGAAGGGTTTGGCTGGGGGCTGTGGCAGCGCCAGAGTTCTGTTCGAATCCCAATTGGGTTCTGGTCACAGATTTGGCATGGAGCAGAAGAGAGATACAGCATGGTTGAAAAGCAGTTATTGGCTAC
$ grep '>' transcripts.fa | head -n 30
>CUFF.2.1 gene=CUFF.2
>CUFF.11.1 gene=CUFF.11
>ENSGALT00000015891 gene=CUFF.11
>CUFF.12.1 gene=CUFF.12

DEG ANALYSIS


DIFFERENTIALLY EXPRESSED GENE

• Abundance of transcripts between different conditions


Zhang et al., Mol Cancer Res June 2006 4; 401

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25. Epub 2010 Mar 2.

LENGTH BIAS

Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009 Apr 16;4:14.

BIAS

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

REPLICATES

Technical Replicates

• Source: Same samples

• Purpose: The reproducibility of the results

• Issue: The differences are based only on technical issues in the measurement

Biological Replicates

• Source: Different samples

• Purpose: A quantity from different sources under the same conditions

• Issue: What is similar in your replicates and how they are different from a different set of conditions

Taylor S, Wakem M, Dijkman G, Alsarraj M, Nguyen M. A practical approach to RT-qPCR-Publishing data that conform to the MIQE guidelines. Methods. 2010 Apr;50(4):S1-5. doi: 10.1016/j.ymeth.2010.01.005.

http://wiki.answers.com/Q/What_is_defference_between_Biological_replicates_and_technical_replicates

More variance, More useful







DEG METHODS

               Cuffdiff              DEGseq            DESeq
Distribution   -                     Poisson           Negative binomial
Level          Isoform               Gene              Gene
Input          Geneset, BAM files    Raw read count    Raw read count
Replicates     Technical             Technical         Biological
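The Poisson and negative binomial models differ in how they treat variance: Poisson assumes the variance equals the mean, while the negative binomial adds a dispersion term, variance = mu + alpha * mu^2, which is why it suits biological replicates. The sketch below estimates that dispersion by a simple method of moments. The counts are made up, and this is not the estimator any of the listed tools actually uses.

```python
from statistics import mean, variance

def nb_variance(mu, dispersion):
    """Negative binomial variance: mu + dispersion * mu^2 (Poisson when dispersion is 0)."""
    return mu + dispersion * mu ** 2

def moment_dispersion(counts):
    """Method-of-moments dispersion estimate, (var - mean) / mean^2, floored at 0."""
    m, v = mean(counts), variance(counts)
    return max(0.0, (v - m) / m ** 2)

# Hypothetical counts for one gene in four biological replicates.
counts = [80, 120, 95, 145]
m, v = mean(counts), variance(counts)
alpha = moment_dispersion(counts)
print(m, v, alpha)
```

Here the sample variance is several times the mean, so a Poisson model would understate the noise between replicates.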

CUFFDIFF

• Use to find significant changes in transcript expression, splicing, and promoter use.

Usage)
$ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>

-o / --output-dir <string> Sets the name of the directory in which Cuffdiff will write all of its output. The default is "./".

-L / --labels <label1,label2,...,labelN> Specify a label for each sample, which will be included in various output files produced by Cuffdiff.

-p / --num-threads <int> Number of threads to use. The default is 1.

RUN

$ cd /KOGO/RNA-seq/outputs
$ cuffdiff -o Diff-S01-S02 -L S01,S02 -p 1 merged_asm/merged.gtf S01/accepted_hits.bam S02/accepted_hits.bam


Output -o/--output-dir /KOGO/RNA-seq/outputs/Diff-S01-S02

Label -L / --labels S01,S02

Thread -p / --num-threads 1

OUTPUT

Type / Files / Description

• Genes: genes.fpkm_tracking, genes.count_tracking, genes.read_group_tracking
  Gene [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each gene_id

• Isoforms: isoforms.fpkm_tracking, isoforms.count_tracking, isoforms.read_group_tracking
  Transcript [FPKMs, counts, read group tracking]

• CDS: cds.fpkm_tracking, cds.count_tracking, cds.read_group_tracking
  Coding sequence [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each p_id, independent of tss_id

• Primary Transcripts: tss_groups.fpkm_tracking, tss_groups.count_tracking, tss_groups.read_group_tracking
  Primary transcript [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each tss_id

FPKM TRACKING FILES

Col.  Column name      Example         Description
1     tracking_id      TCONS_00000001  A unique identifier describing the object (gene, transcript, CDS, primary transcript)
3     nearest_ref_id   NM_008866.1     The reference transcript to which the class code refers, if any
4     gene_id          NM_008866       The gene_id(s) associated with the object
5     gene_short_name  Lypla1          The gene_short_name(s) associated with the object
9     coverage         43.4279         Estimate for the absolute depth of read coverage across the object
10    q0_FPKM          8.01089         FPKM of the object in sample 0
13    q0_status        OK              OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
14    q1_FPKM          8.55155         FPKM of the object in sample 1
17    q1_status        OK              OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,3,4,5,9,10,13,14,17 genes.fpkm_tracking | head
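Selecting columns by position is fragile, so a script may prefer to read the tracking file by header name. The sketch below does that on an inline stand-in that contains only the columns listed in the table, using the q0/q1 names from the example. In a real file the per-sample column names depend on the labels given with -L (an assumption to check against the header line), and the helper name is hypothetical.

```python
import csv
import io

# Inline stand-in for genes.fpkm_tracking, limited to the columns in the table.
SAMPLE = (
    "tracking_id\tnearest_ref_id\tgene_id\tgene_short_name\tcoverage\t"
    "q0_FPKM\tq0_status\tq1_FPKM\tq1_status\n"
    "TCONS_00000001\tNM_008866.1\tNM_008866\tLypla1\t43.4279\t"
    "8.01089\tOK\t8.55155\tOK\n"
)

def read_ok_fpkm(handle, fpkm_cols=("q0_FPKM", "q1_FPKM")):
    """Return {tracking_id: [fpkm, ...]} for rows whose status is OK in every sample."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        statuses = [row[c.replace("_FPKM", "_status")] for c in fpkm_cols]
        if all(s == "OK" for s in statuses):
            result[row["tracking_id"]] = [float(row[c]) for c in fpkm_cols]
    return result

table = read_ok_fpkm(io.StringIO(SAMPLE))
print(table)
```

For a real run, replace the StringIO object with open("genes.fpkm_tracking") and pass the FPKM column names found in that file.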

OUTPUT

Type / Files / Description

• Genes: gene_exp.diff
  Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id

• Isoforms: isoform_exp.diff
  Transcript differential FPKM.

• CDS: cds_exp.diff
  Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id

• Primary Transcripts: tss_group_exp.diff
  Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id

• Splicing: splicing.diff
  How much differential splicing exists between isoforms processed from a single primary transcript

• CDS: cds.diff
  The amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples

• Promoter: promoter.diff
  The amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples

GENE_EXP.DIFF

Col.  Name               Example      Description
1     Tested id          XLOC_000001  A unique identifier
2     gene               Lypla1       The gene_name(s) or gene_id(s) being tested
6     Test status        NOTEST       OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
7     FPKMx              8.01089      FPKM of the gene in sample x
8     FPKMy              8.551545     FPKM of the gene in sample y
9     log2(FPKMy/FPKMx)  0.06531      The (base 2) log of the fold change y/x
10    test stat          0.860902     The value of the test statistic used to compute significance of the observed change in FPKM
11    p value            0.389292     The uncorrected p-value of the test statistic
12    q value            0.985216     The FDR-adjusted p-value of the test statistic
13    significant        no           Can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
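The q value is the p-value after Benjamini-Hochberg adjustment. A minimal sketch of that adjustment on made-up p-values follows. The 0.05 cut-off used to label rows is an assumption for illustration, and Cuffdiff's own implementation may differ in detail.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values

p = [0.01, 0.04, 0.03, 0.20]  # made-up p-values for four genes
q = benjamini_hochberg(p)
significant = ["yes" if value < 0.05 else "no" for value in q]
print(q, significant)
```

Three of the four raw p-values are below 0.05, but only one gene remains significant after adjustment.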

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,2,7,8,9,10,11,12,13,14 gene_exp.diff | head
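The fold-change column can be recomputed from the two FPKM columns as a sanity check. The sketch below does this for the example FPKMs in the table; the helper name and the handling of zero values are assumptions. For those two values the base-2 log is about 0.094, while the 0.06531 listed in the example equals the natural log of the same ratio, so that example row may come from output that reported natural-log fold changes.

```python
import math

def log2_fold_change(fpkm_x, fpkm_y):
    """log2(FPKMy / FPKMx); infinite or undefined when a value is 0."""
    if fpkm_x == 0 and fpkm_y == 0:
        return float("nan")
    if fpkm_x == 0:
        return float("inf")
    if fpkm_y == 0:
        return float("-inf")
    return math.log2(fpkm_y / fpkm_x)

lfc = log2_fold_change(8.01089, 8.551545)
ln_fc = math.log(8.551545 / 8.01089)
print(round(lfc, 4), round(ln_fc, 5))
```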

SIMPLE STATISTICS

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis -1
gnuplot> set xlabel 'log(FPKMs of S01)'
gnuplot> set ylabel 'log(FPKMs of S02)'
gnuplot> pl 'genes.fpkm_tracking' u (log($10)):(log($14)) w points notitle, x notitle
gnuplot> exit

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ grep yes gene_exp.diff > gene_exp.diff.yes
$ less gene_exp.diff.yes
$ grep no gene_exp.diff > gene_exp.diff.no
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis lt 2
gnuplot> set xlabel 'log2foldchange'
gnuplot> set ylabel '-log(p-value)'
gnuplot> pl 'gene_exp.diff.no' u 10:(-log($12)) lt 0 notitle,\
         'gene_exp.diff.yes' u 10:(-log($12)) lt 1 pt 6 ps 2 t 'DE'
gnuplot> exit
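Because grep matches "yes" or "no" anywhere in a line, a gene whose name contains "no" can land in the wrong file. A sketch that splits rows on the exact value of the significant column and prepares volcano-plot coordinates follows. The 14 header names are written as I recall them from Cuffdiff 2 output and should be checked against the file's header line; the two data rows are invented. The y value uses -log10(p), the common volcano-plot convention, whereas the gnuplot commands use the natural log.

```python
import csv
import io
import math

# Assumed header of gene_exp.diff (verify against your own file).
HEADER = ["test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
          "q_value", "significant"]
# Invented rows; the second gene name contains "no" but is significant.
ROWS = [
    ["XLOC_000001", "XLOC_000001", "geneA", "chr1:1-100", "S01", "S02", "OK",
     "8.0", "8.5", "0.09", "0.86", "0.39", "0.98", "no"],
    ["XLOC_000002", "XLOC_000002", "nodal", "chr1:200-300", "S01", "S02", "OK",
     "1.0", "16.0", "4.0", "5.2", "0.0001", "0.004", "yes"],
]
SAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

def volcano_points(handle):
    """Split rows on the exact 'significant' value into (log2FC, -log10 p) points."""
    points = {"yes": [], "no": []}
    for row in csv.DictReader(handle, delimiter="\t"):
        p = float(row["p_value"])
        if row["status"] != "OK" or p <= 0:
            continue  # skip untested genes and p-values that cannot be log-transformed
        points[row["significant"]].append(
            (float(row["log2(fold_change)"]), -math.log10(p)))
    return points

pts = volcano_points(io.StringIO(SAMPLE))
print(pts)
```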

DESEQ

• Differential gene expression analysis based on the negative binomial distribution

• R

• raw count

• biological replicates

• http://bioconductor.org/packages/release/bioc/html/DESeq.html

http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNAseqDE_Dec2011.pdf
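DESeq works on raw counts because it applies its own library-size normalization. A minimal sketch of the median-of-ratios idea behind its size factors is shown below, computed in log space on made-up counts. It illustrates the concept only and is not the package's code.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts holds one row per gene, one column per sample."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # genes with a zero count have no finite log geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in ratios]

# Hypothetical raw counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [35, 70], [0, 3]]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
```

Dividing each column by its size factor puts the two samples on a common scale, so the first three genes show no difference after normalization.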









HTSEQ-COUNT

• To count how many reads map to each feature

• Reads may be left uncounted for any feature for various reasons, namely:

• no_feature: reads which could not be assigned to any feature

• ambiguous: reads which could have been assigned to more than one feature and hence were not counted for any of these

• too_low_aQual: reads which were not counted due to the -a option

• not_aligned: reads in the SAM file without alignment

• alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag.

• If you have paired-end data, you have to sort the SAM file by read name first
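The counting categories above can be illustrated with a simplified version of the union mode: collect every gene that overlaps any position of the read, then count the read only if exactly one gene is found. This sketch ignores chromosome, strand, spliced alignments, read pairs, mapping quality and the NH tag, and all coordinates are hypothetical.

```python
from collections import Counter

def count_union(reads, features):
    """Assign each read (start, end) to a gene, 'union' style, on half-open intervals."""
    counts = Counter()
    for start, end in reads:
        # Union of all genes overlapping any position of the read.
        hits = {gene for gene, f_start, f_end in features
                if start < f_end and f_start < end}
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) > 1:
            counts["ambiguous"] += 1
        else:
            counts[hits.pop()] += 1
    return counts

# Hypothetical exons on one chromosome: (gene_id, start, end).
features = [("geneA", 100, 200), ("geneB", 180, 300)]
reads = [(110, 160), (170, 220), (400, 450), (250, 290)]
result = count_union(reads, features)
print(dict(result))
```

The second read spans the region where geneA and geneB overlap, so it is reported as ambiguous rather than counted for either gene.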




HTSEQ-COUNT

If you have paired-end data, you have to sort the SAM file by read name first.

Usage)
$ htseq-count [options] <sam_file> <gff_file>

<gff_file> may be an Ensembl GTF file.

Options)
-m [union, intersection-strict, intersection-nonempty]
-s, --stranded=<yes, no, or reverse>
    whether the data is from a strand-specific assay (default: yes)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools sort -n accepted_hits.bam accepted_hits.nameSorted
$ samtools view accepted_hits.nameSorted.bam | htseq-count -m union -s no - ../merged_asm/merged.gtf > accepted_hits.count
$ less accepted_hits.count
# ..... for (S02, S03, S04)

RUN

Run)
$ cd /KOGO/RNA-seq/outputs
$ mkdir DESeq
$ TBI-toolkit-make_matrix S01/hits.count 2 S02/hits.count 2 S03/hits.count 2 S04/hits.count 2 > DESeq/hits.mtx
$ cd DESeq
$ less hits.mtx
$ cp /KOGO/RNA-seq/scripts/DESeq.4samples.R .
$ R CMD BATCH DESeq.4samples.R

For 2 samples, use DESeq.2samples instead.
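DESeq expects one table of raw counts with a row per gene and a column per sample. TBI-toolkit-make_matrix is an in-house tool whose behavior the slides do not describe, so the sketch below is not a reimplementation of it. It only shows one plausible way to merge per-sample htseq-count outputs into such a table, using inline stand-in data and hypothetical function names.

```python
import io

# Summary counters that htseq-count appends after the per-feature lines.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual", "not_aligned",
           "alignment_not_unique"}

def read_counts(handle):
    """Parse 'feature<TAB>count' lines, dropping the summary counters."""
    counts = {}
    for line in handle:
        feature, value = line.rstrip("\n").split("\t")
        if feature.lstrip("_") not in SPECIAL:
            counts[feature] = int(value)
    return counts

def make_matrix(samples):
    """samples maps sample name to a counts dict; returns tab-separated text."""
    names = list(samples)
    genes = sorted(set().union(*(samples[n] for n in names)))
    lines = ["gene\t" + "\t".join(names)]
    for gene in genes:
        row = "\t".join(str(samples[n].get(gene, 0)) for n in names)
        lines.append(gene + "\t" + row)
    return "\n".join(lines) + "\n"

# Inline stand-ins for two count files (hypothetical values).
s01 = read_counts(io.StringIO("geneA\t10\ngeneB\t0\nno_feature\t5\n"))
s02 = read_counts(io.StringIO("geneA\t7\ngeneB\t3\nambiguous\t2\n"))
matrix = make_matrix({"S01": s01, "S02": s02})
print(matrix)
```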

DEG METHODS

• Cuffdiff, baySeq, DESeq, edgeR and NOISeq generated consistent results

• edgeR identified more differentially expressed genes than the other methods at the same cut-off, which might imply less control of type I error with this method

Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012 Nov 1;40(20):10084-97.

SUMMARY

• DEG

• Replicate

• Technical replicates

• Biological replicates

• Cuffdiff

• HTSeq-count

• DESeq

REPORT(CUMMERBUND)

(Workflow diagram; stages shown: FastQC, RSeQC, Duplication, Structure, Expression Level, DEG analysis, Annotation, Report)

CUMMERBUND

• An R package that is designed to aid and simplify the task of analyzing Cufflinks RNA-Seq output.

• R

• using SQLite

• cuffData.db

CUMMERBUND DB SCHEMA

RUN

Run)
$ cd /KOGO/outputs/Diff-S01-S02
$ R
> library(cummeRbund)
> cuff <- readCufflinks()
> cuff
# Global statistics and Quality Control
> disp<-dispersionPlot(genes(cuff))
> disp
# Density
> dens<-csDensity(genes(cuff))
> dens
# Boxplot
> b<-csBoxplot(genes(cuff))
> b
# Volcano
> v<-csVolcanoMatrix(genes(cuff))
> v
> v<-csVolcano(genes(cuff),"S01","S02")

# Pairwise Scatterplots
> s<-csScatter(genes(cuff),"S01","S02",smooth=T)
> s
# Geneset level plots
> data(sampleData)
> myGeneIds <- sampleIDs
> myGenes <- getGenes(cuff,myGeneIds)
> h<-csHeatmap(myGenes,cluster='both')
> h
# Barplot
> b <- expressionBarplot(myGenes)
> b
# Cluster
> ic <-csCluster(myGenes,k=4)
> icp <- csClusterPlot(ic)
> icp

ADDITIONAL ANALYSIS

VIEWER

• IGV

• Integrative Genomics Viewer

• http://www.broadinstitute.org/igv/

Run) Generate BAM index
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ ls
accepted_hits.bam.bai


GO ENRICHMENT

• GO annotation

• Using SwissProt

• Blastx

• Blast2Go

• InterProScan

• GO Enrichment

• GOseq

• Fisher's exact test

• DAVID
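The Fisher's exact test listed above asks whether a GO term appears among the DEGs more often than expected by chance. A minimal one-sided sketch using the hypergeometric tail is given below. The gene numbers are invented, and the sketch ignores the transcript length bias that GOseq is designed to correct for.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """P(X >= k) for hypergeometric X: N genes in total, K annotated, n DEGs drawn."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers: 20 genes, 5 carry the GO term, 4 DEGs, 3 of them annotated.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(p)
```

In practice one such test is run per GO term, so the resulting p-values also need a multiple-testing correction.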

CONCLUSION

• (m)RNA-seq analysis

• Reference-based method

• NGS data analysis

• RNA-seq vs. DNA-seq

• Filtering

• Low Quality

• PCR Duplication

• Mapping

• RNA mapper

• Gene Expression

• Normalization

• DEG Analysis

• RPKM

• Replicates
