Gene expression analysis of RNA-Seq data Dena Leshkowitz, Introduction to Deep-Sequencing Data Analysis 2018 Bioinformatics Unit, LSCF, WIS


Page 1: Gene expression analysis of RNA-Seq datadors.weizmann.ac.il/course/course2018/GeneExpression...Experiment Design Sequencing Options Illumina NextSeq/HiSeq Sequencing options: Length

Gene expression analysis of RNA-Seq data

Dena Leshkowitz,

Introduction to Deep-Sequencing Data Analysis 2018

Bioinformatics Unit, LSCF, WIS


RNA-Seq Potential

RNA-Seq: a revolutionary tool for transcriptomics

In theory RNA-Seq can be used to build a complete map of the transcriptome across all cell types, perturbations and states (Trapnell C. et al., Nature Methods 8, 469-477 (2011))


RNA-Seq Applications

RNA-Seq: a revolutionary tool for transcriptomics

Discover novel transcripts

Determine transcript structure, measure transcript expression and detect differentially expressed transcripts/isoforms between conditions, treatments…

Measure gene expression and detect differentially expressed genes between conditions, treatments… based on known gene structures (model organisms)


Main Topics

Introduction

Experimental design issues

Analysing Gene expression from RNA-Seq data

RNA-Seq pipeline: STAR – DESeq2


RNA-Seq Workflow

Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)

Isolate RNAs

Convert to amplified small cDNA fragments with linkers at the ends

Sequence ends: millions/billions of fragments are sequenced

Assemble or map to genome or to a transcriptome

Quantify and detect differential expression

Modified - www.med.nyu.edu/rcr/rcr/course/RNA-seq.pptx


High Throughput Genomics

DNA Microarrays

Illumina HiSeq 2500

NextSeq 500


Microarrays vs RNA-Seq

Both high throughput methods can profile the genes with similar performance

Microarrays suffer from compression (saturation) at the high end

Low expression is problematic in both platforms

Malone et al. BMC Biol. 2011; 9: 34.

[Figure axes: RNA-Seq (log2) vs. Microarray (log2)]


Microarray & RNA-Seq: Pros and Cons

Feature | Microarrays | RNA-Seq (gene, model organism / transcript and de novo assembly)
Cost | $-$$ | $/$$$
Biases | Decade of research and solutions | Understanding is evolving
Data sizes | Mb (images) | Gb (sequence data)
Dynamic range | 10^2 | 10^5
Transcript discovery, isoform identification & transcript chimeras | No | No/Yes
Genome required | Yes | Yes/No
Allele specific expression | No | No/Yes


Main Topics

Introduction

Experimental design issues

Analysing Gene expression from RNA-Seq data

RNA-Seq pipeline: STAR – DESeq2


mRNA in the RNA “World”

Most abundant RNA is rRNA – 98%

Illumina standard protocol enriches for mRNA by oligo(dT)-based affinity matrices

Sequence: rRNA capture beads (Ribo-Zero)

Size: hybridization-based rRNA depletion (Duplex-specific Nuclease (DSN))


Experiment Design: Sequencing Options

Illumina NextSeq/HiSeq sequencing options:

Length of sequence (up to 300 bases)

Paired-end (PE) or single-end (SE)

Both PE and longer reads increase the sensitivity and specificity of detecting alternative splicing and novel transcripts

Configurations shown: PE 50, PE 100, SE 50, SE 100


Experimental Design: Mammalian tissue

Liu Y. et al., 2014; ENCODE 2011 RNA-Seq

Application | Depth (reads) | Read type
Differential gene expression profiling | 10-25M | 50 base single-end
Alternative splicing | 50-100M | 100 base paired-end
Allele specific expression | 50-100M | 100 base paired-end
De novo assembly | >100M | 100 base paired-end

(5M Bulk MARS-seq)


ENCODE consortium’s Standards, Guidelines and Best Practices for RNA-Seq


Experimental Design

Assessing biological variation requires biological replicates: two per condition (2x2) is the minimum, and more are recommended

Avoid batch effects: technical sources of variation added to the samples during processing

Consult the person who will analyse the data before performing the experiment (kick-off meeting)


Main Topics

Introduction

Experimental design issues

Analysing Gene expression from RNA-Seq data

RNA-Seq pipeline: STAR – DESeq2


Abigail Yu Nature Methods 10, 1165–1166 (2013)

RNA-Seq is a straightforward process: you isolate RNA, sequence it with a high-throughput sequencer, and put it all back together. What is the problem?


What to do with all this data??

HELP !!!!

I just got sequence data…


RNA-Seq Workflow

Raw Reads

Preprocessing & quality control (cutadapt & FastQC)


Example of QC report


Pre-processing

Need to assess that the sequence data has sufficient quality

Recommendation is to use high-quality sequence data (more important in de novo assembly):

Check the amount of read duplication (an indication of too much PCR amplification)

Trim sequences if the end is of low quality or contains adapter or polyA/T

Filter low-quality reads

Avoid using samples with an insufficient number of reads

Tool: cutadapt
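The trimming and filtering steps can be illustrated with a small sketch. It is modelled on the running-sum rule described in the cutadapt documentation for 3' quality trimming, but it is a simplified illustration and not cutadapt's code; the function names, the cutoff of 20 and the minimum length are assumptions chosen for the example.

```python
def trim_low_quality_3prime(seq, quals, cutoff=20):
    """Trim a low-quality 3' end (illustrative running-sum rule).

    Walk from the 3' end, accumulating (cutoff - quality). The read is cut
    where this running sum is largest; the walk stops once the sum drops
    below zero, i.e. once good bases outweigh the bad ones seen so far.
    """
    running = 0
    best = 0
    cut = len(seq)  # default: keep the whole read
    for i in range(len(seq) - 1, -1, -1):
        running += cutoff - quals[i]
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], quals[:cut]


def passes_length_filter(seq, min_length=25):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_length


# Hypothetical read: the last three bases have low Phred scores.
read_seq = "ACGTACGTAC"
read_quals = [35, 36, 34, 33, 30, 32, 31, 12, 8, 5]

trimmed_seq, trimmed_quals = trim_low_quality_3prime(read_seq, read_quals)
print(trimmed_seq)                                      # ACGTACG
print(passes_length_filter(trimmed_seq, min_length=5))  # True
```

In practice the trimming is done by cutadapt itself; the sketch only shows why a few bad bases at the 3' end are removed while a read of uniformly good quality is left untouched.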


RNA-Seq Workflow

Raw Reads

Preprocessing & quality control (cutadapt & FastQC)

Map the reads of each sample to the genome (STAR)


Mapping Short RNA-Seq Reads

Do I align the reads to the genome or to the transcriptome?

Novel discovery


Mapping to Genome

How to align reads that span exons?

http://en.wikipedia.org/wiki/RNA-Seq


The Challenge: Identifying Novel Junctions

Reads are short and contain errors

Rarely transcribed genes have few reads spanning the junctions

We are interested in discovering novel junctions i.e. we are not only relying on annotation of known genes

Need to perform the task in a timely manner


TopHat: exon-first, two-step approach

Mapping to the genome is done with Bowtie

Extracting unmapped reads (not including low complexity)


Trapnell, C. et al. Bioinformatics 2009 25:1105-1111; doi:10.1093/bioinformatics/btp120

The junctions can be improved and extended to contain donor and acceptor sites



TopHat Outputs

Input: FASTQ reads and an indexed genome

TopHat produces: accepted_hits.bam, junctions.bed, insertions.bed, deletions.bed

Used for: transcript discovery / transcript or gene quantification / viewing in a genome browser


Junction.bed


TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.


Newer Aligners – Improving speed


Steven Salzberg - Transcriptome Assembly Computational Challenges of Next Generation Sequence Data
https://www.youtube.com/watch?v=2qGiw4MRK3c


HISAT

Steven Salzberg - Transcriptome Assembly Computational Challenges of Next Generation Sequence Data
https://www.youtube.com/watch?v=2qGiw4MRK3c

It first tries to find candidate locations across the target genome starting from the left side of the read

It identifies these locations by first mapping part of each read using the global FM index, which in most cases identifies one or a small number of candidates

HISAT then selects one of the ~48,000 local indexes for each candidate and uses it to align the remainder of the read.


STAR: Spliced Transcripts Alignment to a Reference

The STAR seed-finding phase is a sequential search for a Maximal Mappable Prefix (MMP)

In the first step, the algorithm finds the MMP starting from the first base of the read

Next, the MMP search is repeated for the unmapped portion of the read

The search is fast because the suffix array is not compressed, which comes at the cost of increased memory usage (30-50 GB)

The last stage is clustering, stitching and scoring
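The sequential MMP search can be sketched on a toy genome. This is only an illustration of the idea: STAR searches an uncompressed suffix array, whereas the sketch uses Python's naive `str.find`, reports only the first occurrence of each seed, and ignores mismatches. The sequences and function names are made up for the example.

```python
def maximal_mappable_prefix(read, genome, start=0):
    """Longest prefix of read[start:] that occurs exactly in the genome.

    Returns (length, genome_position); (0, -1) if not even one base matches.
    """
    best_len, best_pos = 0, -1
    length = 1
    while start + length <= len(read):
        pos = genome.find(read[start:start + length])
        if pos == -1:
            break
        best_len, best_pos = length, pos
        length += 1
    return best_len, best_pos


def seed_search(read, genome):
    """Repeat the MMP search on the still-unmapped part of the read."""
    seeds = []
    start = 0
    while start < len(read):
        length, pos = maximal_mappable_prefix(read, genome, start)
        if length == 0:
            start += 1  # base cannot be placed anywhere; skip it
            continue
        seeds.append((start, length, pos))  # (read offset, length, genome pos)
        start += length
    return seeds


# Toy genome: exon 1, an intron (GT...AG), exon 2. All sequences are invented.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCTTTTAG", "TGGATCCA"
genome = exon1 + intron + exon2

# A read that spans the exon 1 / exon 2 junction.
read = exon1[-5:] + exon2[:5]

seeds = seed_search(read, genome)
print(seeds)  # [(0, 5, 3), (5, 5, 24)]
```

The first seed ends at genome position 8 and the second starts at position 24, so the 16 bases in between are the candidate intron. Stitching such seeds together and scoring them corresponds to the last stage named on the slide.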


Examples of Input and Output

Mapped Reads – bam format

Sequences – fastq

Mapping to genome


Mapping Output - Alignment file SAM or binary BAM file
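As a rough illustration of what an alignment record holds, the sketch below reads the position and CIGAR string from one SAM line and reports the splice junctions, which the SAM format encodes as `N` (skipped region) operations. The record is invented for the example, and the helper name `splice_junctions` is hypothetical; real pipelines would normally use samtools or a library such as pysam.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end), 1-based inclusive, for each N op."""
    junctions = []
    ref = pos  # current position on the reference (SAM POS is 1-based)
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions


# Invented SAM record: 30 bases aligned, a 200-base intron, 20 bases aligned.
sam_line = "\t".join([
    "read1", "0", "chr1", "101", "255", "30M200N20M", "*", "0", "0",
    "A" * 50, "I" * 50,
])

fields = sam_line.split("\t")
pos, cigar = int(fields[3]), fields[5]
junctions = splice_junctions(pos, cigar)
print(junctions)  # [(131, 330)]
```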


Visualization of TopHat outputs in a Genome Browser (IGV)

read mapped to exon

read mapped to junction


RNA-Seq Workflow

Raw Reads

Preprocessing & quality control (cutadapt & FastQC)

Map the reads of each sample to the genome (STAR)

Quantify gene counts per sample (STAR)


Basic Quantification Step

Count the number of reads aligned to each gene (there is no need to determine from which transcript the read was derived)

Trapnell et al. Nat Biotechnol. 2010 May;28(5):511-5.


HTSeq or STAR

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

A gene is quantified by counting the number of fragments/reads which align to all its exons

Discard a read if it is non-uniquely mapped to a gene
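The counting rule can be sketched as follows. This is a deliberately simplified illustration and not the HTSeq or STAR implementation: it assumes a single chromosome, unstranded reads, inclusive coordinates, and a hypothetical input format of `(start, end, number_of_alignments)` per read.

```python
def count_reads_per_gene(reads, exons_by_gene):
    """Count reads that overlap the exons of exactly one gene.

    reads: list of (start, end, n_alignments) tuples.
    exons_by_gene: dict mapping gene name to a list of (start, end) exons.
    """
    counts = {gene: 0 for gene in exons_by_gene}
    for read_start, read_end, n_alignments in reads:
        if n_alignments > 1:
            continue  # read maps to several places in the genome: discard
        hit_genes = {
            gene
            for gene, exons in exons_by_gene.items()
            if any(read_start <= exon_end and exon_start <= read_end
                   for exon_start, exon_end in exons)
        }
        if len(hit_genes) == 1:  # skip reads hitting no gene or several genes
            counts[hit_genes.pop()] += 1
    return counts


# Invented annotation: gene_1 has two exons, gene_2 overlaps its second exon.
exons_by_gene = {
    "gene_1": [(100, 200), (300, 400)],
    "gene_2": [(380, 500)],
}
reads = [
    (120, 170, 1),  # gene_1, exon 1
    (310, 360, 1),  # gene_1, exon 2
    (450, 499, 1),  # gene_2
    (385, 395, 1),  # overlaps both genes: ambiguous, discarded
    (130, 180, 3),  # multi-mapped: discarded
    (600, 650, 1),  # no gene
]

counts = count_reads_per_gene(reads, exons_by_gene)
print(counts)  # {'gene_1': 2, 'gene_2': 1}
```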


STAR/HTSeq Result: a count matrix

          sample_1  sample_2  sample_3  sample_4
gene_1          15         9        11        18
gene_2          19        21        21        40
gene_3         106       114       153       207
gene_4         569       565       756       992
gene_5        1029      1260      1559      1968
gene_6        5049      5897      7537     10029
SUM            10M       50M       30M       20M

Need to account for the differences in sequence amount between the samples


RNA-Seq Workflow

Raw Reads

Preprocessing & quality control (cutadapt & FastQC)

Map the reads of each sample to the genome (STAR)

Quantify gene counts per sample (STAR)

Detect differentially expressed genes (DESeq2)


DESeq2 Normalization: median-of-ratios method

Create a “virtual reference sample” by taking, for each gene, the geometric mean of counts over all samples

Normalize each sample to this reference, to get one scaling factor (“size factor”) per sample.


DESeq2 Normalization

Need to normalize the amount of sequence data between the samples

1. The geometric mean is calculated for each gene across all samples.

2. The count for a gene in each sample is then divided by this mean.

3. The median of these ratios in a sample is the size factor for that sample.

This procedure corrects for library size and RNA composition bias, which can arise for example when only a small number of genes are very highly expressed in one experiment condition but not in the other.
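The three steps can be written compactly as a formula. The notation is an assumption introduced here, not taken from the slides: k_ij is the count of gene i in sample j, m is the number of samples, g_i is the geometric mean of gene i, and s_j is the size factor of sample j. Genes with a zero count in any sample have g_i = 0 and are left out of the median.

```latex
\[
  g_i = \Bigl(\prod_{v=1}^{m} k_{iv}\Bigr)^{1/m},
  \qquad
  s_j = \operatorname*{median}_{i \,:\, g_i > 0} \; \frac{k_{ij}}{g_i},
  \qquad
  k^{\mathrm{norm}}_{ij} = \frac{k_{ij}}{s_j}
\]
```

Because the median is taken over all genes, a handful of very highly expressed genes barely moves s_j, which is why the method is robust to RNA composition bias.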


Example: DESeq Normalization

          sample_1  sample_2  sample_3  sample_4  geometric mean  ratio sample_1
gene_1          15         9        11        18           12.79            1.17
gene_2          19        21        21        40           24.06            0.79
gene_3         106       114       153       207          139.87            0.76
gene_4         569       565       756       992          700.73            0.81
gene_5        1029      1260      1559      1968         1412.26            0.73
gene_6        5049      5897      7537     10029         6887.68            0.73
Median                                                                      0.77

The scaling factor for sample 1 is 0.77
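The calculation in this example can be reproduced with a few lines of plain Python. This is a minimal sketch of the median-of-ratios idea using only the standard library, not DESeq2's own code; the function name `size_factors` is chosen for the example, and the counts are the six genes from the table.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows, one per gene, each a list of per-sample counts.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Geometric mean per gene, computed on the log scale for stability.
    geo_means = [
        math.exp(sum(math.log(c) for c in row) / n_samples) for row in usable
    ]
    factors = []
    for j in range(n_samples):
        ratios = [row[j] / gm for row, gm in zip(usable, geo_means)]
        factors.append(statistics.median(ratios))
    return factors


counts = [
    [15, 9, 11, 18],
    [19, 21, 21, 40],
    [106, 114, 153, 207],
    [569, 565, 756, 992],
    [1029, 1260, 1559, 1968],
    [5049, 5897, 7537, 10029],
]

factors = size_factors(counts)
print(round(factors[0], 2))  # 0.77, the size factor for sample_1

# Normalized counts: divide each sample's counts by its size factor.
normalized = [[c / s for c, s in zip(row, factors)] for row in counts]
```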


Determining Differentially Expressed Genes

http://www.molgen.mpg.de/1242892/rnaseq.pdf

Having many replicates makes it possible to learn about the biological variation within the conditions tested.

Aim: find genes whose difference between the groups is significant, i.e. larger than the “noise” (the variation within the groups)

Discover genes showing different average expression levels across two groups


RNA-Seq Noise

Suppose we sequence the same library twice to the same depth. Will we get the same gene counts?

No

Poisson noise: the variance in counts that persists even if everything is exactly the same (in our case, we ran the same library)

Poisson distribution is sometimes called the law of small numbers because it is the probability distribution of the number of occurrences of an event that happens rarely but has very many opportunities to happen.


Poisson Distribution

Assuming the gene counts in an RNA-Seq experiment follow a Poisson distribution, we would expect the average gene count and the variance of the counts to be equal.

var = µ

https://youtu.be/fxtB8c3u6l8

https://youtu.be/HK7WKsL3c2w
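The equality of mean and variance can be checked numerically from the Poisson probability mass function. The sketch below is a small standard-library illustration; the mean of 10 and the truncation point `k_max` are arbitrary choices for the example.

```python
import math


def poisson_pmf(k, mu):
    """P(K = k) for a Poisson distribution with mean mu (log-space form)."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def mean_and_variance(mu, k_max=200):
    """Mean and variance computed by summing over k = 0..k_max."""
    probs = [poisson_pmf(k, mu) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mean, var = mean_and_variance(10.0)
print(round(mean, 6), round(var, 6))  # 10.0 10.0
```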


Biological Variation

When we sequence biological replicate samples the concentration of a given gene will vary around a mean value with a certain standard deviation

This standard deviation cannot be calculated; it has to be estimated from the data

var = µ + c µ²

Poisson noise: µ; biological noise: c µ²


Negative Binomial

Expected by Poisson

Negative binomial distribution

In RNA-Seq analysis the negative binomial distribution is used as an alternative to the Poisson since it takes into account variance that exceeds the gene mean

The count data is used to estimate the variance

Orange line: the fitted observed curve for the variance
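To see how much the negative binomial variance can exceed the mean, the sketch below uses the mean-dispersion form var = µ + c·µ², which is the parameterization DESeq2 uses with c as the dispersion. The values µ = 100 and c = 0.1 are arbitrary, and the function names are chosen for the example; the variance is obtained by summing over the probability mass function rather than from the formula, so the two can be compared.

```python
import math


def nb_pmf(k, mu, c):
    """Negative binomial P(K = k) with mean mu and dispersion c."""
    r = 1.0 / c          # "size" parameter
    p = r / (r + mu)     # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def nb_mean_and_variance(mu, c, k_max=5000):
    """Mean and variance computed by summing over k = 0..k_max."""
    probs = [nb_pmf(k, mu, c) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


mu, c = 100.0, 0.1
mean, var = nb_mean_and_variance(mu, c)
print(round(mean, 2), round(var, 2))  # 100.0 1100.0
print(mu + c * mu ** 2)               # 1100.0, versus 100.0 under Poisson
```

With these values the variance is eleven times the mean, which a Poisson model would badly underestimate.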


Detecting Differentially Expressed Genes

DESeq2 tests for differential expression by the use of negative binomial generalized linear models

The output consists of:

Log2 fold change (treatment/control)

p-value: the probability of observing a difference between treatment and control at least as large as the one measured when there is no true treatment effect

Adjusted p-value: multiple-testing correction, needed because in an RNA-Seq study all genes are tested simultaneously
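The slide does not name the correction method; DESeq2's default is the Benjamini-Hochberg procedure, so the sketch below assumes that method. It is a plain-Python illustration with invented p-values, not DESeq2's code, and it leaves out DESeq2's independent filtering step.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by p-value
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone: each one is the minimum of p * n / rank over higher ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # invented p-values for four genes
padj = benjamini_hochberg(pvals)
print([round(p, 4) for p in padj])  # [0.02, 0.04, 0.04, 0.02]
```

Genes are then typically called differentially expressed when the adjusted p-value falls below a chosen threshold; the threshold itself is an analysis choice that the slides do not specify.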


RNA-Seq Workflow

Raw Reads

Preprocessing & quality control (cutadapt & FastQC)

Map the reads of each sample to the genome (STAR)

Quantify gene counts per sample (STAR)

Detect differentially expressed genes (DESeq2)


Main Topics

Introduction

Experimental design issues

Analysing RNA-Seq data

RNA-Seq pipeline: STAR – DESeq2

In exercises 2 & 5 we will run this pipeline


Exercise 2

In this current study we analyzed the dynamics of gene expression in A. thaliana meristem during the transition to flowering using RNA sequencing (RNA-seq).


Submitting jobs to the WEXAC cluster


Examples

# Index the coordinate-sorted BAM file with samtools, submitted as a cluster job
bsub -q bio-guest -R "rusage[mem=10000] span[hosts=1]" -J sam1660397 -e ./samtools1660397.err -o ./samtools1660397.out "samtools index SRR1660397Aligned.sortedByCoord.out.bam"

# Submit a STAR job; the command to run is read from a file
bsub -q bio-guest -R "rusage[mem=2000] span[hosts=1]" -o ~/RNA-Seq-Arabodopsis-map/star_job_SRR1660397.txt -e ~/RNA-Seq-Arabodopsis-map/star_error_SRR1660397.txt -J st1 -n 10 < ~/course_2018/Arabidopsis_input/commands/star_command_SRR1660397.txt

# STAR command: map the reads, keep only uniquely mapped reads, and count reads per gene
STAR --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --twopassMode Basic --runThreadN 10 --sjdbGTFfile /shareDB/iGenomes/Arabidopsis_thaliana/NCBI/TAIR10/Annotation/Genes/genes.gtf --quantMode GeneCounts --outFileNamePrefix SRR1660397 --genomeDir /shareDB/iGenomes/Arabidopsis_thaliana/NCBI/TAIR10/Sequence/STAR_index/ --readFilesIn ~/course_2018/Arabidopsis_input/data/SRR1660397_day8/SRR1660397_day8_R1.fastq > log_SRR1660397.txt 2>&1


References

1. Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nat Protoc. 2013;8(9):1765-86. doi: 10.1038/nprot.2013.099. PMID: 23975260. (DESeq2)

2. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105-11. doi: 10.1093/bioinformatics/btp120. PMID: 19289445; PMCID: PMC2672628.

3. Anders S, Pyl PT, Huber W. HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166-9. doi: 10.1093/bioinformatics/btu638. PMID: 25260700.

4. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi: 10.1093/bioinformatics/bts635.


THE END

Thanks

Questions??


Experiment Design: Strand-specific protocol

Nature Methods 7, 709–715 (2010)

Why is strand information important?

How is the stranded library made?


Expression Values: FPKM (RPKM)

Fragments (Reads) Per Kilobase of exon per Million mapped fragments

Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.

C= the number of fragments mapped onto the gene's exons

N= total number of (mapped) fragments in the experiment

L= the length of the transcript (sum of exons)

\[ \mathrm{FPKM}_i = \frac{C_i}{N \, L_i} \times 10^{3} \times 10^{6} \]
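With C, N and L as defined above (L in bases), the FPKM value can be computed directly. The function name and the numbers in the example are made up for illustration.

```python
def fpkm(fragments_on_gene, total_mapped_fragments, transcript_length_bp):
    """Fragments per kilobase of exon per million mapped fragments.

    The factor 1e3 converts the length from bases to kilobases, and the
    factor 1e6 converts the total from fragments to millions of fragments.
    """
    return (fragments_on_gene * 1e3 * 1e6
            / (total_mapped_fragments * transcript_length_bp))


# Invented example: 1,000 fragments on a 2 kb gene, 20 million mapped in total.
value = fpkm(1000, 20_000_000, 2000)
print(value)  # 25.0
```

Unlike the DESeq2 size factors, this value also divides by transcript length, so it is meant for comparing expression levels between genes within a sample rather than as input for differential expression testing.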