rna-seq data analysis project qi liu. from reads to differential expression raw sequence data fastq...

29
RNA-seq data analysis Project QI LIU

Upload: percival-scott-hart

Post on 18-Dec-2015

220 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

RNA-seq data analysis Project

QI LIU

Page 2: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

From reads to differential expressionRaw Sequence Data

FASTQ Files

Unspliced MappingBWA, Bowtie Mapped

ReadsSAM/BAM Files

Expression Quantification

DEseq, edgeR, etc

Functional Interpretation

QC by FastQC

QC by RNA-SeQC

Spliced mapping

TopHat, MapSplice

Reads Mapping

Summarize read counts

FPKM/RPKMCufflinks

Cuffdiff

DE testing

Function enrichment

Infer networks

Integrate with other

data

Biological Insights & hypothesis

List of DE

Page 3: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

ToolsRead alignment: TopHat2 (http://ccb.jhu.edu/software/tophat/index.shtml) Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) Tools for manipulating SAM files: SAMTOOLS (http://samtools.sourceforge.net/samtools.shtml)

Counting reads in gene level:htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)

Downstream analysis:The R statistical computing environment (http://www.r-project.org/)R package: edgeR (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)

Visualization: IGV (http://www.broadinstitute.org/software/igv/download)

Page 4: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

/home/igptest/path.txtsetpkgs -a python

BOWTIE2=/scratch/liuq6/software/bowtie2TOPHAT2=/scratch/liuq6/software/tophat2

export PYTHONPATH=/scratch/liuq6/software/htseqlib:$PYTHONPATHexport PATH=/home/igptest/exomesequencing/software/:$BOWTIE2:$TOPHAT2:$PATH

transcriptomeindex=/scratch/liuq6/reference/gtfindex/Homo_sapiens.GRCh37.75

genomeindex=/scratch/liuq6/reference/bowtie2_index/hg19

gtffile=/scratch/liuq6/reference/Homo_sapiens.GRCh37.75_chr1-22-X-Y-M.gtf

reference=/home/igptest/exomesequencing/reference/hg19/hg19_chr.fa

VarScan=/home/igptest/exomesequencing/software/VarScan.v2.2.10.jar

Page 5: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

• The PATH is an environment variable. It is a colon delimited list of directories that your shell searches through when you enter a command. All executables are kept in different directories on the Linux and Unix like operating systems.

• PYTHONPATH is an environment variable which you can set to add additional directories where python will look for modules and packages

Environment Variables

Page 6: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Two ways

1. source path.txt

source is a bash shell built-in command that executes the content of the file passed as argument.

2. put the scripts in .bashrc .bashrc is a file from which bash reads and executes command automatically when you log in. /home/yourusername/.bashrc

Page 7: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

Before and after you load all the environmentecho $SHELL (the name of the current shell)env (the existing environment variables)echo $HOMEecho $PATHecho $gtffile

Page 8: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Reads alignment

Page 9: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

1. un-spliced mapping to

transcriptome and then genome

(bowtie)

2. “Contiguously unmappable" reads are used to predict possible splice junctions.

TopHat

Page 10: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Tophat2

• tophat2 [options]* <genome_index_base> PE_reads_1.fq.gz PE_reads_2.fq.gz

• Options:

-o/--output-dir <string> Sets the name of the directory in which TopHat will

write all of its output. The default is "./tophat_out".

-G/--GTF <GTF/GFF3 file> Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file

--transcriptome-index <dir/prefix> use the previously built transcriptome index files

The basename of the genome index to be searched

Page 11: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

TopHat output

• accepted_hits.bam. A list of read alignments in SAM format.

• junctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.

• insertions.bed and deletions.bed. UCSC BED tracks of insertions and deletions reported by TopHat.

Page 12: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

• tophat2 --transcriptome-index=$transcriptomeindex -o ctrl1 $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq

Add --transcriptome-only to save time• tophat2 --transcriptome-

index=$transcriptomeindex -o ctrl1 --transcriptome-only $genomeindex /scratch/igptest/RNAseq/data/ctrl1.fastq

Page 13: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

SAMTOOLS

• samtools view • samtools sort• samtools index

Page 14: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

• View the header section of bam file samtools view -h accepted_hits.bam

• Extract the alignments with MAPQ>10 samtools view -q 10 accepted_hits.bam

Page 15: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

• Extract the alignments mapped to reverse strand

samtools view -f 16 accepted_hits.bam

Page 16: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

• Index the alignment file (.bai file) samtools index accepted_hits.bam

Use samtools to get all the reads mapped to tp53 and myc. How many reads? Any junction reads?– TP53, 17:7,571,720-7,590,868– MYC, 8:128,748,315-128,753,680

Page 17: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Homework

• Install IGV https://www.broadinstitute.org/igv/download• Use IGV to view the bam file (need the bam

and index files)• Take a look at the reads mapped to genes MYC

and TP53

Page 18: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Homework

• Align ER1.fastq • How many reads mapped to MYC and TP53?

Are there any junction reads? Visualize the results in IGV

• Extract the alignments mapped to MYC and generate a new BAM file

• Extract the alignments with MAPQ>10 and generate a new BAM file and index the file

Page 19: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Ht-seq Given a file with aligned sequencing reads and a list of genomic features, a common task is

to count how many reads map to each feature.

Deal with reads that overlap more than one features

Page 20: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Ht-seq• Options

Page 21: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Ht-seq

• htseq-count [options] <alignment_file> <gff_file>

Output: a table with counts for each feature, followed by the special counters, which count reads that were not counted for any feature for various reasons. The names of the special counters all start with a double underscore, to facilitate filtering.

__no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above was empty).

__ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence were not counted for any of these (set Shad mroe than one element).

__too_low_aQual: reads (or read pairs) which were skipped due to the –a option __not_aligned: reads (or read pairs) in the SAM file without alignment __alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads

are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times, unless they getv filtered out by due to the -a option.)

Page 22: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Practice

• htseq-count -s no -f bam -i gene_name ctrl1/accepted_hits.bam $gtffile > ctrl1.count

Page 23: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Homework

• Summarize the read counts to the gene level for ER1

• Summarize the read counts to the exon level for ctrl1

Page 24: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

edgeR

• Install R in your laptop• Install edgeR package

source("http://bioconductor.org/biocLite.R") biocLite("edgeR")

Page 25: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

edgeR

• Raw counts• Small sample size• Complicated experimental design

Page 26: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Negative binomial distribution

• Technical replicates –Poisson distribution var(ygi)=gi

• Biological variances var(ygi)=gi+

Page 27: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Normalization

• TMM (a trimmed mean of M-values)

minimize the log-fold changes between the samples for most genes.

Page 28: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Normalization• RNA-seq measures the relative abundance of

each gene in each RNA sample, but RNA output per cell

FPKM: g2-g10 TMM: g1

N Tg1 1 1000g2 1 1g3 1 1g4 1 1g5 1 1g6 1 1g7 1 1g8 1 1g9 1 1g10 1 1

Page 29: RNA-seq data analysis Project QI LIU. From reads to differential expression Raw Sequence Data FASTQ Files Unspliced Mapping BWA, Bowtie Mapped Reads SAM/BAM

Differential analysisRead Data

Normalization

Estimate the dispersion

Differential expression

preprocess

visualize

MDS

PCA

Heatmap

calcNormFactors

DGEList

estimateGLMCommonDisp

estimateGLMTrendedDisp

estimateGLMTagwiseDisp

glmFit

glmLRT