Part 4 of RNA-seq for DE analysis: extracting the count table and QC

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

RNA-seq for DE analysis training

Generating the count table and validating assumptions

Joachim Jacob, 22 and 24 April 2014

http://www.bits.vib.be/


Overview

http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html


Bioinformatics analysis will take most of your time

Quality control (QC) of raw reads

Preprocessing: filtering of reads and read parts, to help our goal of differential detection.

QC of preprocessing

Mapping to a reference genome (alternative: to a transcriptome)

QC of the mapping

Count table extraction

QC of the count table

DE test

Biological insight


Goal

We need to summarize the read counts per gene from a mapping result.

The outcome is a raw count table on which we can perform some QC, to validate the experimental setup.

This table is used by the differential expression algorithm to detect DE genes.


Status

[Figure: status so far; values shown: 20M, 25M, 15M; ~16%, ~5%, ~10%]


Tools to count 'features'

● 'Features' = type of annotation on a genome = exons in our case.

● Different tools exist to accomplish this

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting


The challenge in counting

'Exons' are the type of features used here.

They are summarized per 'gene'.

Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads

No normalization yet! Just pure counts, aka 'raw counts'.

[Figure: mapping result of RNA-seq data, with reads labelled 'Overlaps no feature' and 'Alt splicing']
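The concept above can be sketched in a few lines of Python. This is only an illustration: the per-exon values are invented so that they add up to the gene totals on the slide (215 and 180), and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-exon read counts, keyed by (gene_id, exon_number).
# Only the gene totals (215 and 180) come from the slide; the split
# over exons is made up for this example.
exon_counts = {
    ("GeneA", 1): 60, ("GeneA", 2): 75, ("GeneA", 3): 50, ("GeneA", 4): 30,
    ("GeneB", 1): 90, ("GeneB", 2): 40, ("GeneB", 3): 50,
}


def summarize_per_gene(exon_counts):
    """Sum the raw exon counts of each gene; no normalization is applied."""
    gene_counts = defaultdict(int)
    for (gene_id, _exon_number), reads in exon_counts.items():
        gene_counts[gene_id] += reads
    return dict(gene_counts)


raw_counts = summarize_per_gene(exon_counts)
print(raw_counts)
```

The result is a raw count per gene, which is exactly what goes into one column of the count table.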


Dealing with ambiguity

● Genes often consist of different isoforms. These contain different exons, some shared between isoforms, some not. Furthermore...

● Some reads do not overlap a feature, but fall in introns. Should they be taken into account?

● Some reads align to more than one gene. Transcripts can overlap, perhaps on different strands. (PE reads and strandedness can partially resolve this.)

● Some reads only partially overlap a feature and do not follow the known annotation.


The tool HTSeq-count has 3 modes

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

HTSeq-count recommends the 'union' mode. Depending on your genome, however, you may opt for the 'intersection-strict' mode. Galaxy allows experimenting!
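The three modes ('union', 'intersection-strict' and 'intersection-nonempty') can be illustrated with a small sketch. It is not HTSeq's own code: the function name is hypothetical, each read is represented as a list of sets (one set of overlapping gene IDs per aligned position), and the plain labels "no_feature" and "ambiguous" stand in for the special counters that htseq-count reports.

```python
def assign_read(feature_sets, mode="union"):
    """Decide which gene a read is counted for.

    feature_sets holds one set per aligned position of the read, each
    containing the IDs of the genes that overlap that position.
    """
    if mode == "union":
        candidates = set().union(*feature_sets)
    elif mode == "intersection-strict":
        candidates = set.intersection(*feature_sets) if feature_sets else set()
    elif mode == "intersection-nonempty":
        nonempty = [s for s in feature_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that starts in gene A and runs on into an intron.
partial = [{"A"}, {"A"}, set()]
# A read that lies in gene A, partly where gene B overlaps as well.
shared = [{"A"}, {"A", "B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(partial, mode), assign_read(shared, mode))
```

The two example reads show why the choice matters: 'union' counts the partially overlapping read but calls the shared one ambiguous, while 'intersection-strict' does the opposite.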


Indicate the SE or PE nature of your data (note: 'mate-pair' is not appropriate naming here).

The annotation file with the coordinates of the features to be counted.

Mode

Check with mapping QC (see earlier).

For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file.

Reverse stranded: check with mapping visualization.
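Outside Galaxy, the same settings map onto options of the htseq-count command line. The sketch below only assembles such a command as a list of arguments; the helper name and the file names are hypothetical, and the defaults shown are assumptions that must be adapted to your own library (in particular the strandedness).

```python
def build_htseq_count_command(alignment_file, gtf_file, stranded="no",
                              mode="union", feature_type="exon",
                              id_attribute="gene_id", sort_order="name",
                              file_format="bam"):
    """Assemble an htseq-count call for one sample (hypothetical helper)."""
    return [
        "htseq-count",
        "--format", file_format,    # sam or bam
        "--order", sort_order,      # name or pos (how the alignments are sorted)
        "--stranded", stranded,     # yes, no or reverse
        "--type", feature_type,     # feature type to count: exons
        "--idattr", id_attribute,   # GTF attribute used to group exons per gene
        "--mode", mode,             # union, intersection-strict, intersection-nonempty
        alignment_file,
        gtf_file,
    ]


# Hypothetical file names, for illustration only.
command = build_htseq_count_command("sample1.bam", "annotation.gtf",
                                    stranded="reverse")
print(" ".join(command))
```

The list can be passed to subprocess.run once htseq-count is installed; the counts are written to standard output, one gene per line.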


Resulting count table column

One sample!

Merging to create experiment count table

Tool 'Column join'
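What a column join does can be sketched as follows. The function name, sample names and counts are hypothetical, and filling in 0 for a gene that is missing from a sample is an assumption of this sketch.

```python
def join_count_columns(per_sample_counts):
    """Merge per-sample {gene_id: count} columns into one table.

    Returns a header row and one row per gene, genes sorted by ID.
    A gene absent from a sample gets 0 (assumption of this sketch).
    """
    samples = list(per_sample_counts)
    genes = sorted(set().union(*(c.keys() for c in per_sample_counts.values())))
    header = ["gene_id"] + samples
    rows = [[gene] + [per_sample_counts[s].get(gene, 0) for s in samples]
            for gene in genes]
    return header, rows


# Hypothetical single-sample count columns.
per_sample_counts = {
    "sample1": {"GeneA": 215, "GeneB": 180},
    "sample2": {"GeneA": 198, "GeneB": 240, "GeneC": 3},
}

header, rows = join_count_columns(per_sample_counts)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```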


Resulting count table


Quality control of count table

In the end, we used about 70% of the reads. Check for your experiment.

[Figure panels: relative numbers; absolute numbers]

Quality control of count table

2 types of QC:

● General metrics

● Sample-specific quality control

QC: general metrics

● General numbers: total number of counted reads


QC: general metrics

Which genes are most highly present? Which fractions do they occupy?

42 genes (0.63%) of the 6665 genes take 25% of all counts.

This graph can be constructed from the count table.

[Table: Gene, Counts; the top entries include TEF1alpha, putative ribo prot,...]
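Such a metric can be derived from one column of the count table, for example as sketched below. The function name and the toy counts are hypothetical.

```python
def genes_covering_fraction(gene_counts, fraction=0.25):
    """Return the most abundant genes that together reach `fraction` of all counts."""
    total = sum(gene_counts.values())
    running = 0
    top_genes = []
    ranked = sorted(gene_counts.items(), key=lambda item: item[1], reverse=True)
    for gene, count in ranked:
        if running >= fraction * total:
            break
        top_genes.append(gene)
        running += count
    return top_genes


# Toy counts for one sample (invented numbers).
toy_counts = {"g1": 500, "g2": 300, "g3": 100, "g4": 60, "g5": 40}

top = genes_covering_fraction(toy_counts, fraction=0.25)
share_of_genes = 100 * len(top) / len(toy_counts)
print(top, f"{share_of_genes:.0f}% of the genes")
```

Ranking the genes this way also gives the list of most highly present genes directly.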


QC: general metrics

● We can plot the counts per sample: filter out the zero counts and apply a log2 transformation.

[Figure: density of log2(count) for one sample]

The bulk of the genes have counts in the hundreds.

A few are extremely highly expressed.

A minority have extremely low counts.
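The filtering and transformation step can be sketched as below. Instead of a smooth density, this example simply bins the genes by the integer part of log2(count); the function name and the counts are hypothetical.

```python
import math
from collections import Counter


def log2_histogram(counts):
    """Drop zero counts, take log2, and count genes per integer log2 bin."""
    nonzero = [count for count in counts if count > 0]
    bins = Counter(math.floor(math.log2(count)) for count in nonzero)
    return dict(sorted(bins.items()))


# Invented counts for one sample: two zeros, a low count, a bulk in the
# hundreds and one extremely highly expressed gene.
sample_counts = [0, 0, 3, 120, 250, 300, 410, 500, 900, 65000]

histogram = log2_histogram(sample_counts)
for log2_bin, n_genes in histogram.items():
    print(f"log2 bin {log2_bin:2d}: {'#' * n_genes}")
```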


QC: log2 density graph

● We can do this for all samples and merge the graphs.

Strange deviation here.

All samples show nice overlap; the peaks are similar.


QC: log2 merging samples

Here, we take one sample and plot its log2 density graph, then add the counts of another sample and plot again, and so on until all samples are merged.

A horizontal shift and a vertical shift of the graph lead to different conclusions.


QC: rarefaction curve

Code:
library(ggplot2)
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads", y = "number of genes with counts > 0")

What is the total number of detected features, and how does the feature space grow with each additional sample?

There should be saturation, but here there is none.
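The points of such a curve can be computed by merging the samples one by one. The sketch below is a Python illustration with invented counts; the function name is hypothetical, and the assumption is that each point pairs the total number of reads so far with the number of genes that have a count above zero.

```python
def rarefaction_points(samples):
    """Merge samples cumulatively and record (total reads, detected genes).

    samples is a list of {gene_id: count} dictionaries, in the order in
    which they are added.
    """
    merged = {}
    points = []
    for sample in samples:
        for gene, count in sample.items():
            merged[gene] = merged.get(gene, 0) + count
        total_reads = sum(merged.values())
        detected_genes = sum(1 for count in merged.values() if count > 0)
        points.append((total_reads, detected_genes))
    return points


# Invented counts for three samples over four genes.
sample_a = {"g1": 10, "g2": 0, "g3": 5, "g4": 0}
sample_b = {"g1": 8, "g2": 2, "g3": 0, "g4": 0}
sample_c = {"g1": 12, "g2": 1, "g3": 4, "g4": 0}

points = rarefaction_points([sample_a, sample_b, sample_c])
print(points)
```

In this toy example the number of detected genes stops increasing after the second sample, which is what saturation looks like.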


QC: rarefaction curve

Saturation: OK!

[Figure x-axis: Sample A; Sample A + sample B; Sample A + sample B + sample C; etc.]


Alternative to log2 transformations

● Log2 transformations suffer from bloated variance.
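A small simulation can make this variance problem visible. It is a simplified sketch, not the analysis behind the slide: counts are drawn from a Poisson distribution (real RNA-seq counts are more variable), a pseudocount of 1 is added before taking log2, and the means 4 and 400 are arbitrary choices.

```python
import math
import random
import statistics


def poisson(rng, mean):
    """Draw one Poisson count (Knuth's method, fine for modest means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sd_of_log2(rng, mean, n=2000):
    """Standard deviation of log2(count + 1) over n simulated replicates."""
    values = [math.log2(poisson(rng, mean) + 1) for _ in range(n)]
    return statistics.stdev(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
sd_low = sd_of_log2(rng, mean=4)     # a lowly expressed gene
sd_high = sd_of_log2(rng, mean=400)  # a highly expressed gene
print(f"sd of log2 counts, mean 4: {sd_low:.2f}; mean 400: {sd_high:.2f}")
```

On the log2 scale the lowly expressed gene varies far more than the highly expressed one, so low-count genes dominate any distance computed on log2 values.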

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

[Figure panels: VST, rLog, Log2]

Not normalizations!

http://www.biomedcentral.com/1471-2105/14/91




QC: count transformations

● Other transformations do not have this behavior, especially VST.







Alternative to log2 transformations

Regularized log (rLog) and 'Variance Stabilizing Transformation' (VST) as alternatives to log2.


[Figure panels: rLog, VST]



Beyond simple metrics QC

● We can also include condition information, to interpret our QC better. For this, we need to gather sample information.

● Make a separate file in which the sample info is provided (metadata).


QC with condition information

What do the differences in counts between samples depend on? Here, counts depend on the treatment and the strain. The factors named here must match the sample descriptions file.
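A sample descriptions file and a consistency check against the count table could look like the sketch below. The column names (sample, strain, treatment), the sample names and the tab-separated layout are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample descriptions file, tab-separated.
SAMPLE_SHEET = (
    "sample\tstrain\ttreatment\n"
    "WT_untreated_1\tWT\tuntreated\n"
    "WT_treated_1\tWT\ttreated\n"
    "mut_untreated_1\tmut\tuntreated\n"
)


def read_sample_sheet(text):
    """Parse the sample descriptions into {sample name: row}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["sample"]: row for row in reader}


def check_samples(count_header, sample_info):
    """Report samples lacking metadata and metadata lacking counts."""
    count_samples = set(count_header[1:])  # first column holds the gene IDs
    described = set(sample_info)
    return sorted(count_samples - described), sorted(described - count_samples)


sample_info = read_sample_sheet(SAMPLE_SHEET)
# Header of a hypothetical count table; one sample has no description.
count_header = ["gene_id", "WT_untreated_1", "WT_treated_1", "mut_treated_1"]

missing_metadata, missing_counts = check_samples(count_header, sample_info)
print("no metadata for:", missing_metadata)
print("no counts for:", missing_counts)
```

Running such a check before any condition-aware QC catches naming mismatches early.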


QC with condition info

Clustering of the distance between samples based on transformed counts can reveal sample errors.

[Figure panels: VST transformed; rLog transformed]

The colour scale shows the distance measure between samples. Similar conditions should cluster together.
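The distance measure between samples can be sketched as below. The slides use VST or rLog transformed counts; this example uses log2(count + 1) as a simple stand-in, with the Euclidean distance as an assumed measure, and the sample names and counts are invented.

```python
import math


def sample_distances(transformed):
    """Euclidean distance between every pair of samples.

    transformed maps a sample name to its list of transformed counts,
    with the genes in the same order for every sample.
    """
    names = list(transformed)
    return {(a, b): math.dist(transformed[a], transformed[b])
            for a in names for b in names}


# Invented raw counts for three genes in three samples.
raw = {
    "WT_1": [100, 200, 50],
    "WT_2": [110, 190, 55],
    "mut_1": [400, 20, 50],
}
# Stand-in transformation (assumption): log2 with a pseudocount of 1.
transformed = {name: [math.log2(count + 1) for count in counts]
               for name, counts in raw.items()}

distances = sample_distances(transformed)
print(f"WT_1 vs WT_2:  {distances[('WT_1', 'WT_2')]:.2f}")
print(f"WT_1 vs mut_1: {distances[('WT_1', 'mut_1')]:.2f}")
```

A clustering tool then groups the samples with the smallest mutual distances; a sample that ends up far from its own condition is a candidate sample error.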


QC with condition info

Clustering of transformed counts can reveal sample errors.

[Figure panels: VST transformed; rLog transformed]

Biological samples should cluster together.


QC with condition info

Principal component (PC) analysis makes it possible to display the samples in a 2D scatterplot based on the variability between them. Samples that lie close together resemble each other more.


Collect enough metadata


Why do these lie so close together?


You can never collect enough

During library preparation, collect as much information as possible to add to the sample descriptions. Pay particular attention to differences between samples, e.g. day of preparation, centrifuges used, ...



In the QC of the count table, you can map this additional info onto the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect).

Additional metadata


[Figure legend: Day 1, Day 2]



Days are included and give us more insight.


Next step

Now that we know our data inside out, we can run a DE algorithm on the count table!


Keywords

Raw counts

Count table

Overlapping features

Density graph

Rarefaction curve

Count transformation

VST

Sample metadata

PCA plot

Write in your own words what the terms mean


Exercises

● → Extracting counts and doing QC

http://wiki.bits.vib.be/index.php/RNA-Seq_analysis_for_differential_expression#Extracting_counts_and_investigating_experimental_factors


Break
