Page 1: Next Generation Sequencing - School of Public Healthcavanr/NGSlecturepubh74452013.pdf · 2013. 11. 22. · Next generation sequencing The fragments are then size selected, puri ed

Next Generation Sequencing

Cavan Reilly

November 22, 2013


Table of contents

Next generation sequencing

NGS and microarrays

Study design

Quality assessment

Burrows Wheeler transform
  BWT example

Introduction to UNIX-like operating systems

Installing programs
  Bowtie
  SAMtools

RNA-Seq
  Aligning reads to the transcriptome
  TopHat
  Cufflinks
  RNA-Seq using TopHat and Cufflinks
  RNAseq and count models

DNA-Seq
  BCFtools
  Variant Pattern Analyzer
  Genome Analysis Toolkit
  Microbiomics

Next generation sequencing

Over the last 5 years or so there has been rapid development of methods for next generation sequencing.

Here is the process for the Illumina technology (one of the 3 major producers of platforms for next generation sequencing).

The biological sample (e.g. a sample of mRNA molecules) is first randomly fragmented into short molecules.

Then the ends of the fragments are adenylated and adaptor oligos are attached to the ends.

Next generation sequencing

The fragments are then size selected, purified and put on a flow cell.

An Illumina flow cell has 8 lanes and is covered with other oligonucleotides that bind to the adaptors that have been ligated to the fragmented nucleotide molecules from the sample.

The bound fragments are then extended to make copies and these copies bind to the surface of the flow cell.

This is continued until there are many copies of the original fragment, resulting in a collection of hundreds of millions of clusters.

Next generation sequencing

The reverse strands are then cleaved off and washed away and sequencing primer is hybridized to the bound DNA.

The individual clusters are then sequenced in parallel, base by base, by hybridizing fluorescently labeled nucleotides.

After each round of extension of the nucleotides a laser excites all of the clusters and a read is made of the base that was just added at each cluster.

If a very short sequence is bound to the flow cell it is possible that the machine will sequence the adaptor sequence; this is referred to as adaptor contamination.

There is also a measure of the quality of the read that is saved along with the read itself.

Next generation sequencing

These quality measures are on the PHRED scale, so if the estimated probability of an error is p, the PHRED based score is -10 log10(p).
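As a small illustration of the PHRED scale, the sketch below converts between error probabilities and scores. It is written in Python rather than R, and the function names `phred_from_prob` and `prob_from_phred` are illustrative choices, not part of any package used in these notes.

```python
import math

def phred_from_prob(p):
    """PHRED score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def prob_from_phred(q):
    """Error probability implied by a PHRED score q."""
    return 10 ** (-q / 10)

# Print the score for a few error probabilities.
for p in (0.1, 0.01, 0.001):
    print(p, phred_from_prob(p))
```

A score of 20 therefore corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to a 1 in 1000 chance.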

If we fragment someone’s DNA and sequence the fragments, and can then put the fragments back together, we obtain the sequence of that person’s genome or transcriptome.

This would allow us to determine what alleles this subject has at every locus that displays variation among humans (e.g. SNPs).

Next generation sequencing

There are a number of popular algorithms for putting all of the fragments back together: BWA, Maq, SOAP, ELAND and Bowtie.

We’ll discuss Bowtie in some detail later (BWA uses the same ideas as Bowtie).

ELAND is a proprietary algorithm from Illumina; Maq and SOAP use hash tables and are considerably slower than Bowtie.


Applications

There are many applications of this basic idea:

1. resequencing (DNA-seq)

2. gene expression (RNA-seq)

3. miRNA discovery and quantitation

4. DNA methylation studies

5. ChIP-seq studies

6. metagenomics

7. ultra-deep sequencing of viral genomes

We will focus on resequencing (and SNP calling) and gene expression in this course.

NGS and microarrays

Some have talked about the death of microarrays due to the use of NGS in the context of gene expression studies, but this hasn’t happened yet.

Compared to microarrays, RNA-seq offers

1. higher sensitivity

2. higher dynamic range

3. lower technical variation

4. no need for a sequenced genome (but it helps, a lot)

5. more information about different exon uses (more than 90% of human genes with more than 1 exon have alternative isoforms)

NGS and microarrays

In one head to head comparison, 30% more genes were found to be differentially expressed with the same FDR (Marioni, Mason, Mane, et al. (2008)).

These authors also found that the correlation between normalized intensity values from a microarray and the logarithm of the counts for the same transcript was 0.7-0.8 (this is similar to what others have reported).

The largest differences occurred when the read count was low and the microarray intensity was high, which probably reflects cross-hybridization in the microarray.

NGS and microarrays

Others have found slightly higher correlations; here is a figure from a study by Su, Li, Chen, et al. (2011).

Study design

Before addressing the technical details, we will outline some considerations regarding study design specific to NGS technology.

There are several major manufacturers of the technology necessary to generate short reads, and they all have different platforms with resulting differences in the data and processing methods. A few of the major vendors are

1. Illumina, whose technology was formerly known as Solexa sequencing

2. Roche, which owns 454 Life Sciences (this platform is being discontinued)

3. Life Technologies, which supports the SOLiD platform

Study design

We will focus on the Illumina technology as the University of Minnesota has invested in this platform (the HiSeq 2000 and MiSeq platforms are available in the UMN Genomics Center).

The MiSeq machine produces longer reads than the HiSeq 2000.

We also have some capabilities to do 454 sequencing here.

The 454 platform produces longer reads (up to several hundred bases and growing), and is very popular in the microbiomics literature.

The director of the UMN Genomics Center is giving a talk this Friday 2:30-3:30 (Keller 3-125) on their capabilities, with an emphasis on microbiomics.

Study design

By sequencing larger numbers of fragments one can estimate the sequence more accurately.

The rationale for high sequence coverage is that with more fragments it is more likely that at least one fragment will cover any random location in the genome.

Clearly one wants to cover the entire genome if the goal is to sequence it; the depth of coverage is the number of overlapping fragments that cover a random location.

By deep sequencing, we mean coverage of 30X to 40X, whereas coverage of 4X or so would be considered low coverage.

Study design

The rule for determining coverage is:

coverage = (length of reads × number of reads) / (length of genome)
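The rule above is simple arithmetic, and a short Python sketch may make the budget tradeoff concrete. The function names, the 100 base read length, and the round 3 billion base genome size are illustrative assumptions. The last function uses an idealized Poisson model of read placement (also an assumption, not something stated in these slides) to approximate the chance that a given position receives no reads at all.

```python
import math

def coverage(read_length, n_reads, genome_length):
    """Average depth of coverage from the rule above."""
    return read_length * n_reads / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)

def prob_uncovered(c):
    """Chance a position is covered by no read, assuming reads land
    uniformly at random (Poisson approximation with mean depth c)."""
    return math.exp(-c)

genome = 3_000_000_000  # assumed round figure for a human-sized genome

print(coverage(100, 900_000_000, genome))   # average depth for 900 million 100 base reads
print(reads_needed(4, 100, genome))         # reads needed for 4X coverage
print(prob_uncovered(4))                    # fraction of positions missed at 4X
```

Under these assumptions, 4X coverage leaves roughly 2% of positions without any read, which is one way to see why low coverage limits what can be said about a single subject.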

For a given sequencing budget, there is a tradeoff between the depth of coverage and the number of subjects one can sequence.

The objectives of the study should be considered when determining the depth of coverage.

If the goal is to accurately sequence a tumor cell then deep coverage is appropriate; however, if the goal is to identify rare variants, then lower coverage is acceptable (so that more subjects can be sequenced).

Study design

For example, the 1000 Genomes project is using about 4X coverage.

A reasonable rule of thumb is that one probably needs 20-30X coverage to get false negatives less than 1% in nonrepeat regions of a diploid genome (Li, Ruan and Durbin, 2008).

If the organism one is sequencing does not have a sequenced genome, the calculations are considerably more complex and one would want greater coverage in this case.

Throughout this treatment we will suppose that the organism under investigation has a sequenced genome.

Study design

Another choice when designing your study regards the use of mate-pairs.

With the Illumina system one can sequence both ends of a fragment simultaneously.

This greatly improves the accuracy of the method and is highly recommended.

If the goal is to characterize copy number variations then getting paired end sequences is crucial.

Quality assessment

We can use the ShortRead package in R to conduct quality assessment and get rid of data that doesn’t meet quality criteria.

To explore this we will use some data from dairy cows.

You can find a description of the data along with the actual data in sra format at this site:

http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP012475

While the NCBI uses the sra format to store data, the standard format for storing data that arises from a sequencing experiment is the fastq format.

We can convert these files to fastq files using a number of tools; here is how to use one supported at NCBI called fastq-dump.

Quality assessment

To use this you need to download the compressed file, unpack it and then enter the directory that holds the executable and issue the command (more on these steps later).

sratoolkit.2.1.15-centos_linux64/bin/fastq-dump SRR490771.sra

This will generate fastq files which hold the data.

If we open the fastq files we see they hold identifiers and information about the reads followed by the actual reads:

Quality assessment

@SRR490756.1 HWI-EAS283:8:1:9:19 length=36
ACGAGCTGANGCGCCGCGGGAAGAGGGCCTACGGGG
+SRR490756.1 HWI-EAS283:8:1:9:19 length=36
AB7A5@A47%8@2@@296263@/.(3'=49<9<8,[...]
@SRR490756.2 HWI-EAS283:8:1:9:1628 length=36
CCTAATAATNTTTTCCTTTACCCTCTCCACATTAAT
+SRR490756.2 HWI-EAS283:8:1:9:1628 length=36
BCCC@CB>)%)43?BBCB@:>[...]
@SRR490756.3 HWI-EAS283:8:1:9:198 length=36
ATTGAATTTGAGTGTAAACATTCACATAAAGAGAGA
+SRR490756.3 HWI-EAS283:8:1:9:198 length=36
BBB<9ABCA<A2?5>BBB?BBB@BABB>?A*?9?<B

The string of letters and symbols encodes quality information for each base in the sequence.
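To make the encoding concrete, here is a minimal Python sketch that turns a quality string into PHRED scores. It assumes the common encoding in which each score is the character's ASCII code minus 33; that offset is an assumption about these files (older Illumina pipelines used other offsets), and the function names are illustrative. The quality string is taken from the third read shown above.

```python
def decode_quality(qual, offset=33):
    """Convert a fastq quality string to a list of PHRED scores.
    The offset of 33 is an assumption about how the file was encoded."""
    return [ord(ch) - offset for ch in qual]

def error_probs(scores):
    """Error probability implied by each PHRED score."""
    return [10 ** (-q / 10) for q in scores]

# Quality string of read SRR490756.3 from the listing above.
qual = "BBB<9ABCA<A2?5>BBB?BBB@BABB>?A*?9?<B"
scores = decode_quality(qual)

print(scores[:5])                # scores for the first five bases
print(min(scores), max(scores))  # worst and best base in the read
print(error_probs(scores)[:5])   # corresponding error probabilities
```

Under this assumption a letter such as B marks a high quality base call, while symbols such as * mark a poor one.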

Quality assessment

Then we define the directory where the data is and read in all of the filenames that are there:

> library(ShortRead)
> dataDir <- "/home/cavan/Documents/NGS/bovineRNA-seq"
> fastqDir <- file.path(dataDir, "fastq")
> fls <- list.files(fastqDir, "fastq$", full=TRUE)
> names(fls) <- sub(".fastq", "", basename(fls))

Quality assessment

Then we apply the quality assessment function, qa, to each of the files after we read it into R using the readFastq function

> qas <- lapply(seq_along(fls), function(i, fls)
+     qa(readFastq(fls[i]), names(fls)[i]), fls)

This can take a long time as it has to read all of these files into R, but note that we do not attempt to read the contents of the files into the R session simultaneously (a couple of objects generated by the call to readFastq stored inside R will really bog it down).

Then we rbind the quality assessments together and save to an external location.

> qa <- do.call(rbind, qas)

> save(qa, file=file.path("/home/cavan/Documents/NGS/bovineRNA-seq","qa.rda"))

> report(qa, "/home/cavan/Documents/NGS/bovineRNA-seq/reportDir")

[1] "/home/cavan/Documents/NGS/bovineRNA-seq/reportDir/index.html"

Quality assessment

This generates a directory called reportDir and this directory will hold images and an html file that you can open with a browser to inspect the quality report.

So we now use the browser to do this; we see the data from run 12 is of considerably higher quality than the others but nothing is too bad.

One can implement soft trimming of the sequences to get rid of basecalls that are of poor quality.

A function to do this is as follows (thanks to Jeremy Leipzig who posted this on his blog).

Note that this can require a lot of computational resources for each fastq file.

Quality assessment

# softTrim
# trim first position lower than minQuality and all subsequent positions
# omit sequences that after trimming are shorter than minLength
# left trim to firstBase, (1 implies no left trim)
# input: ShortReadQ reads
#        integer minQuality
#        integer firstBase
#        integer minLength
# output: ShortReadQ trimmed reads

library("ShortRead")

softTrim <- function(reads, minQuality, firstBase=1, minLength=5){
  qualMat <- as(FastqQuality(quality(quality(reads))), 'matrix')
  qualList <- split(qualMat, row(qualMat))
  ends <- as.integer(lapply(qualList, function(x){which(x < minQuality)[1]-1}))
  # length=end-start+1, so set start to no more than length+1 to avoid
  # negative-length
  starts <- as.integer(lapply(ends, function(x){min(x+1, firstBase)}))
  # use whatever QualityScore subclass is sent
  newQ <- ShortReadQ(sread=subseq(sread(reads), start=starts, end=ends),
                     quality=new(Class=class(quality(reads)),
                                 quality=subseq(quality(quality(reads)),
                                                start=starts, end=ends)),
                     id=id(reads))
  # apply minLength using srFilter
  lengthCutoff <- srFilter(function(x) {width(x) >= minLength}, name="length cutoff")
  newQ[lengthCutoff(newQ)]
}

To use:

> library("ShortRead")
> source("softTrimFunction.R")
> reads <- readFastq("myreads.fq")
> trimmedReads <- softTrim(reads=reads, minQuality=5, firstBase=4, minLength=3)
> writeFastq(trimmedReads, file="trimmed.fq")
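The trimming rule itself is easy to see on a single read. The Python sketch below is a simplified illustration of the idea (cut at the first base whose quality falls below the threshold, optionally drop leading bases, discard reads that end up too short). It is not a port of the ShortRead function above: it works on one read with already decoded integer scores, and the name `soft_trim` and the example read are hypothetical.

```python
def soft_trim(seq, quals, min_quality, first_base=1, min_length=5):
    """Trim one read at the first base with quality below min_quality.

    seq         : the read as a string
    quals       : list of integer quality scores, one per base
    first_base  : 1-based position of the first base to keep
    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read
    is shorter than min_length.
    """
    # Index of the first low-quality base; keep the whole read if none.
    end = next((i for i, q in enumerate(quals) if q < min_quality), len(seq))
    # Never start past the end of what is kept.
    start = min(end, first_base - 1)
    trimmed_seq, trimmed_quals = seq[start:end], quals[start:end]
    if len(trimmed_seq) < min_length:
        return None
    return trimmed_seq, trimmed_quals

read = "ACGTACGT"
quals = [30, 30, 30, 30, 2, 30, 30, 30]
print(soft_trim(read, quals, min_quality=5, min_length=3))
print(soft_trim(read, quals, min_quality=5, first_base=2, min_length=3))
print(soft_trim(read, quals, min_quality=5))
```

In this example the fifth base has quality 2, so only the first four bases survive, and with the default minimum length of 5 the read is dropped altogether.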

Bowtie background

The keys to understanding the algorithm implemented by Bowtie are the Burrows Wheeler transform (BWT) and the FM index.

1. the FM index: see Ferragina, P. and Manzini, G. (2000), “Opportunistic data structures with applications”

2. the Burrows Wheeler transform (BWT): see Burrows, M. and Wheeler, D.J. (1994), “A Block-sorting lossless data compression algorithm”.

The Burrows Wheeler transform is an invertible transformation of a string (i.e. a sequence of letters).

Computing the BWT

To compute this transform, one

1. determines all rotations of the string

2. sorts the rotations lexicographically to generate an array M

3. saves the last letter of each sorted string in addition to the row of M that corresponds to the original string.

Computing the BWT: an example

Some descriptions of the algorithm introduce a special character (e.g. $) that is lexicographically smaller than all other letters, then append this letter to the string before starting the rest of the transform.

This guarantees that the first row of M is the special character followed by the original string, so that one needn’t store which row of M corresponds to the original string.

To see how the BWT works, let’s consider the consensus sequence just upstream of the transcription start site in humans (from Zhang 1998):

GCCACC

We will use the modification that introduces the special character.

BWT example

First add the special character to the end of the sequence we want to transform, denoted S with length N.

GCCACC$

then get all of the rotations:

$GCCACC
C$GCCAC
CC$GCCA
ACC$GCC
CACC$GC
CCACC$G
GCCACC$

BWT example

then sort them (i.e. alphabetize these sequences)

$GCCACC
ACC$GCC
C$GCCAC
CACC$GC
CC$GCCA
CCACC$G
GCCACC$

so the BWT is just the last column, typically denoted L, and we will refer to element i of this array as L[i] for i = 1, ..., N and ranges of elements via L[1, ..., k].

L=CCCCAG$
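The three steps (rotate, sort, take the last column) can be written directly in a few lines of Python. This is a naive sketch for small examples only: it stores every rotation, which is far too expensive for a genome, and the function name `bwt` is an illustrative choice. It relies on the fact that Python's default string ordering places $ before the letters A, C, G and T.

```python
def bwt(s, sentinel="$"):
    """Naive Burrows Wheeler transform of s with a sentinel appended.

    Returns the transformed string L and the sorted rotation array M.
    """
    t = s + sentinel
    # Step 1: all rotations of the string.
    rotations = [t[i:] + t[:i] for i in range(len(t))]
    # Step 2: sort the rotations lexicographically.
    M = sorted(rotations)
    # Step 3: keep the last letter of each sorted rotation.
    L = "".join(row[-1] for row in M)
    return L, M

L, M = bwt("GCCACC")
for row in M:
    print(row)
print("L =", L)
```

For the example sequence this reproduces the sorted array and the last column shown above.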

BWT example

Importantly one can invert this transformation of the string to obtain the original string.

This was first developed in the context of data compression, so the idea was to transform the data, then compress it (which can be done very efficiently after applying this transformation), then one can invert the transformation after uncompressing.

BWT example

Suppose we have the BWT of a genome and want to count the number of occurrences of some pattern, e.g. AC, our query sequence.

We can do this as follows.

First, create a table that for each letter has the count of the number of occurrences of letters that are lexically smaller than that letter. For us this is just

x      $  A  C  G
C(x)   0  1  2  6

BWT example

Next we need to be able to determine the number of times that a letter x occurs in L[1, ..., k] for any k, which we will denote O(x, k).

Note that all of these calculations can be done once and stored for future use.

Now search for the last letter of our query sequence, which is C, over the range of rows given by [C(x) + 1, ..., C(x + 1)], where x + 1 denotes the letter that follows x in the alphabet; here x is C so the range is [3, 6].

Note that these are the rows of M that start with a C although we don’t actually need to store all of M to determine this (it just follows from the alphabetization).

BWT example

Then process the second to last letter, which for our example is A.

To determine the range of rows of M to search we look at [C(x) + O(x, s1 - 1) + 1, ..., C(x) + O(x, s2)] where x is A and s1 and s2 are the endpoints of the interval from the previous step. Now

O(A, 3 - 1) = O(A, 2) = 0

and

O(A, 6) = 1

so the interval is [1 + 0 + 1, ..., 1 + 1] = [2, 2]. Since this is the end of the query sequence, we determine the number of occurrences by looking at the size of this interval, which is here just 1.

BWT example

If the range becomes empty, i.e. if the lower end exceeds the upper end, the query sequence is not present.

For example if we query on CCC and proceed as previously, the first step is the same (since this also ends in C), so then searching for C again we find

[C(C) + O(C, s1 - 1) + 1, C(C) + O(C, s2)]

Now O(C, 2) = 2 and O(C, 6) = 4 so the interval is

[2 + 2 + 1, 2 + 4] = [5, 6]

which has size 2 (as there are 2 instances of CC in the sequence), so then proceed as in step 2 again to get

[C(C) + O(C, 5 - 1) + 1, C(C) + O(C, 6)]

which is

[2 + 4 + 1, 2 + 4] = [7, 6]

indicating that this sequence is not present.
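The counting procedure worked through above can be sketched in Python as follows. The function names are illustrative, and for brevity O(x, k) is computed by rescanning L each time rather than being calculated once and stored, so this shows the logic rather than an efficient implementation. Row numbers are 1-based to match the text, and the search starts from the full range [1, N], which gives the same first interval as the [C(x) + 1, ..., C(x + 1)] rule.

```python
def build_tables(L):
    """Return the table C and a function O(x, k) for the BWT string L."""
    C, total = {}, 0
    for letter in sorted(set(L)):
        C[letter] = total          # number of letters in L smaller than this one
        total += L.count(letter)

    def O(x, k):
        # Number of times x occurs in L[1..k] (1-based, inclusive).
        return L[:k].count(x)

    return C, O

def count_occurrences(L, query):
    """Count occurrences of query in the original string using only L."""
    C, O = build_tables(L)
    s1, s2 = 1, len(L)             # start with every row of M
    for x in reversed(query):      # process the query from its last letter
        if x not in C:
            return 0
        s1, s2 = C[x] + O(x, s1 - 1) + 1, C[x] + O(x, s2)
        if s1 > s2:                # empty interval: query is absent
            return 0
    return s2 - s1 + 1

L = "CCCCAG$"
for q in ("AC", "CC", "CCC"):
    print(q, count_occurrences(L, q))
```

Tracing the query AC through this code gives the intervals [3, 6] and then [2, 2], and the query CCC ends with the empty interval [7, 6], in agreement with the hand calculation.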

BWT example

For aligning short sequences to a genome we need to know where the short segments align, hence we need to determine the location of the start of the sequence in the genome.

If the locations of some of the letters are known we can just trace back to those locations as follows.

First, note that we know the first column of the array M. We know this due to 2 facts:

1. the first column will be strictly alphabetized

2. every letter in the sequence is represented in the last column L.

BWT example

For our problem we have L=CCCCAG$ and so the first column is just F=$ACCCCG.

As our sequence is of length N, then if we knew the permutation of 1, ..., N that mapped F to L we would know the entire sequence (as we will see shortly).

That one can determine this mapping is the key ingredient in computing the inverse BWT. Call this map T where we have

F(T(i)) = L(i).

BWT example

To find T, first consider M and a new array, call it M′, that is obtained from M by taking the last element from each row and making it the first element but otherwise changing nothing.

Here is M′:

C$GCCAC
CACC$GC
CC$GCCA
CCACC$G
ACC$GCC
GCCACC$
$GCCACC

Inverting the BWT

Note that every row of M has a corresponding row in M′ and, like M, M′ has all cyclic rotations of the original sequence S.

Also note that the relative order of the rows with the same first letter is the same in both arrays: rows 3, 4, 5 and 6 in M become rows 1, 2, 3 and 4 in M′. We can use this insight to determine the mapping from F to L, i.e. T.

To determine T, we proceed sequentially as follows. L(1) = C and this is the first C in L so T(1) = i where F(i) is the first C in F; looking at F we see that i = 3.

BWT example

Then continue with this thinking: T(2) = i and L(2) is the second C in L, but then looking at F we see that the second C in F occurs at position 4, so that T(2) = 4.

Continuing in this fashion we find that

T = (3, 4, 5, 6, 2, 7, 1).

BWT example

We can then reconstruct the sequence in reverse order.

We know the last element of the sequence is the last element of the first row of M which is the first element of L, which is here C.

Then we proceed sequentially using the mapping T as follows

S(N - i) = L(T^i(1))

where T^i(x) = T^(i-1)(T(x)), i.e. repeatedly apply the mapping T to itself.

BWT example

Applying this to our example gives

S(N - 1) = L(T(1)) = L(3) = C

S(N - 2) = L(T(3)) = L(5) = A

S(N - 3) = L(T(5)) = L(2) = C

S(N - 4) = L(T(2)) = L(4) = C

S(N - 5) = L(T(4)) = L(6) = G

If we were just trying to find the location of a particular match, like AC, we could just stop at that point rather than recover the entire sequence.
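The inversion can also be sketched in Python. The mapping T is obtained by matching the k-th occurrence of each letter in L with the k-th occurrence of the same letter in F, which a stable sort of the positions of L does in one step. The function names are illustrative, indices inside the code are 0-based (so T is printed with 1 added to match the text), and the sketch assumes the sentinel $ was appended before the transform.

```python
def lf_mapping(L):
    """Return T (0-based) with F[T[i]] == L[i], matching equal letters in order."""
    # Sorting positions of L by letter is stable in Python, so the k-th C in L
    # is paired with the k-th C in F = sorted(L).
    order = sorted(range(len(L)), key=lambda i: L[i])
    T = [0] * len(L)
    for f_index, l_index in enumerate(order):
        T[l_index] = f_index
    return T

def inverse_bwt(L):
    """Recover the original string (without the sentinel) from L."""
    T = lf_mapping(L)
    letters = []
    i = 0                      # row 1 of M: its last letter is the last letter of S
    for _ in range(len(L) - 1):
        letters.append(L[i])   # read the sequence from its end towards its start
        i = T[i]
    return "".join(reversed(letters))

L = "CCCCAG$"
print([t + 1 for t in lf_mapping(L)])   # T in the 1-based notation of the text
print(inverse_bwt(L))
```

Stopping the loop early, rather than running it to completion, corresponds to tracing back only as far as a position whose location in the genome is known.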

FM index

The FM index is a compressed version of L, C and O along with a compressed version of a set of indices with known locations in S.

This allows one to trace back to a known location.

So one could align short reads to a genome by first computing the BWT and compressing the results.

Unfortunately many of the reads would not align due to sequencing errors and polymorphisms.

Hence one needs to allow some discrepancies between the genome and the short reads.

While this is adjustable, the default specifications for the program Bowtie are that there is a maximum of 2 mismatches in the first 28 bases and a maximum total quality score, summed over all mismatched positions, of 70 on the PHRED scale.

Introduction to Linux

In order to use the latest tools in bioinformatics you need access to a Linux based operating system (or Mac OS X).

So we will give some background on how to use a Linux based system.

The Linux operating system (originally written by Linus Torvalds) is modeled on the UNIX operating system, but is freely available (UNIX was developed by a private company).

There are a number of distributions of Linux that go by different names: e.g. Ubuntu, Debian, Fedora, and openSUSE.

Introduction to Linux

Most beginners choose Ubuntu (or one of its offshoots, like Mint).

I use Fedora on my computers and CentOS is used on the server we will use for this course.

You can install Linux for free on any PC.

Such systems are based on a hierarchical structure for organizing files into directories.

Much of the time we will use plain text files: these are files that have no formatting, unlike the files you use in a typical word processing program.

To edit such files you need to use a text editor. I like vi and vim (which has capabilities to interact with large files).

Introduction to Linux

The editor pico is simple to use and installed on the server we will use for this class.

To use this you type pico followed by the file name: if the file doesn’t exist it will create the file, otherwise it will open the file so that you can edit it.

For example

[cavanr@boris ~]$ pico temp1

will open the file temp1 for editing. Commands at the bottom of the window tell you how to conduct various operations.

Introduction to Linux

You can create subdirectories in your home directory by typing

[cavanr@boris ~]$ mkdir NGS

This would create an NGS subdirectory where you can put files related to your NGS project. Then to keep your microarray data separate you could do

[cavanr@boris ~]$ mkdir microarray

To move around the hierarchical directory structure you use the command cd. We enter the NGS directory with

[cavanr@boris ~]$ cd NGS

Introduction to Linux

To go back to your home location (i.e. the directory you start from when you log on) you just type cd, and if you want to go up one level in the hierarchy you type cd .. (cd followed by two periods).

To list all of the files and directories in your current location you type ls.

To copy a file you use cp, to move a file from one location to another you use mv and to remove a file you use rm.

One can modify the behavior of these commands by using options.

For example, if I want to see the size of the files (and some additional information) rather than just list them I could issue the command

[cavanr@boris ~]$ ls -l

Other useful commands include more and less for viewing the contents of files and man for information about a command (from the manual). For instance

[cavanr@boris ~]$ man ls

will open a help file with information about the options that are available for the ls command.

Wild cards are characters that operating systems use to stand in for a variable value. These are extremely useful as they let the user specify many tasks without having to do a bunch of repetitive typing.

For example, suppose you had a bunch of files that ended with .txt and you wanted to delete them. You could issue commands like

[cavanr@boris ~]$ rm file1.txt

[cavanr@boris ~]$ rm file2.txt

[cavanr@boris ~]$ rm file3.txt

...

Or you could just use

[cavanr@boris ~]$ rm file*txt

or even

[cavanr@boris ~]$ rm *txt
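Because a wild card can match more than intended, it can help to preview what a pattern expands to before deleting anything. The sketch below does this in a scratch directory; all of the file names are made up for illustration.

```shell
# Scratch directory with made-up file names, for illustration only.
demo=$(mktemp -d)
cd "$demo"
touch file1.txt file2.txt file3.txt notes.txt data.csv

# echo shows what the shell expands the pattern to, without deleting anything.
matched=$(echo file*txt)
echo "file*txt matches: $matched"

rm file*txt                   # removes only the files that matched above
remaining=$(echo *)           # what is left in the directory
echo "left after rm: $remaining"
```

The pattern *txt matches any name that ends in txt, with or without a dot before it, so *.txt is the stricter choice when only files with the .txt extension should be removed.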

Most Linux configurations will check that you really want to remove the files, but if you are removing many files, typing yes repeatedly is unnecessary. Just type

[cavanr@boris ~]$ rm -f *txt

When you obtain programs from the internet that run on Linux systems these programs will typically install many subdirectories, so if you want to get rid of all the contents of a directory, first enter that directory,

[cavanr@boris ~]$ cd oldDir

then issue the command

[cavanr@boris oldDir]$ rm -rf *

then go back up one level

[cavanr@boris oldDir]$ cd ..

and get rid of the directory using the rmdir command

[cavanr@boris ~]$ rmdir oldDir

If you issue a command to run a program but wish you hadn’t and want it to stop, press the Ctrl and C keys simultaneously.

To connect to a Linux server from a Windows machine you need a Windows program that allows you to log on to a terminal and the ability to transfer files between the remote server and your Windows machine.

WinSCP is a free program that allows one to do this.

You can download the installer from the WinSCP website and it installs like a regular Windows program: just allow your computer to install it and go with the default configuration.

To get a terminal window you can use a program called PuTTY.

So download the program PuTTY (which is just the file called PuTTY.exe) and create a subdirectory in your Program Files directory called PuTTY.

Then save PuTTY.exe in the directory you just created.

Once installed, start WinSCP and use boris.ccbr.umn.edu as the hostname along with your user name and password.

You can then get a terminal window using PuTTY: just click on the box that shows 2 connected computers, right below the session tab.

Then disconnect when you are done (using the option in the session tab).

Installation of Bowtie

Recall: Bowtie is a program that aligns short reads to an indexed genome.

Bowtie2 is a more up to date version that allows gapped alignments and Smith-Waterman type scoring, whereas Bowtie just allows a couple of nucleotides to disagree between the read and the genome.

To use Bowtie we need a set of short reads (in fastq format), the index of a genome and of course the program itself.

We will download a compressed version of the program (an executable, i.e. a program that does something) and after we uncompress it we can use the executable.

Bowtie has the capability to build indexes if we have the genome sequence.

First set up a directory to hold all of this work

[cavanr@boris ~]$ mkdir RNAseq

then create a directory to hold all of the programs we will download and set up; we will call it bin (for binaries).

[cavanr@boris ~]$ mkdir bin

Then enter this directory so we can install some programs.

[cavanr@boris ~]$ cd bin

In order to know what version of software you need to install you need to check the architecture of the computer. To do so we issue the following command

[cavanr@boris bin]$ uname -m

x86_64

So this is a 64 bit machine (32 bit machines will report something like i386). So get the appropriate version of Bowtie using the wget command

[cavanr@boris bin]$ wget http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.1.0/bowtie2-2.1.0-linux-x86_64.zip
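The architecture check can also be scripted. This is only a sketch: the list of machine names below is an assumption covering common cases, not an exhaustive table, and the Bowtie2 archive named above is specifically the x86_64 build.

```shell
# Map the machine type reported by uname -m to a word size.
arch=$(uname -m)
case "$arch" in
  x86_64|amd64|aarch64|arm64) bits=64 ;;       # common 64 bit names (assumed list)
  i386|i486|i586|i686)        bits=32 ;;       # common 32 bit names (assumed list)
  *)                          bits=unknown ;;  # anything else: check by hand
esac
echo "uname -m reports $arch (word size: $bits)"
```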

then unpack it

[cavanr@boris bin]$ unzip bowtie2-2.1.0-linux-x86_64.zip

Archive: bowtie2-2.1.0-linux-x86_64.zip

creating: bowtie2-2.1.0/

creating: bowtie2-2.1.0/scripts/

inflating: bowtie2-2.1.0/scripts/build_test.sh

inflating: bowtie2-2.1.0/scripts/make_a_thaliana_tair.sh

...

This will generate a subdirectory with the name bowtie2-2.1.0, so enter that subdirectory so that you can have access to the executable called bowtie2 in this subdirectory.

Aligning reads to an indexed genome

The program ships with some data from the bacteriophage Enterobacteria phage lambda, so we will check the installation with this.

We will first use a fasta file that ships with Bowtie to construct the index files for this organism.

To use the program, you type the name of the executable, then give the location and name of the fasta file, then give the index name.

So to test the installation type the following

[cavanr@boris bowtie2-2.1.0]$ ./bowtie2-build example/reference/lambda_virus.fa lambda_virus

Then you get a bunch of output written to the screen about the index creation process and it generates 6 new files: lambda_virus.1.bt2, lambda_virus.2.bt2, lambda_virus.3.bt2, lambda_virus.4.bt2, lambda_virus.rev.1.bt2 and lambda_virus.rev.2.bt2.

After that we are ready to align reads to the reference genome using the index we constructed.

Here the syntax is to give the name of the index after -x, the files with the reads after -U and the name of the output file after -S, which specifies that we want the output in the SAM file format.

[cavanr@boris bowtie2-2.1.0]$ ./bowtie2 -x lambda_virus -U example/reads/reads_1.fq -S eg1.sam

10000 reads; of these:

10000 (100.00%) were unpaired; of these:

596 (5.96%) aligned 0 times

9404 (94.04%) aligned exactly 1 time

0 (0.00%) aligned >1 times

94.04% overall alignment rate

We can also look at the output.

[cavanr@boris bowtie2-2.1.0]$ head eg1.sam

Illumina stores prebuilt indices for many commonly studied organisms at the location http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn

If one has a set of index files one can use the program bowtie2-inspect to recreate the fasta files.

The Illumina site has the fasta files for these organisms in the same location.

Altering your $PATH

Note that thus far we have needed to be in the same directory as our executable in order to use it: this is not necessary.

To have access to an executable anywhere in your system you should alter the PATH variable in your Linux environment to include the directory where you keep your executables, then copy the executables to that directory.

If you look at your PATH variable you will see that the bin directory we created is already on the path, so we can just copy executables to that location to use them.

Here is how to examine the PATH variable.

[user0001@boris bowtie2-2.1.0]$ echo $PATH

/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/export/home/courses/ph7445/user0001/bin
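If a directory is not already on the path, it can be added. The sketch below uses a throwaway directory and a made-up script called mytool (both hypothetical) to show the effect of appending a directory to PATH for the current session.

```shell
# Hypothetical tool directory and script, created only for this demonstration.
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from mytool"\n' > "$tooldir/mytool"
chmod +x "$tooldir/mytool"    # make the file executable

# Append the directory to PATH for the current session only.
PATH="$PATH:$tooldir"
export PATH

found=$(command -v mytool)    # full path of the executable the shell will run
output=$(mytool)              # no ./ and no directory name needed now
echo "$found -> $output"
```

A change made this way lasts only until you log out. With the bash shell, one common way to make it permanent is to put the PATH line in a startup file such as ~/.bashrc.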

Then copy the executables: note that once an executable is in a directory on the path we can run it using just its name, without the ./ at the start, i.e.

[cavanr@boris bowtie2-2.1.0]$ cp bowtie2 $HOME/bin/./bowtie2

[cavanr@boris bowtie2-2.1.0]$ cp bowtie2-align $HOME/bin/./bowtie2-align

[cavanr@boris bowtie2-2.1.0]$ cp bowtie2-build $HOME/bin/./bowtie2-build

[cavanr@boris bowtie2-2.1.0]$ cp bowtie2-inspect $HOME/bin/./bowtie2-inspect

To test that this works let’s go back to our home directory and issue our Bowtie2 commands again (note that we need to tell Bowtie where to find the fastq and index files)

[cavanr@boris ~]$ bowtie2 -x $HOME/bin/bowtie2-2.1.0/lambda_virus -U $HOME/bin/bowtie2-2.1.0/example/reads/reads_1.fq -S eg1.sam

and it works fine.

Producing SAM files

As we have seen, Bowtie2 can also produce a special file type that is widely used in the NGS world, a SAM file.

If we open our eg1.sam file we see that the first 3 rows are the header (since they start with @).

Then there is a row for each read and 11 mandatory tab delimited fields per line that provide information about the alignment, for example the 4th column gives the location where the read maps.
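To make the layout concrete, the following sketch builds a tiny SAM file and pulls out a few of the mandatory fields with awk. The header lines and the two records are invented for illustration (they are not real Bowtie2 output); the field order follows the SAM specification: read name, flag, reference name, position, and so on.

```shell
# Three header lines and two alignment records, invented for illustration.
sam=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:ref1\tLN:1000\n'
  printf '@PG\tID:demo\n'
  printf 'read1\t0\tref1\t101\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'
  printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII\n'
} > "$sam"

# Skip header lines (they start with @), then print the read name, the flag,
# the reference name, the mapping position and the number of fields.
summary=$(awk -F'\t' '!/^@/ { print $1, $2, $3, $4, "fields=" NF }' "$sam")
echo "$summary"
```

In this made-up file read1 maps at position 101 of ref1, while read2 has flag 4, which marks a read that did not align, so its reference name is * and its position is 0.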

For the full specification consult samtools.sourceforge.net/SAMv1.pdf.

SAMtools

The SAM (for sequence alignment/map) file type has a very strict format.

These files can be processed in a highly efficient manner using a suite of programs called SAMtools.

Here is how to set up these tools.

Download the latest version, samtools-0.1.19.tar.bz2, into your bin directory.

[cavanr@boris bin]$ wget http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2

then unpack

[cavanr@boris bin]$ tar -jxvf samtools-0.1.19.tar.bz2

then there is a directory full of C programs called samtools-0.1.19, so go in there and type

[cavanr@boris samtools-0.1.19]$ make

and write the executable to your path

[cavanr@boris samtools-0.1.19]$ cp samtools $HOME/bin/./samtools

SAMtools example

To use this for variant calling, we need an example with at least 2 subjects, so let’s use the example that ships with SAMtools.

We will return to this example later, but for now we will just take a look at the “pileup” which represents the extent of coverage.

Here is how to use SAMtools to index a FASTA file:

[cavanr@boris samtools-0.1.19]$ samtools faidx examples/ex1.fa

BAM and SAM files

then create a BAM file from the SAM file (BAM files are binary, so they are compressed but not readable by eye)

[cavanr@boris samtools-0.1.19]$ samtools import examples/ex1.fa.fai examples/ex1.sam.gz ex1.bam

You should get

[sam_header_read2] 2 sequences loaded

then index the BAM file

[cavanr@boris samtools-0.1.19]$ samtools index ex1.bam

then we can view the alignment to visually examine the extent of coverage and the depth of coverage

[cavanr@boris samtools-0.1.19]$ samtools tview ex1.bam examples/ex1.fa

and hit q to leave that visualization.

Visualization of the pileup

1         11        21        31        41        51        61        71

CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCT

................................................................................

.....................................A..............................C. ........

......................................................C................

................................... ................................... ..

..............T.T.T..T........T..TC. ................................

................................... ...............................

...........................C....... ..............................

.................................... .............................

...........................T....... ......................

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....................

...............................T... .....................

................................... ....................

.................................... ....................

..............................T.... ..............

..............................................

.....................................

..................

.

RNA-Seq

As was mentioned previously, we can use next generation sequencing to generate sequences from samples of RNA.

The transcriptome is the collection of all transcripts that can be generated from a genome.

Current estimates indicate that 92-94% of human transcripts with more than one exon have alternatively spliced isoforms (Wang, Sandberg, Luo, et al. Nature, 2008).

This means that the same gene can give rise to multiple mRNA molecules.

Some will differ because they have different exons assembled into an mRNA, but other possibilities include differences in use of the transcription start site (TSS), and differences in the 5’ and 3’ untranslated regions (UTRs).

Aligning reads to the transcriptome

Differences in exon usage will result in mRNAs from the same gene leading to different proteins, hence we speak of differential coding DNA sequences (CDS).

One of the differences that arises due to differential 3’ UTR usage is alternative polyadenylation patterns.

These differences in 5’ and 3’ UTR usage lead to mRNA molecules that have differential affinity for proteins that bind to the mature mRNA molecule and we are just starting to understand how biological processes are regulated by these subtle mechanisms.

RNA-Seq

Note that there is a difference in aligning a set of reads from an RNA sample and a DNA sample: the RNA molecules will have undergone removal of introns hence they won’t align directly to the genome.

If there is an annotated transcriptome for an organism you should start out trying to use this to analyze your data.

However most investigations find evidence for transcripts that are not currently annotated, even for well studied organisms like mice and humans.

Duplicates

One of the differences between using next generation sequencing for resequencing and profiling RNA levels is that with libraries prepared from RNA, one often finds duplicate reads.

In fact it is not unusual to find that 60-80% of the reads are duplicates.

In a resequencing study the chance of finding duplicates is quite low: for example, the human haploid genome is about 3 billion bps so the probability that 2 given reads start at the same position is about 1 in 3 billion, so even with 20-40 million reads only a small fraction of the reads (on the order of 1%) would be expected to be duplicates by chance.
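A rough calculation shows how rare chance duplicates are in a resequencing study. The sketch below rests on simplifying assumptions that are not part of the lecture: read start positions are uniform over 3 billion positions on a single strand, and paired ends are ignored. Under that model the chance that a given read shares its start with at least one of the other n - 1 reads is about 1 - exp(-(n - 1)/G).

```shell
# Simplified model (an assumption): uniform read starts over G positions.
chance_dup() {   # usage: chance_dup NUMBER_OF_READS
  awk -v n="$1" -v G=3e9 'BEGIN {
    # chance that a given read shares its start with at least one other read
    p = 1 - exp(-(n - 1) / G)
    printf "%.2f\n", 100 * p
  }'
}
pct20=$(chance_dup 20000000)   # percent of reads, with 20 million reads
pct40=$(chance_dup 40000000)   # percent of reads, with 40 million reads
echo "20 million reads: ${pct20}%   40 million reads: ${pct40}%"
```

Both values come out at roughly 1% or less, far below the 60-80% duplicate rates seen in RNA libraries.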

Duplicate reads can arise as a PCR artifact, hence in resequencing studies duplicate reads should probably be treated as artifacts and removed.

There is a SAMtools function called rmdup that can be used to do this, but read the manual on its use: for example, it is designed for paired end reads by default.

Library complexity

However, when we consider RNA-Seq libraries, duplicates might arise just due to low complexity of our sample.

By low complexity we mean there are not many distinct RNA molecules.

The median length of a human transcript is 2186 nucleotides (according to the UCSC hg19 annotation), so if one has a library with only 1 type of transcript and generates 20 million reads from it one is guaranteed to find duplicates.
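The guarantee is a counting argument, sketched below. The read length of 50 bases is an assumed value for illustration, and strand is ignored; the transcript length and read count are the ones quoted above.

```shell
transcript_len=2186      # median transcript length quoted above
read_len=50              # assumed read length, for illustration only
n_reads=20000000         # 20 million reads

# A read of length L can start at only (transcript_len - L + 1) positions.
distinct_starts=$(( transcript_len - read_len + 1 ))
# Every read beyond that number must repeat a start position already used.
min_duplicates=$(( n_reads - distinct_starts ))
echo "distinct starts: $distinct_starts, duplicates at least: $min_duplicates"
```

With only a couple of thousand possible start positions, essentially all of the 20 million reads are duplicates even though no PCR artifact is involved.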

So for analysis of RNA-Seq experiments it is probably unwise to remove duplicates.

Alignment algorithms

There are a number of algorithms that are designed to align a set of short reads to the transcriptome; here is a short non-exhaustive list.

1. TopHat

2. GSNAP

3. MapSplice

4. SpliceMap

5. HMMSplicer

It is not clear at this point if any of these really dominates the others in terms of performance.

However, TopHat is perhaps the most popular currently, and as it builds on Bowtie, we will use this program in this course.

TopHat

The basic idea behind TopHat is that if 2 segments of a read map to different locations on the genome then they are likely mapping to 2 different exons.

To use TopHat you need fastq files for all of your samples and the genome sequence for the organism.

Note that sometimes the data for a single sample will be in more than one fastq file.

While we can use Bowtie to align short reads to a genome, reads that span an intron (i.e. reads that cross a splice junction) will not align directly to the genome.

If a read contains an internal segment that should map to an exon, Bowtie will fail to map this read to its correct genomic location as it won’t match the genome outside of this exon.

For this reason, TopHat initially splits the reads up into 25 base units, does the read mapping, then reassembles reads.

Note: if you are using a set of reads that are shorter than 50 bases you should instruct TopHat to break the reads into fragments that are about half the length of your reads.

For example, if you have a set of Illumina reads that are 36 bases you should instruct TopHat to break them into fragments of length 18 (using the --segment-length option, like --segment-length 18), otherwise it may not detect any splice junctions.
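The idea of cutting a read into segments of half its length can be sketched with standard tools. The 36 base read below is made up, and this only illustrates the splitting step; it is not TopHat's own code, which may handle details such as leftover bases differently.

```shell
# A made-up 36 base read; 18 is half its length, as suggested above.
read_seq="ACGTACGTACGTACGTACGGTTTTGGGGCCCCAAAA"
read_len=${#read_seq}
seg_len=$(( read_len / 2 ))

# fold breaks a line into pieces of at most seg_len characters.
segments=$(printf '%s\n' "$read_seq" | fold -w "$seg_len")
n_segments=$(printf '%s\n' "$segments" | wc -l | tr -d ' ')
echo "read length $read_len, segment length $seg_len, $n_segments segments:"
printf '%s\n' "$segments"
```

Each segment is then short enough that it can fall entirely within one exon even when the full read crosses a splice junction.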

TopHat starts out by mapping the reads to the genome using Bowtie and sets aside those reads that don’t map.

It then constructs a set of disjoint covered regions (it no longer uses the program Maq as it did at the time of the publication): these are taken to be the initial set of exons observed in that sample.

The program determines a consensus sequence using the reads, however in low coverage regions it will use the reference genome to determine the sequence.

Also, as the ends of exons are likely only mapped by reads that span the splice junction, the disjoint regions covered by reads are expanded (45 bases by default) using the reference sequence.

For genes that are transcribed at low levels there could be gaps in the exons after the assembly step, hence exons that are very close are merged into a single exon.

Introns of length less than 70 bases are rare in mammalian genomes, hence by default we combine exons that are separated by less than this many bases (this can be adjusted).
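The merging rule can be sketched as follows. The covered regions are made up, coordinates are treated as inclusive (an assumption), and the code only illustrates the rule; it is not taken from TopHat.

```shell
# Made-up covered regions (start end), sorted by start, on one chromosome.
regions='100 250
300 420
600 700
760 900'

# Merge neighbours when the gap between them is shorter than min_intron bases.
islands=$(printf '%s\n' "$regions" | awk -v min_intron=70 '
  NR == 1 { start = $1; end = $2; next }
  {
    gap = $1 - end - 1                 # bases strictly between the two regions
    if (gap < min_intron) {            # too close: extend the current region
      if ($2 > end) end = $2
    } else {                           # far enough apart: start a new region
      print start, end
      start = $1; end = $2
    }
  }
  END { if (NR > 0) print start, end }')
printf '%s\n' "$islands"
```

Here the gaps of 49 and 59 bases are closed while the gap of 179 bases is kept, so four covered regions become two putative exons.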

The resulting set of putative exons (with their extensions from the reference genome) we will call islands.

The 5’ end of an intron is called a donor site and the 3’ end is called the acceptor site.

The donor site almost always consists of the 2 bases GU while the acceptor site is almost always AG, but the pairs GC-AG and AU-AC also occur, so current versions of TopHat also look for these donor and acceptor sites.

TopHat uses these sequence characteristics to find potential donor and acceptor site pairs in the island sequences (not just adjacent ones, but nearby ones to allow for alternative splicing).

It then looks at the set of reads that were initially unmapped and examines if any of these map to the sequence that spans the splice junction.

To conduct this mapping TopHat uses the fact that most introns are between 70 and 500,000 bases (the maximum differs from the publication): these intron sizes are based on mammals, so the user should specify values for these if using a different organism.
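A toy version of the donor and acceptor search is sketched below. It is not TopHat's implementation: the sequence is made up, only the canonical pair is searched (written GT and AG because the genome is DNA), and the length limits of 6 and 20 are deliberately tiny stand-ins for the 70 and 500,000 mentioned above.

```shell
# Made-up genomic sequence containing one GT and two AG dinucleotides.
seq="ACCTCGGTAGCCCCCCTCAGCTCCA"

# Report donor position, acceptor end position and intron length for every
# GT...AG pair whose length lies between min_len and max_len.
candidates=$(awk -v seq="$seq" -v min_len=6 -v max_len=20 'BEGIN {
  n = length(seq)
  for (d = 1; d < n; d++) {
    if (substr(seq, d, 2) != "GT") continue      # d = first base of the donor
    for (a = d + 2; a < n; a++) {
      if (substr(seq, a, 2) != "AG") continue    # a = first base of the acceptor
      len = (a + 1) - d + 1                      # intron runs from d to a + 1
      if (len >= min_len && len <= max_len) print d, a + 1, len
    }
  }
}')
echo "$candidates"
```

The AG that lies right next to the GT is rejected because the implied intron would be only 4 bases long, which shows how the size limits prune candidate junctions.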

It also requires that at least one read span at least 8 bases on each side of the splice junction by default, but this value can be adjusted.

It also only uses the first 28 bases from the 5’ end of the read to ensure that the read is of high quality, but this too can be adjusted.

Finally the program reports a splice junction only if it occurs in more than 15% of the depth of coverage of the flanking exons (the 15% value comes from the Wang et al. 2008, Nature paper), but this is also adjustable as this is based on human data.

If one supplies a transcriptome to TopHat via a GTF file, the program will first map reads to the transcriptome then try to identify splice junctions for the unmapped reads.

One can disable the subsequent search for splice junctions by specifying the -T option if you have provided a transcriptome.

If you are going to use an annotated transcriptome then TopHat will need to construct an index for this and if you are processing multiple samples TopHat will redo exactly the same calculation for each.

To save time, you should have TopHat do these calculations for the first sample and save the results.

You can do this using the --transcriptome-index option. This option expects a directory and a name to use in this directory.

The exact syntax is like this: note that the --transcriptome-index option must be used right after the -G option.

tophat -o out_sample1 -G known_genes.gtf \
    --transcriptome-index=transcriptome_data/known \
    hg19 sample1_1.fq.z

Then for processing the other samples we just tell TopHat where these indices are and don’t specify the -G option

tophat -o out_sample2 \
    --transcriptome-index=transcriptome_data/known \
    hg19 sample2_1.fq.z
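With many samples the two forms of the command can be generated in a loop. The sketch below is a dry run: it only prints the commands rather than running TopHat, it reuses the options and file name pattern from the example above, and sample3 is a hypothetical extra sample.

```shell
# Dry run: build and print the commands instead of executing them.
index=transcriptome_data/known
first=yes
commands=""
for sample in sample1 sample2 sample3; do
  if [ "$first" = yes ]; then
    gtf_opt="-G known_genes.gtf "      # only the first run supplies the GTF file
    first=no
  else
    gtf_opt=""                         # later runs reuse the saved index
  fi
  cmd="tophat -o out_$sample ${gtf_opt}--transcriptome-index=$index hg19 ${sample}_1.fq.z"
  echo "$cmd"
  commands="$commands$cmd
"
done
```

Once the printed commands look right, they could be run by replacing the echo line with a line that executes the command.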

To use the program first get the latest version

[cavanr@boris bin]$ wget http://tophat.cbcb.umd.edu/downloads/tophat-2.0.9.Linux_x86_64.tar.gz

Then uncompress the binaries, enter the directory that is generated and copy the executables to your path.

[cavanr@boris bin]$ tar zxvf tophat-2.0.9.Linux_x86_64.tar.gz

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ cp tophat $HOME/bin/./tophat

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ cp tophat2 $HOME/bin/./tophat2

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ cp * $HOME/bin/./

Then you should test the installation, so get some test data.

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ wget http://tophat.cbcb.umd.edu/downloads/test_data.tar.gz

Then uncompress, go into that directory and run a test

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ tar zxvf test_data.tar.gz

[cavanr@boris tophat-2.0.9.Linux_x86_64]$ cd test_data

Then test by issuing the command (the -r 20 option indicates that we think there are 20 bases between read pairs)

[cavanr@boris test_data]$ tophat -r 20 test_ref reads_1.fq reads_2.fq

and if it is working correctly this yields the following

[2013-11-11 12:57:46] Beginning TopHat run (v2.0.9)

-----------------------------------------------

[2013-11-11 12:57:46] Checking for Bowtie

Bowtie version: 2.1.0.0

[2013-11-11 12:57:46] Checking for Samtools

Samtools version: 0.1.19.0

[2013-11-11 12:57:46] Checking for Bowtie index files (genome)..

Found both Bowtie1 and Bowtie2 indexes.

...

-----------------------------------------------

[2013-11-11 12:57:47] Run complete: 00:00:01 elapsed

Cufflinks

While TopHat is useful for aligning a set of reads to the transcriptome, one must use some other program to generate transcript level summaries of gene expression and test for differences between groups.

One such program that is supported by the same group of researchers that develop TopHat is Cufflinks.

Cufflinks uses the output from TopHat and we will see that one can use functionality of the R package cummeRbund to examine the output and generate reports and publication quality graphs.

It is designed for paired end reads but can be used for unpaired reads.

Cufflinks uses an expression for the likelihood of a set of paired end reads that depends on the length of the fragments: with paired end reads one can estimate the distribution of the lengths.

Note that there is a fundamental identifiability problem with transcript assembly that is only mitigated by the presence of reads that span splice junctions.

Suppose there are 2 exons and some reads map to both exons but we are unable to identify any reads that span splice junctions.

If there are 2 exons there are 3 possible transcripts (2 that each use one of the exons and a third that uses both) so if we allow for all 3 transcripts there is an infinite collection of combinations of the 3 transcript levels that are consistent with observed read counts that hit both exons.
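The identifiability problem can be seen with toy numbers. In the sketch below (all values made up) t1 is the level of the transcript that uses only exon 1, t2 the one that uses only exon 2 and t3 the one that uses both, so the coverage of exon 1 is t1 + t3 and the coverage of exon 2 is t2 + t3: 2 equations in 3 unknowns.

```shell
# Made-up coverage attributed to each exon, with no junction-spanning reads.
exon1=10
exon2=10

# Try several values of t3; each one gives a valid (t1, t2, t3) combination.
solutions=""
t3=0
while [ "$t3" -le "$exon1" ] && [ "$t3" -le "$exon2" ]; do
  t1=$(( exon1 - t3 ))
  t2=$(( exon2 - t3 ))
  solutions="$solutions$t1,$t2,$t3 "
  t3=$(( t3 + 5 ))
done
echo "(t1,t2,t3) consistent with the counts: $solutions"
```

Only steps of 5 are printed, but any value of t3 between 0 and 10 works equally well, which is why the counts alone cannot pin down the transcript levels.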

For this reason one must enforce a constraint on the transcripts: for example, we attempt to find the minimal set of transcripts that explain the exon counts.

Cufflinks quantifies the level of expression of a transcript via the number of fragments per kilobase of exon per million fragments mapped (FPKM).

This attempts to account for the fact that longer transcripts will generate more reads than shorter transcripts due to their length.
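Taken literally, the definition gives a simple ratio, sketched below with invented counts and lengths. This is only the naive formula implied by the name; Cufflinks itself estimates expression through its likelihood model rather than by this direct division.

```shell
# FPKM = fragments / (length in kilobases) / (mapped fragments in millions).
fpkm() {   # usage: fpkm FRAGMENTS LENGTH_BP TOTAL_MAPPED_FRAGMENTS
  awk -v f="$1" -v len="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", f / (len / 1000) / (total / 1000000) }'
}
short=$(fpkm 500 1000 20000000)    # 500 fragments on a 1 kb transcript
long=$(fpkm 1000 2000 20000000)    # twice the fragments on twice the length
echo "short transcript: $short   long transcript: $long"
```

The two made-up transcripts get the same FPKM even though the longer one produced twice as many fragments, which is the length correction described above.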

Cuffdiff is a program that is bundled with Cufflinks that allows one to test for differences between experimental conditions.

The log of the ratio of FPKM values is approximately normally distributed: this allows one to define a test statistic once one determines an approximation to the standard error of this log ratio.

This standard error depends on the number of isoforms for the transcript: if there is a single isoform there is no uncertainty in the expression estimate produced by the algorithm and the only source of variability is from biological replicates.

If there are no replicates within conditions the standard error is computed by examining the variance across the 2 conditions.

If there are not many transcripts that differ in their level of expression this will provide a reasonable standard error, but if there are replicates one can obtain better estimates of the variance of the log ratio by using the negative binomial distribution.
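
The log ratio test described above can be sketched as follows. This is a generic delta-method approximation written in notation assumed here ($X_a$ and $X_b$ for the FPKM estimates in the two conditions, taken to be independent); the slides do not give the exact variance estimator that Cuffdiff uses.

```latex
% requires amsmath for \text
\[
T = \frac{\log(X_a / X_b)}{\widehat{\mathrm{SE}}},
\qquad T \sim N(0, 1) \ \text{approximately, if there is no difference in expression}
\]
% Delta method: Var[log X] \approx Var[X] / E[X]^2, so for independent X_a and X_b
\[
\mathrm{Var}\!\left[\log\frac{X_a}{X_b}\right]
  \approx \frac{\mathrm{Var}[X_a]}{\mathrm{E}[X_a]^{2}}
        + \frac{\mathrm{Var}[X_b]}{\mathrm{E}[X_b]^{2}}
\]
```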


If there is only a single isoform then an approach similar to that used in the R package DESeq is used, but if there are multiple isoforms then uncertainty in the isoform expression estimates must be taken into account: this is done using a generalization of the negative binomial distribution called the beta negative binomial distribution.

Some useful details:

While both BAM and SAM files can be used for input, the program requires that if chromosome names are present in the @SQ records in the header of the file then the reads must be sorted in the same order.
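
As a minimal sketch of what such sorting looks like, one common approach for the alignment lines of a SAM file is to sort by reference name and then by position with the standard sort utility. The toy file below (toy.sam, with made-up read names and only the first four SAM columns) is an assumption for illustration; a real SAM file has more columns and header lines that need to be handled separately.

```shell
# Build a tiny tab-separated stand-in for the alignment lines of a SAM file:
# columns are read name, flag, reference name, position (hypothetical values).
printf 'r1\t0\t2R\t500\nr2\t0\t2L\t900\nr3\t0\t2L\t100\n' > toy.sam

# Sort by reference name (column 3), then numerically by position (column 4).
TAB=$(printf '\t')
LC_ALL=C sort -t "$TAB" -k 3,3 -k 4,4n toy.sam > toy.sorted.sam

cat toy.sorted.sam
```

Note that this gives the chromosomes in lexicographic order, which only satisfies the requirement if the @SQ records are listed in that order too.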


RNA-Seq using TopHat and Cufflinks

We’ve already installed and tested the latest version of TopHat, so now let’s install Cufflinks using what is now the familiar strategy: download the tarball, unpack it and write the executables to our path by copying them to bin.

[cavanr@boris bin]$ wget http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.1.1.Linux_x86_64.tar.gz

[cavanr@boris bin]$ tar zxvf cufflinks-2.1.1.Linux_x86_64.tar.gz

[cavanr@boris bin]$ cd cufflinks-2.1.1.Linux_x86_64

[cavanr@boris cufflinks-2.1.1.Linux_x86_64]$ cp * $HOME/bin/./


Then test the installation by downloading some test data and running Cufflinks on it.

[cavanr@boris cufflinks-2.1.1.Linux_x86_64]$ wget http://cufflinks.cbcb.umd.edu/downloads/test_data.sam

[cavanr@boris cufflinks-2.1.1.Linux_x86_64]$ cufflinks test_data.sam

and you should see


You are using Cufflinks v2.1.1, which is the most recent release.
[bam_header_read] EOF marker is absent.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File test_data.sam doesn’t appear to be a valid BAM file, trying SAM...
...
[11:36:11] Assembling transcripts and estimating abundances.
> Processed 1 loci. [*************************] 100%

This indicates that the program is working properly. It generates a GTF file called transcripts.gtf which has information on the transcripts that were identified and their level of expression.


As an example of using the TopHat/Cufflinks programs we will analyze the data that was used to illustrate the protocol developed by the authors of the programs (this is from the Nature Protocols paper from March 2012).

The data is already loaded on the server and we’ve installed all of the software, so we are almost ready to go.

We also need a set of Bowtie2 indexes and if we want to use an annotated transcriptome we need a GTF file with that information.


These have already been set up in the subdirectory /export/home/courses/ph7445/data

The indexes are in the files with names that start with genome located at /export/home/courses/ph7445/data/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index

(there are 6 of these and I got them from the Illumina website).

The file Drosophila_melanogaster.BDGP5.68.gtf was obtained from ensembl.org: ftp://ftp.ensembl.org/pub/release-73/gtf/drosophila_melanogaster.


First we must use TopHat to map the reads in the fastq files to the transcriptome.

Note that due to all of the options we are going to use we need to break the command up into multiple lines: we do this by ending each line with a backslash.

[cavanr@boris RNAseq]$ tophat -p 2 -G \
/export/home/courses/ph7445/data/Drosophila_melanogaster.BDGP5.68.gtf \
-o C1_R1_thout \
/export/home/courses/ph7445/data/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index/genome \
/export/home/courses/ph7445/data/GSM794483_C1_R1_1.fq \
/export/home/courses/ph7445/data/GSM794483_C1_R1_2.fq

then you should see something like this


[2013-11-11 14:53:01] Beginning TopHat run (v2.0.9)

-----------------------------------------------

[2013-11-11 14:53:01] Checking for Bowtie

Bowtie version: 2.1.0.0

[2013-11-11 14:53:01] Checking for Samtools

Samtools version: 0.1.19.0

[2013-11-11 14:53:01] Checking for Bowtie index files (genome)..

[2013-11-11 14:53:01] Checking for reference FASTA file

[2013-11-11 14:53:01] Generating SAM header for /export/home/courses/ph7445/data/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index/genome

...


Here is an explanation of the options that have been specified

• -p specifies the number of processors to use

• -G is used to give the location of the GTF file

• -o is used to give the desired location for the output

• the next argument gives the base name of the Bowtie2 index files

• the last two arguments give the locations of the fastq files.

One then needs to issue similar commands for all of the other samples: for these you just change the name of the output directory and the names of the fastq files.


For example, to process replicate 2 of condition 1 you would issue the command

[cavanr@boris RNAseq]$ tophat -p 2 -G \
/export/home/courses/ph7445/data/Drosophila_melanogaster.BDGP5.68.gtf \
-o C1_R2_thout \
/export/home/courses/ph7445/data/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index/genome \
/export/home/courses/ph7445/data/GSM794484_C1_R2_1.fq \
/export/home/courses/ph7445/data/GSM794484_C1_R2_2.fq

Then you would do this for all 3 replicates in both conditions.
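
Rather than typing six nearly identical commands, the pattern above can be wrapped in a loop. The following is a dry-run sketch: it only prints the tophat commands so they can be checked before being run. The helper name tophat_cmd and the variables DATA, GTF and INDEX are introduced here for illustration, and the sketch assumes every sample follows the GSM..._Cx_Ry_1.fq / _2.fq naming seen in the two commands above.

```shell
# Locations used in the commands above (override DATA to point elsewhere).
DATA=${DATA:-/export/home/courses/ph7445/data}
GTF=$DATA/Drosophila_melanogaster.BDGP5.68.gtf
INDEX=$DATA/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index/genome

# Print the tophat command for one sample.
# $1 = path to the first-mate file, e.g. .../GSM794483_C1_R1_1.fq
tophat_cmd() {
    r1=$1
    r2=${r1%_1.fq}_2.fq               # matching second-mate file
    sample=$(basename "$r1" _1.fq)    # e.g. GSM794483_C1_R1
    sample=${sample#*_}               # drop the GSM prefix: C1_R1
    echo tophat -p 2 -G "$GTF" -o "${sample}_thout" "$INDEX" "$r1" "$r2"
}

# One command per first-mate file found in the data directory.
for fq in "$DATA"/*_1.fq; do
    [ -e "$fq" ] || continue          # skip if the glob matched nothing
    tophat_cmd "$fq"
done
```

Once the printed commands look right, removing the echo inside tophat_cmd would run them for real.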


You will notice that the program first looks for a FASTA file holding the genome of D. melanogaster and if it can’t find it, it reconstitutes the file from the Bowtie index (here we have obtained this along with the Bowtie indices).

As it can do this very quickly this isn’t much of a concern, but we will need it later on and it is easy to generate: just issue this command.

[cavanr@boris RNAseq]$ bowtie2-inspect \
/export/home/courses/ph7445/data/Drosophila_melanogaster/Ensembl/BDGP5/Sequence/Bowtie2Index/genome \
> d_melanogaster.fa

You should then reference this directory when you specify where your index files are located.


Then we process all of the resulting BAM files using cufflinks as follows.

[cavanr@boris RNAseq]$ cufflinks -p 2 -o C1_R1_clout C1_R1_thout/accepted_hits.bam

Again -p controls the number of processors and -o gives the output directory.

The next step is to run cuffmerge to merge all of the transcriptome data into a common set of transcripts.

To do this you need to create a text file that has the locations of all of the GTF files that were created running cufflinks.


Given our conventions for naming the output files from the cufflinks runs this file just has the contents

./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C1_R3_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
./C2_R3_clout/transcripts.gtf

Let’s call that file assemblies.txt; then one runs cuffmerge as follows.


[cavanr@boris RNAseq]$ cuffmerge -p 2 \
-g /export/home/courses/ph7445/data/Drosophila_melanogaster.BDGP5.68.gtf \
-s d_melanogaster.fa \
assemblies.txt

This command generates a directory called merged_asm that contains several GTF files and files that hold the FPKM estimates.
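
The file assemblies.txt that cuffmerge reads can be typed by hand, but it can also be generated with a short loop. This is a small sketch that assumes the Cx_Ry_clout naming convention for the cufflinks output directories used above.

```shell
# Write one transcripts.gtf path per sample, one per line.
: > assemblies.txt                    # create (or empty) the file
for cond in C1 C2; do
    for rep in R1 R2 R3; do
        echo "./${cond}_${rep}_clout/transcripts.gtf" >> assemblies.txt
    done
done

cat assemblies.txt
```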


Finally you issue the cuffdiff command to test for differential expression

[cavanr@boris RNAseq]$ cuffdiff -o diff_out \
-b d_melanogaster.fa -p 2 -L \
C1,C2 merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,\
C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam \
C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,\
C2_R3_thout/accepted_hits.bam


As this demonstrates, we need to instruct the program where the FASTA file is, the output directory and the number of processors as before, but now we must also include information on

• the labels to use to denote the conditions (-L C1,C2)

• the location of the merged GTF file

• the locations of all of the BAM files generated by TopHat
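
Note that in the cuffdiff command the BAM files for the replicates of one condition are joined by commas, while the two conditions are separated by a space. A small sketch of how those comma-separated lists could be built is given below; the helper name bam_list is hypothetical, and the final line only prints the resulting command (a dry run) rather than running cuffdiff.

```shell
# Print the comma-separated list of TopHat BAM files for one condition.
# $1 = condition label (C1 or C2); assumes replicates R1, R2 and R3.
bam_list() {
    list=
    for rep in R1 R2 R3; do
        # append with a leading comma only when the list is non-empty
        list="${list:+$list,}${1}_${rep}_thout/accepted_hits.bam"
    done
    echo "$list"
}

# Dry run: show the command that would be issued.
echo cuffdiff -o diff_out -b d_melanogaster.fa -p 2 -L C1,C2 \
    merged_asm/merged.gtf "$(bam_list C1)" "$(bam_list C2)"
```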


Then this will generate the following output

You are using Cufflinks v2.1.1, which is the most recent release.

[17:17:55] Loading reference annotation and sequence.

Warning: couldn’t find fasta record for ’2LHet’!

This contig will not be bias corrected.

Warning: couldn’t find fasta record for ’2RHet’!

This contig will not be bias corrected.

...

[17:17:31] Testing for differential expression and regulation in locus.

> Processed 11618 loci. [*************************] 100%

Performed 13989 isoform-level transcription difference tests

Performed 10004 tss-level transcription difference tests

Performed 8008 gene-level transcription difference tests

Performed 10922 CDS-level transcription difference tests

Performed 2612 splicing tests

Performed 1599 promoter preference tests

Performing 1865 relative CDS output tests

...

Writing CDS-level read group tracking

Writing read group info

Writing run info


RNA-Seq using cummeRbund

We can then examine the output using the R package cummeRbund, so start up R by typing R at the linux command prompt.

When in R load the library and read in the output from cufflinks

> library(cummeRbund)
> cuff_data <- readCufflinks('diff_out')
Creating database diff_out/cuffData.db
Reading diff_out/genes.fpkm_tracking
Checking samples table...
Populating samples table...
Writing genes table
Reshaping geneData table
Recasting
Writing geneData table
...


Then we can examine the object to see what sorts of tests we can examine.

> cuff_data

CuffData instance with:

2 samples

14701 genes

27993 isoforms

18219 TSS

19291 CDS

14701 promoters

18219 splicing

13365 relCDS


And we can test for differences in gene expression using straightforward commands as follows.

> gene_diff <- diffData(genes(cuff_data))
> sig_gene_diff <- subset(gene_diff,
+ significant=='yes')
> nrow(sig_gene_diff)
[1] 269


We can make some exploratory plots as follows

> pdf("explorPlt1.pdf")
> csDensity(genes(cuff_data))
> dev.off()
> pdf("explorPlt2.pdf")
> csScatter(genes(cuff_data), 'C1', 'C2')
> dev.off()
> pdf("explorPlt3.pdf")
> csVolcano(genes(cuff_data), 'C1', 'C2')
> dev.off()
> mygene <- getGene(cuff_data, 'regucalcin')
> pdf("explorPlt4.pdf")
> expressionBarplot(mygene)
> dev.off()
> pdf("explorPlt5.pdf")
> expressionBarplot(isoforms(mygene))
> dev.off()

[Figure: density plot (csDensity) of log10(fpkm) for genes, one curve per sample_name (C1, C2).]

[Figure: scatter plot (csScatter) of gene expression, C1 on the x-axis against C2 on the y-axis, both axes ticked at 10 and 1000.]

[Figure: volcano plot (csVolcano) for genes, C2/C1: log2(fold change) on the x-axis (−20 to 20) against −log10(p value) on the y-axis (0 to 4), with points marked by significance (no/yes).]

[Figure: bar plot (expressionBarplot) of FPKM by sample_name (C1, C2) for the gene regucalcin (XLOC_012969); y-axis from 0 to 3000.]

[Figure: bar plots (expressionBarplot) of FPKM by sample_name (C1, C2) for the four regucalcin isoforms TCONS_00024643, TCONS_00024644, TCONS_00024645 and TCONS_00024646; y-axes from 0 to 2000.]


But one can go beyond just testing for differences in gene expression. We can also do tests for differences in isoform, TSS, and CDS usage as follows.

> isoform_diff <- diffData(isoforms(cuff_data), 'C1', 'C2')
> sig_isoform_diff <- subset(isoform_diff,
+ significant=='yes')
> nrow(sig_isoform_diff)
[1] 234
> tss_diff <- diffData(TSS(cuff_data), 'C1', 'C2')
> sig_tss_diff <- subset(tss_diff, significant=='yes')
> nrow(sig_tss_diff)
[1] 271
> cds_diff <- diffData(CDS(cuff_data), 'C1', 'C2')
> sig_cds_diff <- subset(cds_diff, significant=='yes')
> nrow(sig_cds_diff)
[1] 271


Note that the values returned by diffData are data.frames so we can examine and manipulate individual elements for further investigation.

> colnames(tss_diff)
[1] "TSS_group_id" "TSS_group_id" "sample_1" "sample_2"
[5] "status" "value_1" "value_2" "log2_fold_change"
[9] "test_stat" "p_value" "q_value" "significant"

so we can just do

> table(tss_diff[,12])

   no   yes
17948   271

Promoters, splicing and relative CDS are treated a little differently: one uses the distValues function as follows (this also returns a data frame so we can manipulate the results).

> promoter_diff <- distValues(promoters(cuff_data))
> sig_promoter_diff <- subset(promoter_diff,
+ significant=='yes')
> nrow(sig_promoter_diff)
[1] 132
> splicing_diff <- distValues(splicing(cuff_data))
> sig_splicing_diff <- subset(splicing_diff,
+ significant=='yes')
> nrow(sig_splicing_diff)
[1] 125
> relCDS_diff <- distValues(relCDS(cuff_data))
> sig_relCDS_diff <- subset(relCDS_diff,
+ significant=='yes')
> nrow(sig_relCDS_diff)
[1] 133


Negative binomial distribution

The Poisson distribution is commonly used as a probability model for data that arises as counts (i.e. the number of occurrences of some event).

For this reason, early RNA-Seq papers used this as a model for the number of reads mapping to some genomic feature (e.g. an exon).

However one feature of the Poisson distribution is that the mean is equal to the variance and frequently the observed variance exceeds the mean: we call this overdispersion.

If one assumes that the means of the count data follow a gamma distribution (which is a particular probability distribution) then if you average over the gamma distribution the distribution of the counts follows a negative binomial distribution.
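
The averaging step can be written out explicitly. The shape $\alpha$ and rate $\beta$ of the gamma distribution are notation assumed here, not symbols from the slides.

```latex
% Y | \lambda ~ Poisson(\lambda),  \lambda ~ Gamma(shape \alpha, rate \beta)
\begin{align*}
P(Y = y)
  &= \int_0^\infty \frac{\lambda^{y} e^{-\lambda}}{y!}\,
     \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda \\
  &= \frac{\beta^{\alpha}}{y!\,\Gamma(\alpha)}\,
     \frac{\Gamma(y+\alpha)}{(1+\beta)^{y+\alpha}} \\
  &= \frac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)}
     \left(\frac{\beta}{1+\beta}\right)^{\alpha}
     \left(\frac{1}{1+\beta}\right)^{y},
  \qquad y = 0, 1, 2, \dots
\end{align*}
```

The last line is the negative binomial probability function with size $\alpha$ and success probability $\beta/(1+\beta)$.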


We can illustrate this in R with simulation, starting with counts from a single Poisson distribution:

> c1 <- rpois(100, 2)
> table(c1)
c1
 0  1  2  3  4  5  9
12 26 22 29  9  1  1

and if we look at the mean and variance:

> mean(c1)
[1] 2.07
> var(c1)
[1] 1.984949


Now simulate data for 100 exons so that the first 50 have mean 2 and the last 50 have mean 6 and examine

> exCnt <- numeric(100)
> for(i in 1:50) exCnt[i]=rpois(1, 2)
> for(i in 51:100) exCnt[i]=rpois(1, 6)

then the mean and variance are still about equal if you look within exon groups with the same parameters

> mean(exCnt[1:50])
[1] 2.3
> var(exCnt[1:50])
[1] 1.846939
> mean(exCnt[51:100])
[1] 6.14
> var(exCnt[51:100])
[1] 7.265714


Count based methods

but if we pool all 100 exons

> mean(exCnt)
[1] 4.22
> var(exCnt)
[1] 8.23394

the variance is larger than the mean.
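The excess variance in the pooled counts follows from the law of total variance. As a rough check, treat the exon mean $\lambda$ as taking the values 2 and 6 with probability 1/2 each (an idealization of the 50/50 split in the simulation, assumed here for illustration):

```latex
\begin{align*}
E[Y] &= E[\lambda] = \tfrac{1}{2}(2) + \tfrac{1}{2}(6) = 4, \\
\mathrm{Var}(Y) &= E[\mathrm{Var}(Y \mid \lambda)] + \mathrm{Var}(E[Y \mid \lambda])
                 = E[\lambda] + \mathrm{Var}(\lambda) = 4 + 4 = 8 .
\end{align*}
```

These values are close to the simulated mean of 4.22 and variance of 8.23.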


RNAseq data analysis

Now, rather than just using 2 distinct values for the means, suppose that they are generated from a gamma distribution, then use these simulated means to simulate data using the Poisson distribution

> mSim <- rgamma(100,5,3)
> exCnt <- rpois(100, mSim)
> mean(exCnt)
[1] 1.87
> var(exCnt)
[1] 2.599091

So here, the distribution of exCnt is negative binomial.
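A larger simulation makes the negative binomial claim easier to check. The sketch below is an illustration written for these notes: it assumes the same gamma shape (5) and rate (3) as above, uses an arbitrary seed and sample size, and compares the gamma-Poisson mixture with direct draws from rnbinom. The variable names are hypothetical.

```r
# Sketch: compare a gamma-Poisson mixture with direct negative binomial draws.
set.seed(1)                       # arbitrary seed, for reproducibility only
n     <- 1e5                      # many more draws than on the slide
shape <- 5                        # gamma shape, as in rgamma(100, 5, 3)
rate  <- 3                        # gamma rate, as in rgamma(100, 5, 3)

mSim   <- rgamma(n, shape = shape, rate = rate)        # simulated means
mixCnt <- rpois(n, mSim)                               # Poisson counts given those means
nbCnt  <- rnbinom(n, size = shape, mu = shape / rate)  # direct negative binomial draws

mu      <- shape / rate           # theoretical mean
theoVar <- mu + mu^2 / shape      # theoretical negative binomial variance

round(c(theory = mu, mixture = mean(mixCnt), negbin = mean(nbCnt)), 2)
round(c(theory = theoVar, mixture = var(mixCnt), negbin = var(nbCnt)), 2)
```

Both simulated samples should have a mean near 5/3 and a variance near 20/9, so the variance exceeds the mean in each case.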


RNAseq data analysis

There are 2 popular packages (called edgeR and DESeq) that assume that one has data in the form of counts of the number of times a read mapped to a genomic feature (e.g. a gene or a transcript).

The counts are then modeled as being independently distributed according to the negative binomial distribution.

The negative binomial distribution has 2 parameters: one determines the mean and the other, called a dispersion, determines how much bigger the variance is than the mean.

Both packages first estimate the size of the library (in slightly different ways) then use a regularized estimate of the dispersion (like variance estimation in microarray analysis).
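The role of the dispersion can be illustrated with base R. The sketch below assumes the parameterization in which the variance equals mu + phi * mu^2; the symbol phi, the seed and the numeric values are chosen here for illustration. In rnbinom this corresponds to size = 1/phi.

```r
# Sketch: how a dispersion parameter inflates the variance above the mean.
set.seed(2)                        # arbitrary seed
mu  <- 100                         # illustrative mean count
phi <- 0.2                         # illustrative dispersion

y <- rnbinom(1e5, mu = mu, size = 1 / phi)

expectedVar <- mu + phi * mu^2     # 100 + 0.2 * 100^2 = 2100
c(mean = mean(y), var = var(y), expectedVar = expectedVar)
```

With phi = 0 the variance would reduce to the Poisson value mu, so the dispersion measures the departure from the Poisson model.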


RNAseq data analysis

The differences are:

1. the packages use different methods for estimating the library sizes (i.e. the sequencing depth in the different samples)

2. the DESeq package assumes that the dispersions are related to the means and uses this relationship to estimate the feature level dispersions.

3. the DESeq package uses a computationally faster and simpler method for estimating the dispersion parameters (but relies on an approximation that the sum of 2 negative binomial random variables is itself distributed according to the negative binomial distribution)

4. the DESeq package provides a set of graphics functions for model diagnostics
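To make the first difference concrete, here is a minimal base R sketch of the median-of-ratios idea that is generally associated with DESeq's size factors: each sample is compared with a per-feature geometric mean and the median ratio is taken. This is an illustration written for these notes, not code from either package; the function name medianOfRatios and the toy counts are hypothetical.

```r
# Sketch of median-of-ratios size factors on a toy count matrix.
medianOfRatios <- function(counts) {
  logGeoMeans <- rowMeans(log(counts))           # log geometric mean per feature
  apply(counts, 2, function(k) {
    keep <- is.finite(logGeoMeans) & k > 0       # skip features with zero counts
    exp(median(log(k[keep]) - logGeoMeans[keep]))
  })
}

# Toy data: s2 was sequenced twice as deeply as s1, s3 half as deeply.
toy <- cbind(s1 = c(10, 20, 30, 40),
             s2 = c(20, 40, 60, 80),
             s3 = c( 5, 10, 15, 20))
sf <- medianOfRatios(toy)
sf
```

For this toy matrix the size factors are 1, 2 and 0.5, so dividing each column by its size factor puts the three samples on a common scale.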


RNAseq data analysis

The authors of the DESeq package claim that the second property is what makes their method superior because it allows their algorithm to have equal power over the entire range of dispersions.

According to them, not modeling this relationship makes edgeR conservative for highly expressed genes and anticonservative for genes expressed at low levels.

A mathematical proof of this claim isn't presented, but this is seen when analyzing data and in simulations.


RNAseq data analysis

We can also get data that has been aligned and for which feature counts are available from GEO.

As an example, we will use data from the experiment in which 2 groups of cows were under different levels of stress and liver biopsies were obtained near the time of calving (we saw this when discussing quality assessment).

The table of counts is on the server, so we can just read in the data.

> bovCnts <- read.table("/export/home/courses/ph7445/data/bovineCounts.txt")
> dim(bovCnts)
[1] 21996 11

Now set up an indicator for group membership.

> grp <- factor(c(rep(1,5),rep(2,6)))


RNAseq data analysis

We will next apply a filter to the data and exclude transcripts where the minimum count across all subjects is less than 5.

> f1 <- apply(bovCnts,1,min)
> sum(f1>4)
[1] 10990
> bovCntsF <- bovCnts[f1 > 4 ,]
> dim(bovCntsF)
[1] 10990 11

Now we will use the edgeR package to test for differential expression.


RNAseq with edgeR

After installing and loading the edgeR package from Bioconductor we first initialize our DGE list, then estimate the common (mean) dispersion parameter and then use this to get the individual (tagwise) dispersions.

> delist <- DGEList(counts=bovCntsF, group=grp)
Calculating library sizes from column totals.
> delist <- estimateCommonDisp(delist)
> delist <- estimateTagwiseDisp(delist)

Then we conduct the exact test that is similar to Fisher's exact test.

> et <- exactTest(delist)


RNAseq data analysis

Then we examine the result

> head(et$table)
        logFC   logCPM      PValue
2  0.03015613 5.031810 0.787282098
4  0.76565354 2.868241 0.034361814
5  0.29303532 6.598597 0.063289782
6  0.45430051 8.273396 0.003495893
7 -0.30647946 3.695769 0.333008823
8 -0.12051436 3.505939 0.819749119


RNAseq data analysis

Then we use the Benjamini-Hochberg method to correct for multiple hypothesis testing.

> padj <- p.adjust(et$table[,3], method="BH")
> sum(padj<0.05)
[1] 164

So we are detecting 164 differences in transcript levels while controlling the FDR at 5%.
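The adjustment itself is short enough to write out by hand. The sketch below is an illustration on made-up p-values (the helper name bhAdjust and the vector pToy are hypothetical): each sorted p-value is multiplied by n/rank, monotonicity is enforced from the largest p-value downwards, and the result is compared with p.adjust.

```r
# Sketch: Benjamini-Hochberg adjusted p-values computed by hand.
bhAdjust <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)    # largest p-value first
  adj <- cummin(p[o] * n / (n:1))     # scale by n/rank, then enforce monotonicity
  pmin(1, adj)[order(o)]              # cap at 1 and restore the original order
}

pToy <- c(0.001, 0.04, 0.03, 0.2, 0.5)   # made-up p-values
cbind(manual = bhAdjust(pToy), builtin = p.adjust(pToy, method = "BH"))
```

The two columns should agree, and counting the adjusted values below 0.05 gives the number of discoveries at a 5% FDR.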

For this example there doesn't appear to be any connection between the estimated level of gene expression (in terms of the average log2 of the counts per million) and the p-values.

> cor(et$table[,2], et$table[,3])
[1] -0.02069573


RNAseq data analysis

We can also use the DESeq package in a similar fashion: initialize the count data set, estimate library sizes, estimate dispersions and then conduct a test.

> cds <- newCountDataSet(bovCntsF, grp)
> cds <- estimateSizeFactors(cds)
> cds <- estimateDispersions(cds)
> res <- nbinomTest(cds, "1", "2")


RNAseq data analysis

Next we examine res

> head(res)
  id  baseMean baseMeanA  baseMeanB foldChange log2FoldChange       pval
1  2 104.14912 102.96024  105.13985  1.0211695     0.03022234 0.92734232
2  4  22.14239  16.58914   26.77009  1.6137116     0.69038273 0.08921768
3  5 305.48155 271.18398  334.06285  1.2318679     0.30084759 0.19480965
4  6 976.65013 828.93664 1099.74470  1.3266933     0.40783491 0.02163098
5  7  41.14755  46.32049   36.83677  0.7952586    -0.33050401 0.35883953
6  8  36.31335  37.45815   35.35935  0.9439694    -0.08318802 0.69298641
       padj
1 1.0000000
2 1.0000000
3 1.0000000
4 0.8406186
5 1.0000000
6 1.0000000


RNAseq data analysis

Then we can examine the adjusted p-values (these use the Benjamini-Hochberg adjustment) and see that there are fewer differences when we use this method compared to edgeR.

> sum(res[,8]<.05)
[1] 42

In fact, every transcript identified by DESeq was also detected to differ using edgeR.

> table(res[,8]<.05, padj<0.05)

        FALSE  TRUE
  FALSE 10826   122
  TRUE      0    42

We can also make a plot to investigate how the dispersions depend on the mean (which is the crucial difference between DESeq and edgeR).


RNAseq data analysis

To this end create the following function,

plotDispEst <- function(cds){
  plot(rowMeans(counts(cds, normalized=TRUE)),
       fitInfo(cds)$perGeneDispEsts, pch='.', log="xy")
  xg <- 10^seq(-.5, 5, length.out=300)
  lines(xg, fitInfo(cds)$dispFun(xg), col="red")
}


RNAseq data analysis

and use it as follows

> pdf("DESeqPlot.pdf")
> plotDispEst(cds)
Warning message:
In xy.coords(x, y, xlabel, ylabel, log) :
  474 y values <= 0 omitted from logarithmic plot
> dev.off()

Then we see that for this data set there is not a very strong relationship between the means and the estimated dispersions.

[Figure: per-gene dispersion estimates, fitInfo(cds)$perGeneDispEsts (vertical axis, log scale from 1e-04 to 1e+00), plotted against rowMeans(counts(cds, normalized = TRUE)) (horizontal axis, log scale from 10 to 5000).]


RNAseq data analysis

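We can also examine the association between the estimated fold changes produced by the 2 methods (they are highly correlated).

> cor(2^et$table[,1], res[,5])
[1] 0.9621653
> library(ggplot2)   # provides qplot
> # plotting variables inferred from the axis labels of the figure
> edgeRFC <- 2^et$table[,1]
> DESeqFC <- res[,5]
> edgeRp.value <- et$table[,3]
> pdf("edgeRvsDESeq.pdf")
> qplot(edgeRFC, DESeqFC, colour= -log(edgeRp.value,10))
> dev.off()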

[Figure: scatter plot of the DESeq fold changes (DESeqFC, vertical axis, 0 to 5) against the edgeR fold changes (edgeRFC, horizontal axis, 0 to 4), with points coloured by -log(edgeRp.value, 10) on a scale from 0 to 20.]


DNA-Seq data analysis

There are a number of approaches to the analysis of DNA-Seq data.

Again we will only consider approaches that have been developed for organisms with sequenced genomes.

The basic problem all of these methods must try to solve is when to flag a variant based on disagreements between an individual's pileup and the reference genome.

We will cover the functionality available for this in SAMtools, an R package called VPA and mention some aspects of the genome analysis toolkit.


BCFtools

Now we can use bcftools to flag potential variants. To do so issue the following command:

[cavanr@boris samtools-0.1.18]$ samtools mpileup -uf examples/ex1.fa ex1.bam | bcftools/bcftools view -bvcg - > var.raw.bcf

then move the bcf file to the bcftools directory

[cavanr@boris samtools-0.1.18]$ mv var.raw.bcf bcftools/.

and then go into that directory and issue the command

[cavanr@boris bcftools]$ ./bcftools view var.raw.bcf | ./vcfutils.pl varFilter -D100 > var.flt.vcf

(here the option -D100 sets a maximum on the read depth for calling a SNP).


BCFtools

and then examine the resulting output file (I formatted it a little to make it easier to read)

[cavanr@boris bcftools]$ more var.flt.vcf
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse,
    alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering
    reads">
...
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ex1.bam
seq1 288 . A ACATAG 217 . INDEL;DP=26;VDB=0.1018;AF1=0.5;AC1=1;DP4=6,4,8,4;MQ=60;
    FQ=217;PV4=1,0.42,1,0.12 GT:PL:GQ 0/1:255,0,255:99
seq1 548 . C A 133 . DP=39;VDB=0.0921;AF1=0.5;AC1=1;DP4=11,8,4,13;MQ=60;
    FQ=136;PV4=0.049,0.085,1,1 GT:PL:GQ 0/1:163,0,192:99
seq1 1294 . A G 145 . DP=42;VDB=0.1155;AF1=0.5;AC1=1;DP4=13,6,7,11;MQ=58;
    FQ=146;PV4=0.1,0.13,1,0.28 GT:PL:GQ 0/1:175,0,178:99
seq2 156 . AA AAGA 71.5 . INDEL;DP=11;VDB=0.0863;AF1=1;AC1=2;DP4=0,0,7,0;MQ=60;
    FQ=-55.5 GT:PL:GQ 1/1:112,21,0:39
seq2 505 . A G 158 . DP=47;VDB=0.1087;AF1=0.5;AC1=1;DP4=13,11,9,14;MQ=60;
    FQ=161;PV4=0.39,0.29,0.16,1 GT:PL:GQ 0/1:188,0,198:99
seq2 784 . CAATT CAATTAATT 214 . INDEL;DP=49;VDB=0.1155;AF1=1;AC1=2;DP4=0,0,18,18;MQ=57;
    FQ=-143 GT:PL:GQ 1/1:255,108,0:99
seq2 1344 . A C 106 . DP=30;VDB=0.1087;AF1=0.5;AC1=1;DP4=5,9,3,9;MQ=59;
    FQ=109;PV4=0.68,0.09,0.06,1 GT:PL:GQ 0/1:136,0,172:99


BCFtools

You can see that there are 2 sequences, and sequence 1 has 3 variants while sequence 2 has 4 variants. Moreover, they are all high quality calls as the phred scores are quite high (e.g. 217 means the probability of an error at that call is 1.995e-22).
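The conversion from a phred score Q to an error probability is 10^(-Q/10). A one-line R helper makes this easy to check; the function name phredToProb is hypothetical and is introduced here only for illustration.

```r
# Sketch: convert phred-scaled quality scores to error probabilities.
phredToProb <- function(q) 10^(-q / 10)

# A commonly used threshold (20), and the lowest (71.5) and highest (217)
# QUAL values among the calls listed above.
phredToProb(c(20, 71.5, 217))
```

A score of 20 corresponds to a 1 in 100 chance of error, and 217 reproduces the value of about 1.995e-22 quoted above.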

and you can generate a consensus sequence via the following

[cavanr@boris samtools-0.1.18]$ samtools mpileup -uf examples/ex1.fa ex1.bam | bcftools/bcftools view -cg - | bcftools/vcfutils.pl vcf2fq > cns.fq

Further details can be found at

http://samtools.sourceforge.net/mpileup.shtml


Using VPA

There is an R package called VPA that is designed to allow reading of vcf files.

Recall that BCFtools can generate vcf files from aligned reads (BAM files).

VPA relies on a Linux command line tool called tabix, so one needs to download that and uncompress it as with Bowtie, SAMtools, etc.

First install the package (note: it is not part of Bioconductor).

> install.packages("VPA",repos=c("http://R-Forge.R-project.org",
+   "http://cran.r-project.org"))


Using VPA

Now we will use the example data that ships with the package. So find the data and display information about the files.

> dirpath=system.file("extdata",package="VPA")
> read.table(file.path(dirpath,"index1.txt"),sep="\t")
          V1 V2                 V3                V4
1 1151HZ0001  A 1151HZ0001.flt.vcf 1151HZ0001.vcf.gz
2 1151HZ0002  A 1151HZ0002.flt.vcf 1151HZ0002.vcf.gz
3 1151HZ0004  A 1151HZ0004.flt.vcf 1151HZ0004.vcf.gz
4 1151HZ0005  A 1151HZ0005.flt.vcf 1151HZ0005.vcf.gz
5 1151HZ0006  B 1151HZ0006.flt.vcf 1151HZ0006.vcf.gz
6 1151HZ0007  B 1151HZ0007.flt.vcf 1151HZ0007.vcf.gz

The file index1.txt contains information about the subjects, such as what group each belongs to (in column V2) and the location of the vcf files relative to the dirpath path.


Using VPA

We can load the data and perform position level quality filtering with a command like the following.

> varflt <- LoadFiltering(file.path(dirpath,"index1.txt"),
+   datadir=dirpath, filtering=TRUE, alter.PL=20,
+   alter.AD=3, alter.ADP=NULL, QUAL=20, DP=c(10,500),
+   GQ=20,
+   tabix="/home/cavan/Documents/NGS/tabix-0.2.5/tabix",
+   parallel=FALSE)

The first argument is a formatted file that gives information about the subjects, the second indicates where to look for the data, the filtering argument indicates that we want quality filtering to take place and the others are as follows:


Using VPA

- alter.PL: phred scaled threshold for the genotype calls

- alter.AD: the minimum depth for calling a variant allele

- alter.ADP: minimum percentage of read depth containing a variant allele

- QUAL: phred scaled threshold for a variant call

- DP: gives the range for depth of coverage

- GQ: phred scaled score for the most likely genotype

- tabix: the location of an executable version of tabix

- parallel: should the function use the snowfall package to do the computation in parallel? If TRUE then one needs to specify pn, which is the number of processors desired.


Using VPA

Then we can take a look at the object

> varflt
$vcflist
$vcflist$`1151HZ0001`
VCF data
Calls: 49 positions, 1 sample(s)
Names: HEAD CHROM POS ID REF ALT
...
$vcflist$`1151HZ0002`
VCF data

We can use this package to look for patterns of variants.

For this example there are 4 cases and 2 controls so we will prioritize variants that occur in at least 1 case but no controls.


Using VPA

To do this, we specify the minimum and maximum frequency for variants in the case group (sample A) and the control group (sample B) as follows.

> pattern <- cbind(A=c(1/4,1), B=c(0,0))
> varpat1 <- Patterning(varflt, pattern,
+   paired=FALSE, not.covered=NULL, var.PL=c(FALSE,
+   TRUE))

We can then inspect this:

Using VPA

> varpat1
$VarVCF
$VarVCF$`1151HZ0001`
VCF data
Calls: 12 positions, 1 sample(s)
Names: HEAD CHROM POS ID REF ALT
...
$VarVCF$`1151HZ0002`
VCF data
Calls: 11 positions, 1 sample(s)
Names: HEAD CHROM POS ID REF ALT
...
$VarVCF$`1151HZ0004`
VCF data
Calls: 8 positions, 1 sample(s)
Names: HEAD CHROM POS ID REF ALT
...
...


Using VPA

Then you can supply this to the vcfreq function to test for differences; it also shows the reference sequence and the variants present for each subject.

> vfreq1 <- vcfreq(varpat1, method="fisher.test", p=1)
> vfreq1
         REF 1151HZ0001 1151HZ0002 1151HZ0004 1151HZ0005 1151HZ0006 1151HZ0007
1:949053 "C" "C/G"      "C/G"      "."        "C/G"      "."        "."
1:970836 "G" "G/A"      "."        "G/A"      "."        "."        "."
1:985239 "C" "C/T"      "."        "C/T"      "."        "."        "."
1:985900 "C" "C/T"      "."        "C/T"      "."        "."        "."
1:985999 "G" "G/T"      "."        "G/T"      "."        "."        "."
...
         A       B   p.value
1:949053 "0.375" "0" "0.490909090909091"
1:970836 "0.25"  "0" "0.515151515151515"
1:985239 "0.25"  "0" "0.515151515151515"
1:985900 "0.25"  "0" "0.515151515151515"
1:985999 "0.25"  "0" "0.515151515151515"


Using VPA

So the first variant identified was found in 3 of the 8 possible allele copies in the cases (4 cases with 2 copies each) and in none of the 4 copies in the controls: so as a check, here is Fisher's exact test:

> fisher.test(matrix(c(3,5,0,4), 2, 2))

        Fisher's Exact Test for Count Data

data:  matrix(c(3, 5, 0, 4), 2, 2)
p-value = 0.4909
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.2035030       Inf
sample estimates:
odds ratio
       Inf
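The p-value can also be reproduced from the hypergeometric distribution, which is what Fisher's exact test conditions on. The sketch below is an illustration for this 2 by 2 table only; the variable names are hypothetical. It sums the probabilities of all tables that are no more likely than the observed one.

```r
# Sketch: two-sided Fisher p-value for the table above, by hand.
tab <- matrix(c(3, 5, 0, 4), 2, 2)    # rows: variant / reference; columns: cases / controls

nVariant  <- sum(tab[1, ])            # 3 variant allele copies in total
nCases    <- sum(tab[, 1])            # 8 allele copies in cases
nControls <- sum(tab[, 2])            # 4 allele copies in controls

x     <- 0:nVariant                   # possible numbers of variant copies among cases
probs <- dhyper(x, m = nCases, n = nControls, k = nVariant)
pObs  <- probs[x == tab[1, 1]]        # probability of the observed table

# Sum over tables at most as probable as the observed one (small tolerance for ties).
pTwoSided <- sum(probs[probs <= pObs * (1 + 1e-7)])
pTwoSided
```

The result is 108/220, which rounds to the 0.4909 reported by fisher.test and by vcfreq.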


Using VPA

We can also look up information about the positions to see if the variants are in a gene and determine which gene if that is the case.

> Pos2Gene("1", "949053", level="gene", ref="hg19")
trying URL 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz'
Content type 'application/x-gzip' length 4080852 bytes (3.9 Mb)
opened URL
==================================================
downloaded 3.9 Mb

trying URL 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.sql'
Content type 'text/plain' length 1682 bytes
opened URL
==================================================
downloaded 1682 bytes

[1] "gene" "ISG15"

So the variant at position 949053 is in the gene ISG15 according to the hg19 UCSC genome (which corresponds to the genome reference consortium GRCh37, the latest version at the time of this lecture).


GATK

For SNP identification the best practice currently seems to be based on using the genome analysis toolkit (GATK).

This toolkit consists of a collection of Java programs that implement a multistep analysis.

There are 3 phases to the analysis process.


GATK

In the first phase, raw reads are mapped to generate SAM files and an accurate base error model is used to improve on the vendor's base quality calls (which are not very accurate).

In phase 2, the SAM/BAM files are analyzed to discover SNPs, short indels and copy number variants.

Finally, in phase 3 additional information, such as pedigree structure, known sites of variation and linkage disequilibrium, is brought to bear on the detected variants.

Genome STRiP and copy number variation

There is substantial evidence from multiple sources that there is extensive variation in terms of deletions and insertions.

Such deletions and insertions imply that certain individuals are missing (or have extra copies of) certain portions of the genome.

This leads to copy number variation (CNV).

With next generation sequencing we can detect this from paired-end reads that span too large a genomic segment and from local variation in read depth.
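
As a rough illustration of the read-depth signal mentioned above, the following sketch simulates read counts in consecutive genomic windows and flags windows whose depth is far from the median. The simulated data, the position of the deletion and the 0.7/1.3 thresholds are all hypothetical choices made for illustration; this is not how any particular CNV caller works.

```r
set.seed(1)
# Simulated read counts in 100 consecutive genomic windows (hypothetical data).
depth <- rpois(100, lambda = 40)
# Mimic a heterozygous deletion: windows 41-50 get about half the usual depth.
depth[41:50] <- rpois(10, lambda = 20)

# Depth of each window relative to the median window.
ratio <- depth / median(depth)

# Illustrative thresholds (assumptions): < 0.7 suggests a loss, > 1.3 a gain.
status <- ifelse(ratio < 0.7, "loss", ifelse(ratio > 1.3, "gain", "normal"))
table(status)
which(status == "loss")
```

In practice the depth would also need to be normalized for factors such as GC content and mappability before windows are compared.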

Genome STRiP and copy number variation

The GATK has a workflow for detecting copy number variations: Genome STRiP (for genome structure in populations).

To use this set of tools you need a FASTA file containing the reference sequence used for alignment and the BAM files that are the result of the alignment.

The FASTA file must be indexed using the faidx tool in SAMtools.

Genome STRiP and copy number variation

You also need to install the alignment program BWA and a set of programs called Picard.

The result is a file in the VCF format that has variant calls for each subject.

You also need data from at least 20 to 30 subjects; however, if your sample size is too small you can use data from the 1000 Genomes project as your background population.

Microbiomics

One application of DNA-Seq technology is to determine the proportion of microbes in a sample.

We can do this as it has been discovered that all bacterial species have a component of the ribosome (the 16S ribosomal RNA) that is similar across species but differs by a sufficient amount at certain hypervariable locations to allow identification of at least the genus of a bacterial species.

We note that there are array based tools that allow one to do this too, e.g. the human oral microbe identification microarray (HOMIM).

For example, we may want to determine what bacterial species are present in some sample, such as someone’s mouth.

Microbiomics

The primary limitation of this method is that some common species have very similar sequences. For example, here is part of an alignment of 2 Streptococcal species (S. anginosus and S. intermedius) that shows 97% sequence similarity:

Length=1558

Score = 2571 bits (2850), Expect = 0.0
Identities = 1502/1555 (97%), Gaps = 0/1555 (0%)
Strand=Plus/Plus

Query  19  TTTGATCCTGGTTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGGACGCAC  78
           ||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||
Sbjct  1   TTTGATCCTGGTTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCAC  60

Query  79  AGTTTATACCGTAGCTTGCTACACCATAGACTGTGAGTTGCGAACGGGTGAGTAACGCGT  138
           ||  |  ||||||| || ||||||| ||  ||||||||||||||||||||||||||||||
Sbjct  61  AGGATGCACCGTAGTTTACTACACCGTATTCTGTGAGTTGCGAACGGGTGAGTAACGCGT  120

...
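
The "Identities" figure in the alignment above is just the fraction of aligned positions at which the two sequences agree. As a minimal sketch, the base R code below recomputes that fraction for the 120 positions displayed on this slide (the two 60-base segments of each sequence, pasted together). The variable names are illustrative, and the position-wise comparison is only valid here because the alignment is reported as gap free.

```r
# The two 60-base segments of each sequence shown above, concatenated.
query <- paste0(
  "TTTGATCCTGGTTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGGACGCAC",
  "AGTTTATACCGTAGCTTGCTACACCATAGACTGTGAGTTGCGAACGGGTGAGTAACGCGT")
sbjct <- paste0(
  "TTTGATCCTGGTTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCAC",
  "AGGATGCACCGTAGTTTACTACACCGTATTCTGTGAGTTGCGAACGGGTGAGTAACGCGT")

# Split into single bases so that positions can be compared one by one.
query_bases <- strsplit(query, "")[[1]]
sbjct_bases <- strsplit(sbjct, "")[[1]]
stopifnot(length(query_bases) == length(sbjct_bases))

n_match <- sum(query_bases == sbjct_bases)
identity <- n_match / length(query_bases)
c(matches = n_match, positions = length(query_bases), identity = identity)
```

The displayed region agrees at 110 of 120 positions, so it is more divergent than the 97% figure for the full alignment; differences of this kind are concentrated in the hypervariable regions.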

Microbiomics

With the HOMIM system one obtains a value between 0 and 5 that indicates the quantity of a combination of species.

So a simple application would be to test for differences in microbe levels between 2 patient groups.

Given the semi-quantitative nature of this data I have used permutation tests to test for differences. This isn’t too computationally intensive as there are only several hundred bacteria represented on the microarray.
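
A permutation test of this kind can be sketched in a few lines of base R. The scores below are made-up values on the 0 to 5 scale for a single probe, and the choice of test statistic (absolute difference in group means) and number of permutations are assumptions made for illustration rather than the analysis actually used.

```r
set.seed(2013)
# Hypothetical HOMIM-style scores (integers 0-5) for one probe in two patient groups.
group <- rep(c("A", "B"), each = 10)
score <- c(0, 1, 0, 2, 1, 0, 1, 0, 2, 1,   # group A
           3, 2, 4, 1, 3, 5, 2, 3, 4, 2)   # group B

# Test statistic: absolute difference in mean score between the groups.
diff_means <- function(x, g) abs(mean(x[g == "A"]) - mean(x[g == "B"]))
observed <- diff_means(score, group)

# Reference distribution: recompute the statistic after shuffling the group labels.
n_perm <- 2000
perm_stats <- replicate(n_perm, diff_means(score, sample(group)))

# Permutation p-value (the +1 counts the observed labelling itself).
p_value <- (sum(perm_stats >= observed) + 1) / (n_perm + 1)
p_value
```

Repeating this for each of a few hundred probes is quick, which is why the computational burden is modest.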

What is tricky is that any given probe typically interrogates multiple species, so going from a set of differentially expressed probes to a collection of differentially expressed species is difficult.

Microbiomics

A far more popular way to address the question of differential microbe levels is to use next generation sequencing.

So, one obtains a sample, sequences short reads, then aligns these reads to the 16S rRNA DNA sequence of many bacteria to determine which each read matches best.

Rather than using Illumina’s sequencing technology, pyrosequencing (or 454 sequencing) is typically used in these investigations.

This technology gives longer reads than Illumina’s method: 800 bases are typical currently (and this appears to be increasing over time, as does Illumina’s).

Microbiomics

Most researchers take advantage of the Ribosomal Database Project for alignment of the sequences they generate.

Cole JR, Wang Q, Cardenas E et al. (2008), “The ribosomal database project: improved alignments and new tools for rRNA analysis”, Nucleic Acids Research, 37, D141-D145.

This resource can then be used to generate a table of counts for each sample and each bacterial species detected.
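
To make the shape of such a table concrete, here is a minimal base R sketch that cross-tabulates per-read assignments into a genus-by-sample count matrix. The data frame of assignments is entirely hypothetical; the actual output format of the Ribosomal Database Project tools is not described in these notes.

```r
# Hypothetical classifier output: one row per read, with its sample and assigned genus.
reads <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2", "S3", "S3", "S3", "S3"),
  genus  = c("Prevotella", "Rothia", "Prevotella", "Rothia", "Neisseria",
             "Prevotella", "Prevotella", "Neisseria", "Rothia")
)

# Cross-tabulate: rows are genera, columns are samples.
counts <- unclass(table(reads$genus, reads$sample))
counts
```

A matrix laid out this way (features in rows, samples in columns) is the form of input expected by count-based packages such as edgeR.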

We can then use edgeR or DESeq to test for differences between groups (and control for covariates).

Microbiomics

As an example, we will consider a data set I came across where the researchers were interested in bacterial populations in the lungs of COPD patients (thanks to Alexa Pragman, and the Isaacson and Wendt labs).

COPD is chronic obstructive pulmonary disease and is a very common medical problem throughout the world (its main cause is smoking).

As frequent infections are a common symptom, the researchers were interested in comparing the lung microbiome of COPD subjects (classified as either moderate or severe) to healthy controls.

They used the resources of the Ribosomal Database Project to generate a table of counts.

For further details:

Pragman AA, Kim HB, Reilly CS, Wendt C, Isaacson RE (2012), “The lung microbiome in moderate and severe chronic obstructive pulmonary disease”, PLoS One, 7.

Microbiomics

First read in the count data and set up the group indicators.

> bactCnt <- read.table("/export/home/courses/ph7445/data/bactCounts.txt")

> dis <- factor(substring(colnames(bactCnt),1,3))
> dim(bactCnt)
[1] 142 32
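
The group indicator is built from the first three characters of each column name. The sketch below shows the idea on made-up column names; the codes MOD and SVR match the columns of the design matrix shown later, while the control code "CON" and the numbering scheme are assumptions, since the actual column names are not shown.

```r
# Hypothetical column names: a three-letter group code followed by a subject number.
sample_names <- c("CON1", "CON2", "MOD1", "MOD2", "SVR1", "SVR2")

# Keep characters 1 to 3 and convert to a factor; levels are sorted alphabetically,
# so the first level ("CON" here) becomes the reference group in model.matrix.
dis_demo <- factor(substring(sample_names, 1, 3))
levels(dis_demo)
table(dis_demo)
```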

Then initialize using the DGEList command.

> bactDEL <- DGEList(counts=bactCnt, group=dis)
Calculating library sizes from column totals.

Microbiomics

Then set up the design matrix as we’ve seen before, and estimate the dispersions.

> design <- model.matrix(~dis)
> bactDEL <- estimateGLMTrendedDisp(bactDEL, design)

Now proceed to get the genus-level dispersions, fit the GLM and test the second coefficient.

> bactDEL <- estimateGLMTagwiseDisp(bactDEL, design)
> fit <- glmFit(bactDEL, design)
> lrt <- glmLRT(fit, coef=2)

Microbiomics

Then we can examine the top hits in terms of differences between the moderate COPD patients and the controls.

> topTags(lrt)
Coefficient: disMOD
                     logFC    logCPM        LR       PValue         FDR
Brevibacillus   -13.722702 14.810695 15.104481 0.0001017215 0.008070975
Oribacterium      9.701643 12.662380 14.464858 0.0001427989 0.008070975
Selenomonas      13.285029 12.824495 13.872045 0.0001956876 0.008070975
Atopobium        12.660858 11.703525 13.563390 0.0002306402 0.008070975
Prevotella        9.233926 13.455132 13.022587 0.0003077563 0.008070975
Actinomyces       7.734458 17.424870 12.830431 0.0003410271 0.008070975
Campylobacter     9.830081  9.116819 11.820702 0.0005857576 0.011882510
Centipeda        12.895376 12.722930 11.205179 0.0008156937 0.014478564
Bifidobacterium  11.624636 10.634536  9.787951 0.0017565909 0.025533975
Solobacterium     9.633208  9.067843  9.744943 0.0017981673 0.025533975

Microbiomics

Since some genera are represented by a single count in a single subject, we may want to consider doing some filtering.

Here is a summary of the total counts for all genera in this data set.

> apply(bactCnt,1,sum)
 Saccharomonospora Ornithinimicrobium        Micrococcus   Cryptosporangium
                 1                  1                  1                  1
       Kineococcus     Dolosigranulum     Syntrophomonas     Fastidiosipila
                 1                  1                  1                  1
...
      Enterococcus   Pseudoramibacter        Clostridium      Modestobacter
                 9                  9                  9                 10
   Cryptobacterium         Tannerella       Dysgonomonas       Sphingomonas
                10                 10                 10                 10
...
     Streptococcus    Corynebacterium  Propionibacterium        Actinomyces
             26394              30727              75877              82189

Microbiomics

So let’s require that there are at least 10 counts total.

> bactCntF <- bactCnt[apply(bactCnt,1,sum) >= 10,]
> bactDEL1 <- DGEList(counts=bactCntF, group=dis)
> bactDEL1 <- estimateGLMTrendedDisp(bactDEL1, design)

Then proceed as before.

> bactDEL1 <- estimateGLMTagwiseDisp(bactDEL1, design)
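
The filtering step above keeps only the rows (genera) whose total count across all subjects is at least 10. As a small self-contained illustration on a made-up count matrix (the genus and sample names are hypothetical), the same logic can be written with rowSums, which gives the same totals as apply(x, 1, sum):

```r
# Toy count matrix: 4 genera (rows) by 4 samples (columns); values are made up.
toy <- matrix(c( 0,  1, 0,  0,
                 5,  3, 4,  2,
                 0,  0, 9,  0,
                12, 20, 7, 15),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("genusA", "genusB", "genusC", "genusD"),
                              c("s1", "s2", "s3", "s4")))

totals <- rowSums(toy)           # total count per genus
keep <- totals >= 10             # same threshold as used above
toy_filtered <- toy[keep, , drop = FALSE]
toy_filtered
```

Note that a genus with all of its counts in one subject (such as genusC here, if the threshold were lower) would still pass a filter based on totals alone; requiring a minimum number of subjects with nonzero counts is an alternative design choice.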

Now we’ll look at the design matrix to determine what the parameters represent.

Microbiomics

> design
   (Intercept) disMOD disSVR
1            1      0      0
2            1      0      0
3            1      0      0
...
12           1      1      0
13           1      1      0
14           1      1      0
...
30           1      0      1
31           1      0      1
32           1      0      1

So coefficient 2 in the model corresponds to a difference between the moderates and the controls, while coefficient 3 tests for a difference between the severe patients and the controls.
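
This treatment-contrast coding can be checked on a small example with base R. In the sketch below the factor levels, group sizes and group means are hypothetical; it only shows that, with the first level as reference, the second and third coefficients are differences from the reference group. In edgeR the model is fit on the log scale, so these differences are log fold changes.

```r
# Hypothetical group factor with two subjects per group; "CON" is an assumed label.
dis_demo <- factor(rep(c("CON", "MOD", "SVR"), times = c(2, 2, 2)))
design_demo <- model.matrix(~ dis_demo)
colnames(design_demo)

# Made-up group means on the scale of the linear predictor.
mu <- c(CON = 1.0, MOD = 3.0, SVR = 0.5)

# Coefficient 1 is the reference mean; coefficients 2 and 3 are differences from it.
beta <- c(mu["CON"], mu["MOD"] - mu["CON"], mu["SVR"] - mu["CON"])

# Multiplying the design matrix by beta recovers each subject's group mean.
fitted <- drop(design_demo %*% beta)
fitted
```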

Microbiomics

So now we fit the GLM and conduct tests for the 2 coefficients.

By specifying the coefficient we can get the genera that differ between the controls and the moderates, and between the controls and the severe patients.

> fit1 <- glmFit(bactDEL1, design)
> lrt1.2 <- glmLRT(fit1, coef=2)
> lrt1.3 <- glmLRT(fit1, coef=3)

We can then examine the results sorted by p-value, first for the moderates versus the controls.

Microbiomics

> topTags(lrt1.2)
Coefficient: disMOD
                     logFC    logCPM        LR       PValue         FDR
Brevibacillus   -13.721455 14.810756 14.872653 0.0001150184 0.004840491
Oribacterium      9.701546 12.662820 14.420692 0.0001461872 0.004840491
Selenomonas      13.285275 12.825141 13.862729 0.0001966601 0.004840491
Atopobium        12.661042 11.704238 13.548598 0.0002324650 0.004840491
Prevotella        9.233108 13.455509 12.943439 0.0003210447 0.004840491
Actinomyces       7.734822 17.425438 12.870362 0.0003338270 0.004840491
Campylobacter     9.830233  9.117536 11.790320 0.0005953950 0.007399910
Centipeda        12.895653 12.723502 11.201568 0.0008172823 0.008887945
Bifidobacterium  11.624857 10.635275  9.779017 0.0017651462 0.016029713
Solobacterium     9.633091  9.068224  9.700184 0.0018424958 0.016029713

Microbiomics

And here are the genera that differ between the severe cases and controls.

> topTags(lrt1.3)
Coefficient: disSVR
                  logFC    logCPM        LR       PValue         FDR
Neisseria     15.393096 13.579971 17.929674 2.292192e-05 0.001810056
Oribacterium  11.144734 12.662820 16.796477 4.161048e-05 0.001810056
Selenomonas   13.032916 12.825141 13.203214 2.794694e-04 0.008104614
Centipeda     13.393287 12.723502 11.483554 7.021475e-04 0.015271707
Prevotella     8.419435 13.455509 11.056891 8.835820e-04 0.015374326
Veillonella    9.046095 11.290540 10.626291 1.114911e-03 0.016166215
Rothia         7.853258 15.426206  9.626303 1.918103e-03 0.019610812
Eubacterium   10.450272  9.359486  9.541963 2.008269e-03 0.019610812
Parvimonas    12.576958 10.787986  9.471517 2.086867e-03 0.019610812
Paenibacillus 16.763945 14.931613  9.295272 2.297460e-03 0.019610812

Note that Prevotella is in both lists: this genus has been found in samples from patients with respiratory infections and is found in the mouths of patients with periodontal disease.

Some of the other genera (e.g. Neisseria) include well-known pathogenic species.

Microbiomics

And we can further examine the results by accessing the table stored in the test object. For example, we may want to control the FDR while accounting for dependence among the tests (the Benjamini-Yekutieli adjustment, "BY") rather than just using Benjamini-Hochberg (which is what is reported in the table):

> padj1=p.adjust(lrt1.2$table$PValue, "BY")
> summary(padj1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.02444 0.22060 0.60170 0.61180 1.00000 1.00000
> sum(padj1<.1)
[1] 10
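
The relationship between the two adjustments can be checked directly. The sketch below uses p.adjust on a hypothetical vector of p-values to show that the "BY" value equals the Benjamini-Hochberg value multiplied by the harmonic sum 1 + 1/2 + ... + 1/m (capped at 1), and then applies the same factor to the smallest FDR in the moderate-versus-control table. The value m = 87 is an assumption: it is not stated in these notes, but it is the number of tested genera implied by the reported p-value and FDR columns.

```r
# Hypothetical raw p-values, for illustration only.
p_demo <- c(0.0002, 0.001, 0.004, 0.03, 0.2, 0.5, 0.9)
bh <- p.adjust(p_demo, method = "BH")
by <- p.adjust(p_demo, method = "BY")

# BY inflates BH by the harmonic sum over the number of tests, capped at 1.
harmonic_demo <- sum(1 / seq_along(p_demo))
stopifnot(isTRUE(all.equal(by, pmin(1, harmonic_demo * bh))))

# The same relationship applied to the results above.
m <- 87                 # ASSUMPTION: number of genera tested (inferred, not stated)
bh_min <- 0.004840491   # smallest FDR reported by topTags(lrt1.2)
by_min <- min(1, sum(1 / seq_len(m)) * bh_min)
round(by_min, 5)
```

Under that assumption the result agrees with the minimum of 0.02444 in the summary, and the inflation factor of about 5 explains why far fewer genera survive the "BY" adjustment at a given threshold.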

Microbiomics

We can use a similar approach to test for differences between the control group and the severe group.

> padj2=p.adjust(lrt1.3$table$PValue, "BY")
> summary(padj2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.009139 0.232100 0.546100 0.604400 1.000000 1.000000
> sum(padj2<.1)
[1] 12

Microbiomics

We can then examine the genera we are identifying with this very conservative approach.

> rownames(bactCntF)[padj1<.1]
 [1] "Solobacterium"   "Campylobacter"   "Bifidobacterium" "Atopobium"
 [5] "Oribacterium"    "Centipeda"       "Selenomonas"     "Prevotella"
 [9] "Brevibacillus"   "Actinomyces"

> rownames(bactCntF)[padj2<.1]
 [1] "Eubacterium"   "Campylobacter" "Fusobacterium" "Parvimonas"
 [5] "Veillonella"   "Oribacterium"  "Centipeda"     "Selenomonas"
 [9] "Neisseria"     "Prevotella"    "Paenibacillus" "Rothia"

> intersect(rownames(bactCntF)[padj1<.1], rownames(bactCntF)[padj2<.1])
[1] "Campylobacter" "Oribacterium"  "Centipeda"     "Selenomonas"
[5] "Prevotella"