mapping of next generation sequencing data · 2019. 3. 28. · sequencing done in a massively...


Mapping of Next Generation Sequencing Data

Agnes Hotz-Wagenblatt

Bioinformatik (HUSAR)


Next Generation Sequencers

Next (or 2nd) generation sequencers came onto the scene in the early 2000s. General characteristics include:

• Amplification of genetic material by PCR
• Ligation of amplified material to a solid surface
• Sequence of the target genetic material is determined using sequencing-by-synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing-by-ligation
• Sequencing is done in a massively parallel fashion, and sequence information is captured by a computer


Sanger sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags


Cyclic-array methods

• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield an array of PCR colonies
• Enzymatic extension with fluorescently tagged nucleotides
• Cyclic readout by imaging the array


Emulsion PCR

• Fragments, with adaptors, are PCR amplified within a water drop in oil.
• One primer is attached to the surface of a bead.
• Used by 454, Polonator and SOLiD.


Bridge PCR

• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.


Comparison of existing methods


Read length and pairing

Short reads are problematic because a short sequence often does not map uniquely to the genome.

Solution #1: Get longer reads.
Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
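A small illustration of why short reads are ambiguous. This is only a sketch: the toy genome and the helper name `find_all` are made up for this example, and real mappers do not scan the genome this way.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based start position at which `read` occurs in `genome`."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        # Continue one base further so overlapping occurrences are found too.
        start = genome.find(read, start + 1)
    return hits


# Hypothetical toy genome that contains the repeat "GACGTAC" twice.
genome = "TTGACGTACAGGCCGACGTACTTA"

short_read = "GACGTAC"      # lies entirely inside the repeat
longer_read = "GACGTACAGG"  # extends into unique sequence

print(find_all(genome, short_read))   # two candidate positions
print(find_all(genome, longer_read))  # one position
```

The short read has two equally good placements, while the longer read reaches past the repeat and is placed uniquely.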


Third generation

• Nanopore sequencing
  - Nucleic acids are driven through a nanopore.
  - Differences in conductance of the pore provide the readout.

• Real-time monitoring of PCR activity
  - Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
  - Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides


Analysis tasks

• Base calling / polymorphism detection
• Mapping to a reference genome
• De novo or assisted genome assembly


Next Gen. Sequencers Cont.

Sequencing platform          Sequencing chemistry                                  Template amplification method       Read length   Sequencing throughput
ABI3730xl Genome Analyzer    Automated Sanger sequencing                           In vivo amplification via cloning   700–900 bp    0.03–0.07 Mb/h
Roche (454) FLX              Pyrosequencing on solid support                       Emulsion PCR                        200–300 bp    13 Mb/h
Illumina Genome Analyzer     Sequencing-by-synthesis with reversible terminators   Bridge PCR                          32–40 bp      25 Mb/h
ABI SOLiD                    Sequencing by ligation                                Emulsion PCR                        35 bp         21–28 Mb/h
HeliScope                    Sequencing-by-synthesis with virtual terminators      None (single molecule)              25–35 bp      83 Mb/h


Usage of Sequencing

Resequencing
− Map reads back to genome
− Call bases

RNA-seq
− Map reads back to genome
− Count tags to determine gene expression levels

ChIP-seq
− Map reads back to genome
− Peaks determine binding sites

Nearly all experiments have the same first step!
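Once reads are mapped, the "count tags" step for RNA-seq can be sketched roughly as follows. The gene names, coordinates and the function `count_tags` are hypothetical, and gene intervals are assumed to be half-open (start included, end excluded); real pipelines also handle strands, multi-mapping reads and overlapping genes.

```python
def count_tags(read_positions: list[int], genes: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Count how many mapped read start positions fall inside each gene interval."""
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:  # half-open interval [start, end)
                counts[name] += 1
    return counts


# Hypothetical annotation and mapped read start positions.
genes = {"geneA": (100, 200), "geneB": (500, 650)}
read_positions = [105, 150, 199, 200, 510, 900]

print(count_tags(read_positions, genes))
```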


Bioinformatics

Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage.

New bioinformatics tools were developed to make use of the huge number of much shorter reads (~35 bp vs. ~800 bp):

• Bowtie - ultrafast, memory-efficient short read aligner
• SOAPdenovo - part of the SOAP suite, a de novo assembler used to build a reference genome
• TopHat - a fast splice junction mapper for RNA-Seq reads


Mapping Methods

Hash table (lookup table)
- fast, but requires perfect matches

Array scanning
- can handle mismatches, but not gaps

Dynamic programming, e.g. Smith-Waterman
- handles indels, mathematically optimal, slow
- most programs use hash mapping as a prefilter

Burrows-Wheeler Transform
- fast and memory efficient, but less suited for gaps and mismatches
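To make the dynamic programming entry concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions for illustration, and only the best score is returned, not the alignment itself.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between `a` and `b`."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b
            left = H[i][j - 1] + gap  # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


print(smith_waterman("ACGTACGT", "ACGACGT"))  # alignment needs one gap
```

The table has len(a) x len(b) cells, which is why this method is slow on whole genomes and is usually applied only to candidate regions found by a faster prefilter.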


Programs

Heng Li and Nils Homer: A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, Advance Access published online on May 11, 2010

Hash tables: Eland, SOAP, SeqMap, MAQ, RMAP, ZOOM, Novoalign

BW transform, FM index: Bowtie, BWA, SOAP2, BWA-SW


Hash table

A hash table is a data structure that stores items and allows insertions, lookups, and deletions to be performed in O(1) time on average.

An algorithm converts an object, typically a string, to a number. The number is then compressed according to the size of the table and used as an index.

Distinct items may be mapped to the same index. This is called a collision and must be resolved.
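The steps above (hash code, compression, collision handling) can be sketched as a toy table. The class name `ChainedHashTable`, the character-sum hash and the choice of chaining for collisions are assumptions made for this example; production hash tables use much better hash functions.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining (a list per slot)."""

    def __init__(self, size: int = 10):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def index(self, key: str) -> int:
        code = sum(ord(ch) for ch in key)  # toy hash code generator
        return code % self.size            # compression to the table size

    def insert(self, key: str, value) -> None:
        bucket = self.slots[self.index(key)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key: str):
        for existing, value in self.slots[self.index(key)]:
            if existing == key:
                return value
        return None

    def delete(self, key: str) -> bool:
        bucket = self.slots[self.index(key)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(size=10)
table.insert("Smith", "record for Bob Smith")
table.insert("ab", 1)
table.insert("ba", 2)  # same character sum as "ab": a collision
print(table.index("Smith"), table.lookup("ab"), table.lookup("ba"))
```

With this particular toy hash, the key "Smith" happens to land at index 7 of a ten-slot table, and "ab" and "ba" collide but are still retrieved correctly from the same slot.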


[Figure: Key -> Hash Code Generator -> Number -> Compression -> Index. The key "Smith" is mapped to index 7 of a table with slots 0 to 9; that slot holds the stored record, an example address entry for Bob Smith.]


First Hash Table Lookup

BLAST was the first program to use this algorithm.

A k-mer seed (an 11-mer by default) from the query is searched in all database sequences, and the results are kept in hash tables.


Hash table algorithm

Each tool builds a hash table of short oligomers present in
- either the reads (SHRiMP, Maq, RMAP, and ZOOM)
- or the reference (SOAP).

ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on an algorithm by Baeza-Yates and Perleberg.

Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length.

SHRiMP employs a combination of spaced seeds and the Smith-Waterman algorithm to align reads with high sensitivity at the expense of speed.

Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm with spaced seeds to align reads.
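A rough sketch of the reference-indexing variant: build a hash table of all k-mers of the reference, look up each k-mer of the read as a seed, and verify every candidate position by counting mismatches. The function names, the toy reference and the parameter values are assumptions for illustration; this is not how any of the named tools is actually implemented.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int = 1) -> list[int]:
    """Seed with each k-mer of the read, then verify candidates by counting mismatches."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the whole read would start
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Hypothetical toy reference; k = 4 is far shorter than real seeds.
reference = "TTGACGTACAGGCCGACGTACTTA"
index = build_kmer_index(reference, k=4)

print(map_read("ACGTACAG", reference, index, k=4, max_mismatches=0))  # exact read
print(map_read("ACGTTCAG", reference, index, k=4, max_mismatches=1))  # one mismatch
```

The seed lookup itself needs a perfect k-mer match; mismatches are tolerated only because at least one seed of the read is still intact.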


Spaced seeds

A template ‘111010010100110111’ requiring 11 matches at the ‘1’ positions is 55% more sensitive than BLAST’s default template ‘11111111111’ for two sequences of 70% similarity.

A seed allowing internal mismatches is called a spaced seed; the number of matches in the seed is its weight.

Eland was the first program to use spaced seeds in short-read alignment. It uses six seed templates spanning the entire short read, such that a two-mismatch hit is guaranteed to be identified by at least one of the templates.

SOAP adopts almost the same strategy, except that it indexes the genome rather than the reads.

SeqMap and MAQ extend the method to allow k mismatches.
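A minimal sketch of how a spaced-seed template is applied: only the bases under the ‘1’ positions are compared, so mismatches under ‘0’ positions do not prevent a seed hit. The helper names `seed_key` and `seed_hit` and the two toy sequences are assumptions for this example.

```python
TEMPLATE = "111010010100110111"  # the weight-11 template quoted above
CONTIGUOUS = "1" * 11            # a contiguous seed of the same weight


def seed_key(seq: str, template: str) -> str:
    """Keep only the bases that lie under '1' positions of the template."""
    return "".join(base for base, flag in zip(seq, template) if flag == "1")


def seed_hit(a: str, b: str, template: str) -> bool:
    """True if a and b agree at every '1' position of the template."""
    if len(a) < len(template) or len(b) < len(template):
        return False
    return seed_key(a, template) == seed_key(b, template)


a = "ACGTACGTACGTACGTAC"
b = "ACGAAGGTACGTACGTAC"  # differs from a at positions 3 and 5 (both '0' in TEMPLATE)

print(TEMPLATE.count("1"))          # weight of the spaced seed
print(seed_hit(a, b, TEMPLATE))     # spaced seed tolerates the two mismatches
print(seed_hit(a, b, CONTIGUOUS))   # contiguous seed at the same offset does not
```

In a hash-based mapper the string returned by `seed_key` would serve as the hash key, so reads and reference windows that differ only at ‘0’ positions fall into the same bucket.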


Burrows-Wheeler Transform

Used by Bowtie: Langmead et al. Genome Biology 2009 10:R25 doi:10.1186/gb-2009-10-3-r25

Two steps: identifying exact matches, and building inexact alignments supported by exact matches.

The first step relies on a suffix tree, an enhanced suffix array or an FM-index.

The design of the FM-index is based upon the relationship between the Burrows-Wheeler compression algorithm and the suffix array data structure.

The advantage of using a trie is that alignment to multiple identical copies of a substring in the reference only needs to be done once.


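What is a suffix tree?

S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S. Edges are labelled with substrings of S (for example A, LA, M, YALAM$, ALAYALAM$, $), and each leaf carries the start position (1 to 10) of the suffix it spells.]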


Finding a (short) Pattern in a (long) String

Build a suffix tree of the string.

Starting from the root, traverse a path matching characters of the pattern.

If stuck, the pattern is not present in the string.

Otherwise, each leaf below gives a position of the pattern in the string.


Finding a Pattern in a String

Find “ALA” in MALAYALAM$

[Figure: the suffix tree of MALAYALAM$ with the path spelling ALA followed from the root; the two leaves below it are 6 and 2.]

Two matches - at 6 and 2
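The search procedure can be tried out with a naive suffix trie. Note the simplifications: this is an uncompressed trie built in quadratic time and space (a real suffix tree merges single-child paths and is built in linear time), and each node simply stores the start positions of all suffixes passing through it, which corresponds to the leaves below that node. All names are made up for this sketch.

```python
def build_suffix_trie(text: str) -> dict:
    """Build an uncompressed suffix trie; nodes record 1-based suffix start positions."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)  # this suffix passes through the node
    return root


def find_pattern(root: dict, pattern: str) -> list[int]:
    """Walk down from the root along the pattern; return all match positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:  # stuck: the pattern is not present
            return []
    return sorted(node["starts"])


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))  # the two matches from the slide
print(find_pattern(trie, "XYZ"))  # not present
```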


Suffix Array

Suffixes of abracadabra (length 11): abracadabra, bracadabra, racadabra, etc.

Ordered lexicographically:

a
abra
abracadabra
acadabra
adabra
bra
bracadabra
cadabra
dabra
ra
racadabra

The suffix array is an array of the suffix start indices (counted from 1 or 0), listed in lexicographical order of the suffixes. For the string "abracadabra" the suffix array (counting from 1) is {11,8,1,4,6,9,2,5,7,10,3}, because suffix "a" starts at the 11th letter, "abra" starts at the 8th letter, etc.
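The suffix array of the example can be reproduced with a few lines. This is a naive construction by sorting the suffixes directly, which is fine for a slide example but far too slow for a genome; the function names are assumptions. Because the suffixes are sorted, a pattern can be located by binary search.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 1-based start positions of all suffixes in lexicographical order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_with_suffix_array(text: str, sa: list[int], pattern: str) -> list[int]:
    """Binary search for the first suffix >= pattern, then collect all matches."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(pattern) characters of the suffix.
        if text[sa[mid] - 1:sa[mid] - 1 + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text.startswith(pattern, sa[lo] - 1):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


text = "abracadabra"
sa = suffix_array(text)
print(sa)
print(find_with_suffix_array(text, sa, "abra"))
```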


Sort the rows

All rotations of T = mississippi#:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sorted rows (F = first column, L = last column):

F            L
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

• Every column is a permutation of T.
• Given row i, char L[i] precedes F[i] in the original T.
• Consecutive chars in L are adjacent to similar strings in T.
• Therefore, L usually contains long runs of identical chars.
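The construction above translates almost literally into code. This is a naive sketch that materialises all rotations (real tools derive the transform from a suffix array instead); it assumes the text ends with a unique terminator that sorts before all other characters, which holds for '#' and '$' in ASCII.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def first_column(text: str) -> str:
    """F is simply the characters of the text in sorted order."""
    return "".join(sorted(text))


print(bwt("mississippi#"))
print(first_column("mississippi#"))
```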


Reminder: Recovering T from L

1. Find F by sorting L
2. First char of T? m
3. Find m in L
4. L[i] precedes F[i] in T. Therefore we get mi
5. How do we choose the correct i in L?
   The i’s are in the same order in L and F, as are the rest of the chars
6. i is followed by s: mis
7. And so on…
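The key fact in step 5 (equal characters appear in the same relative order in L and F) gives the so-called LF mapping, which is enough to invert the transform. The sketch below walks backwards through T rather than forwards as on the slide, which is the more common formulation; the function name and the assumption of a unique terminator character are mine.

```python
def inverse_bwt(last: str, terminator: str = "#") -> str:
    """Recover T from its BWT column L, assuming T ends with a unique terminator."""
    # rank[i] = how many times last[i] has occurred in last[:i]
    seen: dict[str, int] = {}
    rank = []
    for ch in last:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # first_row[ch] = first row of F (the sorted column) that starts with ch
    first_row = {}
    row = 0
    for ch in sorted(seen):
        first_row[ch] = row
        row += seen[ch]

    # The row whose last char is the terminator is T itself; walk backwards from it.
    row = last.index(terminator)
    reversed_text = []
    for _ in range(len(last)):
        ch = last[row]
        reversed_text.append(ch)
        row = first_row[ch] + rank[row]  # LF mapping: same occurrence of ch in F
    return "".join(reversed(reversed_text))


print(inverse_bwt("ipssm#pissii"))
```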


MTF

Replace each char in L with the number of distinct chars seen since its last occurrence.

Keep an MTF[1,…,|Σ|] array, initially sorted lexicographically.

Runs of identical chars are transformed into runs of zeroes in L(MTF).

L      = i p s s m # p i s s i i
L(MTF) = 1 3 4 0 4 4 3 4 4 0 1 0

MTF array (positions 0 1 2 3 4):
initially:  # i m p s
after i:    i # m p s
after p:    p i # m s
after s:    s p i # m
And so on…

• Bad example
• For larger texts we will receive more runs of zeroes, and a dominance of smaller numbers.
• The reason is that the BWT creates clusters of similar chars.
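A short sketch of move-to-front coding as described: each character is replaced by its current position in the MTF list and then moved to the front. The function names are assumptions; the alphabet is passed in sorted order.

```python
def mtf_encode(text: str, alphabet: str) -> list[int]:
    """Move-to-front encode `text` using `alphabet` as the initial list."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        idx = symbols.index(ch)          # current position of the character
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))  # move it to the front
    return codes


def mtf_decode(codes: list[int], alphabet: str) -> str:
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


codes = mtf_encode("ipssm#pissii", "#imps")
print(codes)
print(mtf_decode(codes, "#imps"))
```

A repeated character is always found at position 0, so runs in L become runs of zeroes.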


Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column).
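The exact-matching step in panel (b) can be sketched as a backward search over the transform. This is an illustrative, unoptimised version and not Bowtie's implementation: the names `fm_index` and `exact_match` are made up, the full occurrence table is stored uncompressed, and the whole suffix array is kept to report positions (0-based). It assumes the text ends with a unique terminator that sorts first.

```python
def fm_index(text: str):
    """Return (suffix array, C table, occurrence table) for a terminated text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)  # BWT; text[-1] is used when i == 0
    alphabet = sorted(set(text))
    # C[ch] = number of characters in the text that are smaller than ch
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in alphabet}
    for i, ch in enumerate(last):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return sa, C, occ


def exact_match(pattern: str, sa, C, occ) -> list[int]:
    """Backward search: narrow the row range one pattern character at a time."""
    top, bot = 0, len(sa)  # half-open range of matrix rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        top = C[ch] + occ[ch][top]
        bot = C[ch] + occ[ch][bot]
        if top >= bot:  # empty range: no suffix is prefixed by the pattern
            return []
    return sorted(sa[row] for row in range(top, bot))


sa, C, occ = fm_index("acaacg$")
print(exact_match("aac", sa, C, occ))   # one occurrence
print(exact_match("ggta", sa, C, occ))  # no exact match
```

The pattern is consumed from its last character to its first, and each step costs the same regardless of the length of the reference.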


Copyright restrictions may apply.

Li, H. et al. Brief Bioinform 2010 0:bbq015v1-15; doi:10.1093/bib/bbq015

Data structures based on a prefix tree

(A) Prefix trie of string AGGAGC where symbol ^ marks the start of the string. The two numbers in each node give the suffix array interval of the substring represented by the node, which is the string concatenation of edge symbols from the node to the root. (B) Compressed prefix trie by contracting nodes with in- and out-degree both being one. (C) Prefix tree by representing the substring on each edge as the interval on the original string. (D) Prefix directed word graph (prefix DAWG) created by collapsing nodes of the prefix trie with identical suffix array interval. (E) Constructing the suffix array and Burrows–Wheeler transform of AGGAGC.

String: AGGAGC


Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'.


Role of paired-end and mate-pair mapping

Some sequencing technologies produce read pairs such that the two reads are known to be close to each other in physical chromosomal distance.

These reads are called paired-end or mate-pair reads.

- With this mate-pair information, a repetitive read can be reliably placed if its mate can be placed unambiguously.

- Alignment errors may be detected and fixed when wrong alignments break the mate-pair requirement.
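The first point can be illustrated with a very small sketch: among several candidate positions of a repetitive read, keep only those at a plausible distance from the uniquely placed mate. The function name, the insert-size bounds and the coordinates are hypothetical, and strand and orientation are ignored here.

```python
def place_with_mate(candidates: list[int], mate_position: int,
                    min_insert: int = 200, max_insert: int = 500) -> list[int]:
    """Keep candidate positions that lie within the expected distance of the mate."""
    return [pos for pos in candidates
            if min_insert <= abs(mate_position - pos) <= max_insert]


# A repetitive read with three equally good placements; its mate maps uniquely.
candidates = [1_000, 52_000, 910_000]
mate_position = 52_300

print(place_with_mate(candidates, mate_position))
```

If no candidate survives, the pair violates the mate-pair requirement, which is the signal used to detect a wrong alignment.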


Effect of paired end alignment


Effect of quality values


Aligning bisulfite-treated reads

Bisulfite sequencing is a technology to identify methylation patterns.

- Cytosines with underlines are not methylated.
- Denaturation and bisulfite treatment will convert these cytosines to uracils.
- After amplification, four different sequences result from the original double-stranded DNA.


Aligning bisulfite reads

1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment.

2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.

Xi and Li BMC Bioinformatics 2009 10:232 doi:10.1186/1471-2105-10-232


Aligning bisulfite-treated reads

- Two reference sequences:
  one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference),
  the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference).

- Alignment: ‘C’ bases in the reads are converted to ‘T’ bases, and the reads are mapped to the C-to-T reference (a C–T mismatch is then effectively regarded as a match); a similar procedure is performed for the G-to-A conversion in the next round of alignment.

- The results from the two rounds of alignment are combined to generate the final report. If there are no mutations or sequencing errors, a bisulfite-treated read can always be mapped exactly in one of the two rounds.
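The two-round scheme can be sketched with plain string replacement and exact matching. This is only an illustration under simplifying assumptions: the reference and read are toy sequences, the function name is made up, matching is exact and on the given strand only, and a real aligner would use an index rather than `str.find`.

```python
def map_bisulfite_read(read: str, reference: str) -> dict[str, list[int]]:
    """Map a read against the C-to-T and the G-to-A converted reference."""
    hits = {}
    for label, (src, dst) in {"C-to-T": ("C", "T"), "G-to-A": ("G", "A")}.items():
        conv_ref = reference.replace(src, dst)   # in silico converted reference
        conv_read = read.replace(src, dst)       # same conversion for the read
        positions = []
        start = conv_ref.find(conv_read)
        while start != -1:
            positions.append(start)
            start = conv_ref.find(conv_read, start + 1)
        hits[label] = positions
    return hits


reference = "AACGTCCGATTGCA"
# Read from reference positions 2..9 ("CGTCCGAT"): the first C is methylated and
# stays C, the two other Cs are unmethylated and are read as T.
read = "CGTTTGAT"

print(reference.find(read))                 # -1: no exact match without conversion
print(map_bisulfite_read(read, reference))  # found in the C-to-T round
```

After mapping, comparing the original read with the original reference at the hit tells which cytosines were protected (still C) and which were converted (read as T).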


Aligning spliced reads

RNA-seq produces reads from transcribed sequences, with introns and intergenic regions excluded.

When RNA-seq reads are aligned against the genomic sequence, a read may span a splice junction.

Such a read cannot be aligned by a standard alignment algorithm.

-> Special alignment, e.g. TopHat
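To show why a junction-spanning read needs special handling, here is a brute-force sketch that splits the read at every possible point and looks for the two parts separated by a gap in the genome. This is not TopHat's algorithm; the function name, the anchor and intron length limits and the toy sequences are all assumptions for illustration.

```python
def spliced_align(read: str, genome: str, min_anchor: int = 3, max_intron: int = 1000):
    """Return (left_start, split, right_start) for every split of the read whose
    two parts match the genome exactly with a gap (putative intron) in between."""
    results = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            donor = left_start + split  # first base after the left part
            right_start = genome.find(right, donor + 1)
            while right_start != -1 and right_start - donor <= max_intron:
                results.append((left_start, split, right_start))
                right_start = genome.find(right, right_start + 1)
            left_start = genome.find(left, left_start + 1)
    return results


# Hypothetical gene: two exons separated by an intron (GT...AG).
exon1, intron, exon2 = "ACCTGAGCA", "GTAAGTTTTCCCAG", "TCGGATCAC"
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]  # a read spanning the exon-exon junction

print(genome.find(read))            # -1: no contiguous match in the genome
print(spliced_align(read, genome))  # split placement across the intron
```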


Copyright restrictions may apply.

Trapnell, C. et al. Bioinformatics 2009 25:1105-1111; doi:10.1093/bioinformatics/btp120

TOPHAT Pipeline


Eukaryotic genes (exons & introns)

Splicing

Translation


Alternative Splicing

Mature splice variant I

Mature splice variant II

Alternative splicing: One gene, several proteins!


Types of alternative splicing


TopHat and Cufflinks

- Use next generation sequence data for alternative splicing


Comparison of some mapping programs

Table 1: Popular short-read alignment software

Program        Algorithm       SOLiD    Long(a)  Gapped   PE(b)  Q(c)
Bfast          hashing ref.    Yes      No       Yes      Yes    No
Bowtie         FM-index        Yes      No       No       Yes    Yes
BWA            FM-index        Yes(d)   Yes(e)   Yes      Yes    No
MAQ            hashing reads   Yes      No       Yes(f)   Yes    Yes
Mosaik         hashing ref.    Yes      Yes      Yes      Yes    No
Novoalign(g)   hashing ref.    No       No       Yes      Yes    Yes

(a) Work well for Sanger and 454 reads, allowing gaps and clipping.
(b) Paired end mapping.
(c) Make use of base quality in alignment.
(d) BWA trims the primer base and the first color for a color read.
(e) Long-read alignment implemented in the BWA-SW module.
(f) MAQ only does gapped alignment for Illumina paired-end reads.
(g) Free executable for non-profit projects only.