mapping of next generation sequencing data · 2019. 3. 28. · sequencing done in a massively...

Mapping of Next Generation Sequencing Data

Agnes Hotz-Wagenblatt

Bioinformatik (HUSAR)

Next Generation Sequencers

Next (or 2nd) generation sequencers came onto the scene in the early 2000’s.

General characteristics include:

• Amplification of genetic material by PCR
• Ligation of amplified material to a solid surface
• Sequence of the target genetic material is determined using Sequence-by-Synthesis (using labelled nucleotides or pyrosequencing for detection) or Sequence-by-ligation
• Sequencing done in a massively parallel fashion and sequence information is captured by a computer

Sanger sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags

Cyclic-array methods

• DNA is fragmented
• Adaptors ligated to fragments
• Several possible protocols yield array of PCR colonies.
• Enzymatic extension with fluorescently tagged nucleotides.
• Cyclic readout by imaging the array.

Emulsion PCR

• Fragments, with adaptors, are PCR amplified within a water drop in oil.
• One primer is attached to the surface of a bead.
• Used by 454, Polonator and SOLiD.

Bridge PCR

• DNA fragments are flanked with adaptors.
• A flat surface is coated with two types of primers, corresponding to the adaptors.
• Amplification proceeds in cycles, with one end of each bridge tethered to the surface.
• Used by Solexa.

Comparison of existing methods

Read length and pairing

Short reads are problematic, because short sequences do not map uniquely to the genome.

Solution #1: Get longer reads.
Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG

Third generation

Nanopore sequencing
• Nucleic acids driven through a nanopore.
• Differences in conductance of pore provide readout.

Real-time monitoring of PCR activity
• Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
• Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides

Analysis tasks

• Base calling / polymorphism detection
• Mapping to a reference genome
• De novo or assisted genome assembly

Next Gen. Sequencers Cont.

Sequencing platform | ABI3730xl Genome Analyzer | Roche (454) FLX | Illumina Genome Analyzer | ABI SOLiD | HeliScope
Sequencing chemistry | Automated Sanger sequencing | Pyrosequencing on solid support | Sequencing-by-synthesis with reversible terminators | Sequencing by ligation | Sequencing-by-synthesis with virtual terminators
Template amplification method | In vivo amplification via cloning | Emulsion PCR | Bridge PCR | Emulsion PCR | None (single molecule)
Read length | 700–900 bp | 200–300 bp | 32–40 bp | 35 bp | 25–35 bp
Sequencing throughput | 0.03–0.07 Mb/h | 13 Mb/h | 25 Mb/h | 21–28 Mb/h | 83 Mb/h

Usage of Sequencing

Resequencing
- Map reads back to genome
- Call bases

RNA-seq
- Count tags to determine gene expression levels

ChIP-seq
- Peaks determine binding sites.

Nearly all experiments have the same first step!

Bioinformatics

• Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage
• New bioinformatics tools were developed to utilize the huge number of much shorter reads (~35bp vs ~800bp)

• Bowtie - Ultrafast, memory-efficient short read aligner
• SOAPdenovo - Part of the SOAP suite, used to build reference genome
• TopHat - TopHat is a fast splice junction mapper for RNA-Seq reads

Mapping Methods

Hash table (Lookup table)
- fast, but requires perfect matches

Array Scanning
- can handle mismatches, but not gaps

Dynamic Programming, e.g. Smith-Waterman
- handles indels, mathematically optimal, slow
- most programs use hash mapping as prefilter

Burrows-Wheeler Transform
- fast and memory efficient, but less suited for gaps and mismatches
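To make the dynamic-programming entry concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The function name and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration; real aligners use tuned scores and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from any aligner.
    Only two rows of the DP matrix are kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # One deleted base in the second sequence is bridged by a single gap.
    print(smith_waterman_score("ACGTACGT", "ACGTCGT"))
```

The double loop is what makes the method slow: every read base is compared with every reference base, which is why most programs run it only on candidate regions found by a faster prefilter.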

Programs

Heng Li and Nils Homer, A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, Advance Access published online on May 11, 2010.

Hash Tables: Eland, SOAP, SeqMap, MAQ, RMAP, ZOOM, Novoalign

BW transform, FM index: Bowtie, BWA, SOAP2, BWA-SW

Hash table

A hash table is a data structure that stores things and allows insertions, lookups, and deletions to be performed in O(1) time on average.

An algorithm converts an object, typically a string, to a number. Then the number is compressed according to the size of the table and used as an index.

There is the possibility of distinct items being mapped to the same index. This is called a collision and must be resolved.

[Figure: Key -> Hash Code Generator -> Number -> Compression -> Index. The key "Smith" is mapped to index 7 of a table with slots 0 to 9, where the contact record for Bob Smith is stored.]
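The pipeline in the figure (key, hash code, compression, index) can be sketched as a small table that resolves collisions by chaining. The class name, the polynomial hash and the default table size are assumptions made for this illustration; they are not taken from any of the mapping programs.

```python
class ChainedHashTable:
    """Toy hash table: hash code -> compression -> bucket index, with chaining."""

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Hash code generator: a simple polynomial hash over character codes
        # (an assumption; any deterministic string hash would do).
        code = 0
        for ch in key:
            code = code * 31 + ord(ch)
        # Compression: fold the number into the range of valid table indices.
        return code % self.size

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))  # colliding keys share the bucket

    def lookup(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return None


if __name__ == "__main__":
    table = ChainedHashTable()
    table.insert("Smith", "Bob Smith")
    print(table.lookup("Smith"))
```

Chaining is only one way to resolve collisions; open addressing is a common alternative.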

First Hash Table Lookup

BLAST was the first program using this algorithm.

A k-mer seed (11-mer by default) from the query is searched in all database sequences; the results are kept in hash tables.
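A minimal sketch of seed lookup with a hash table follows. It indexes every k-mer of a reference string in a Python dictionary and then looks up each k-mer of the query. The function names are hypothetical, and indexing the reference rather than the query is an assumption made for this sketch.

```python
from collections import defaultdict


def build_kmer_index(reference, k=11):
    """Map every k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_hits(query, index, k=11):
    """Return (query_offset, reference_position) pairs for exact k-mer matches."""
    hits = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], ()):
            hits.append((q, r))
    return hits


if __name__ == "__main__":
    idx = build_kmer_index("ACGTACGTTAGC", k=4)
    print(seed_hits("TACGTT", idx, k=4))
```

Each hit is only a candidate location; a slower alignment step would then extend and verify it.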

Hash table algorithm

Each tool builds a hash table of short oligomers present in
- either the reads (SHRiMP, Maq, RMAP, and ZOOM)
- or the reference (SOAP).

ZOOM uses 'spaced seeds' to significantly outperform RMAP, which is based on a simpler algorithm developed by Baeza-Yates and Perleberg.

Spaced seeds have been shown to yield higher sensitivity than contiguous seeds of the same length.

SHRiMP employs a combination of spaced seeds and the Smith-Waterman algorithm to align reads at the expense of speed.

Eland is a commercial alignment program available from Illumina that uses a hash-based algorithm with spaced seeds to align reads.

Spaced seeds

A template ‘111010010100110111’ requiring 11 matches at the ‘1’ positions is 55% more sensitive than BLAST’s default template ‘11111111111’ for two sequences of 70% similarity.

A seed allowing internal mismatches is called a spaced seed; the number of matches in the seed is its weight.

Eland was the first program that utilized spaced seeds in short-read alignment. It uses six seed templates spanning the entire short read such that a two-mismatch hit is guaranteed to be identified by at least one of the templates.

SOAP adopts almost the same strategy except that it indexes the genome rather than reads.

SeqMap and MAQ extend the method to allow k mismatches.
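A short sketch shows how a spaced seed tolerates mismatches at its '0' positions. It uses the 18-position, weight-11 template quoted above. The helper names `spaced_key` and `spaced_hit` are hypothetical; a real tool would store the keys in a hash table instead of comparing windows pairwise.

```python
TEMPLATE = "111010010100110111"  # template quoted above: span 18, weight 11


def spaced_key(seq, start, template=TEMPLATE):
    """Return the characters of seq under the '1' positions of the template.

    Returns None if the template window would run past the end of seq.
    """
    if start + len(template) > len(seq):
        return None
    return "".join(seq[start + i] for i, t in enumerate(template) if t == "1")


def spaced_hit(a, i, b, j, template=TEMPLATE):
    """True if a[i:] and b[j:] agree at every '1' position of the template."""
    key_a = spaced_key(a, i, template)
    key_b = spaced_key(b, j, template)
    return key_a is not None and key_a == key_b


if __name__ == "__main__":
    read = "ACGTACGTACGTACGTAC"
    ref = "ACGAACGTACATACGTAC"  # differs from read at template '0' positions 3 and 10
    print(spaced_hit(read, 0, ref, 0))
```

Differences that fall under a '0' are ignored by the key, so the two windows still produce the same hash key even though no contiguous 11-mer is shared.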

Burrows-Wheeler Transform

Used by Bowtie: Langmead et al. Genome Biology 2009 10:R25 doi:10.1186/gb-2009-10-3-r25

These algorithms rely on two steps: identifying exact matches, and building inexact alignments supported by the exact matches.

The first step uses a suffix tree, an enhanced suffix array or an FM-index.

The design of the FM-index is based upon the relationship between the Burrows-Wheeler compression algorithm and the suffix array data structure.

The advantage of using a trie is that alignment to multiple identical copies of a substring in the reference needs to be done only once.

What is a suffix tree?

S = M A L A Y A L A M $
    1 2 3 4 5 6 7 8 9 10

[Figure: suffix tree of S. Each leaf is labelled with the start position (1 to 10) of the suffix it spells.]

Finding a (short) Pattern in a (long) String

• Build a suffix tree of the string.
• Starting from the root, traverse a path matching characters of the pattern.
• If stuck, pattern not present in string.
• Otherwise, each leaf below gives a position of the pattern in the string.

Finding a Pattern in a String

Find “ALA” in MALAYALAM$.

[Figure: suffix tree of MALAYALAM$ used to look up “ALA”.]

Two matches, at positions 6 and 2.
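The lookup can be sketched with an uncompressed suffix trie, a simpler relative of the suffix tree. This is an assumption made to keep the example short: the trie needs quadratic space, whereas a real suffix tree merges single-child paths. Function names are hypothetical, and positions are 1-based to match the slide.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie of text.

    Every node records the 1-based start positions of all suffixes that pass
    through it, so a lookup does not need to walk down to the leaves.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find(root, pattern):
    """Return the sorted 1-based positions of pattern, or [] if it is absent."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # stuck: the pattern is not present in the string
    return sorted(node["starts"])


if __name__ == "__main__":
    trie = build_suffix_trie("MALAYALAM$")
    print(find(trie, "ALA"))
```

The search time depends only on the pattern length, not on the length of the string, which is the property that makes suffix structures attractive for mapping many short reads to one long reference.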

Suffix Array

Suffixes of abracadabra (11): abracadabra, bracadabra, racadabra, etc.

Order lexicographically:

a abra abracadabra acadabra adabra bra bracadabra cadabra dabra ra racadabra

The suffix array is an array of the suffix start indices (counted from 1 or 0) in lexicographical order of the suffixes. For the string "abracadabra", counting from 1, the suffix array is {11,8,1,4,6,9,2,5,7,10,3}, because suffix "a" starts at the 11th letter, "abra" starts at the 8th letter, etc.
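A minimal sketch of building this suffix array and searching it with binary search follows. It sorts the suffixes directly, which is simple but not how production tools construct the array; the function names are hypothetical, and positions are 1-based as in the example above.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_with_suffix_array(text, sa, pattern):
    """Binary search for the block of suffixes that start with pattern."""
    m = len(pattern)

    def prefix(rank):
        return text[sa[rank] - 1:sa[rank] - 1 + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "abracadabra"
    sa = suffix_array(text)
    print(sa)
    print(find_with_suffix_array(text, sa, "abra"))
```

All occurrences of a pattern sit next to each other in the sorted order, so two binary searches are enough to find them.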

Sort the rows

Rotations of T = mississippi#:

  mississippi#
  ississippi#m
  ssissippi#mi
  sissippi#mis
  issippi#miss
  ssippi#missi
  sippi#missis
  ippi#mississ
  ppi#mississi
  pi#mississip
  i#mississipp
  #mississippi

Sorted rows, with first column F and last column L:

  F              L
  #  mississipp  i
  i  #mississip  p
  i  ppi#missis  s
  i  ssippi#mis  s
  i  ssissippi#  m
  m  ississippi  #
  p  i#mississi  p
  p  pi#mississ  i
  s  ippi#missi  s
  s  issippi#mi  s
  s  sippi#miss  i
  s  sissippi#m  i

• Every column is a permutation of T.
• Given row i, char L[i] precedes F[i] in original T.
• Consecutive char’s in L are adjacent to similar strings in T.
• Therefore L usually contains long runs of identical char’s.
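The construction can be sketched directly by sorting all rotations. This naive approach is only practical for short strings, and the function names are hypothetical. It assumes the end marker '#' sorts before the letters, which holds for Python's default string ordering.

```python
def bwt_matrix(text):
    """Return all rotations of text in sorted order (the Burrows-Wheeler matrix)."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text):
    """Return the first column F and last column L of the sorted rotation matrix."""
    rows = bwt_matrix(text)
    first = "".join(row[0] for row in rows)
    last = "".join(row[-1] for row in rows)
    return first, last


if __name__ == "__main__":
    F, L = bwt("mississippi#")
    print(F)
    print(L)
```

L is the transform itself; F never needs to be stored because it can be obtained by sorting L.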

Reminder: Recovering T from L

1. Find F by sorting L.
2. First char of T? m
3. Find m in L.
4. L[i] precedes F[i] in T. Therefore we get mi.
5. How do we choose the correct i in L?
   • The i’s are in the same order in L and F.
   • As are the rest of the char’s.
6. i is followed by s: mis
7. And so on…
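The steps above can be sketched in a few lines. The sketch relies on two assumptions: the end marker occurs exactly once in T, and Python's sort is stable, so equal characters keep the same relative order in F as in L (the property used in step 5). The function name is hypothetical.

```python
def inverse_bwt(last, end="#"):
    """Recover T from its Burrows-Wheeler transform L, reading T left to right.

    Assumes `end` is a unique marker that terminates T.
    """
    # order[j] is the row of L that holds the same character occurrence as F[j];
    # the stable sort keeps equal characters in the same relative order.
    order = sorted(range(len(last)), key=lambda i: last[i])
    first = [last[i] for i in order]  # step 1: F is L sorted

    row = last.index(end)  # the row ending in the marker is T itself
    text = []
    for _ in range(len(last)):
        text.append(first[row])  # L[row] precedes F[row] in T
        row = order[row]         # jump to the row where that occurrence sits in L
    return "".join(text)


if __name__ == "__main__":
    print(inverse_bwt("ipssm#pissii"))
```

Bowtie's UNPERMUTE walks in the opposite direction (from the end of T backwards) using the LF mapping, but it relies on the same rank-preserving property.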

MTF

• Replace each char in L with the number of distinct char’s seen since its last occurrence.
• Keep MTF[1,…,|Σ|] array, sorted lexicographically.
• Runs of identical char’s are transformed into runs of zeroes in L(MTF).

L      = i p s s m # p i s s i i
L(MTF) = 1 3 4 0 4 4 3 4 4 0 1 0

MTF array (indices 0 1 2 3 4) at the start and after each of the first three char’s:

  # i m p s   (initial)
  i # m p s   (after i)
  p i # m s   (after p)
  s p i # m   (after s)

And so on…

• Bad example
• For larger texts we will receive more runs of zeroes, and a dominance of smaller numbers.
• The reason being that BWT creates clusters of similar char’s.
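A minimal sketch of move-to-front coding as described here, with hypothetical function names. The symbol list starts in lexicographic order, and each emitted number is the current list index of the character, which is then moved to the front.

```python
def mtf_encode(text, alphabet=None):
    """Move-to-front encode text; the symbol list starts in sorted order."""
    symbols = sorted(set(text)) if alphabet is None else list(alphabet)
    codes = []
    for ch in text:
        idx = symbols.index(ch)
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))  # move the symbol to the front
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode, given the same initial alphabet order."""
    symbols = list(alphabet)
    chars = []
    for idx in codes:
        ch = symbols.pop(idx)
        chars.append(ch)
        symbols.insert(0, ch)
    return "".join(chars)


if __name__ == "__main__":
    print(mtf_encode("ipssm#pissii"))
```

A repeated character is always at index 0 when it is seen again, which is why runs in L become runs of zeroes.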

Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column).
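The exact-matching step described in the caption can be sketched as a backward search over the transform. This is a simplified illustration, not Bowtie's implementation: occurrence counts are recomputed by scanning the transform instead of using the sampled tables of a real FM-index, and the names `build_fm_index` and `exact_match` are hypothetical. Positions are 0-based.

```python
def build_fm_index(text, end="$"):
    """Return (suffix_array, bwt, first_occurrence) for text plus an end marker."""
    text = text + end
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] wraps around to the end marker when i == 0.
    bwt = "".join(text[i - 1] for i in sa)
    first_occurrence = {}
    for row, i in enumerate(sa):
        first_occurrence.setdefault(text[i], row)  # first row starting with this char
    return sa, bwt, first_occurrence


def exact_match(pattern, sa, bwt, first_occurrence):
    """Return the sorted 0-based positions where pattern occurs exactly."""
    top, bot = 0, len(bwt)  # half-open range of matrix rows
    for ch in reversed(pattern):  # the pattern is consumed right to left
        if ch not in first_occurrence:
            return []
        top = first_occurrence[ch] + bwt[:top].count(ch)
        bot = first_occurrence[ch] + bwt[:bot].count(ch)
        if top >= bot:
            return []
    return sorted(sa[top:bot])


if __name__ == "__main__":
    sa, transform, first = build_fm_index("acaacg")
    print(transform)
    print(exact_match("aac", sa, transform, first))
```

Each step narrows the range of rows whose suffixes are prefixed by a progressively longer suffix of the query.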


Li, H. et al. Brief Bioinform 2010 0:bbq015v1-15; doi:10.1093/bib/bbq015

Data structures based on a prefix tree

(A) Prefix trie of string AGGAGC where symbol ^ marks the start of the string. The two numbers in each node give the suffix array interval of the substring represented by the node, which is the string concatenation of edge symbols from the node to the root. (B) Compressed prefix trie by contracting nodes with in- and out-degree both being one. (C) Prefix tree by representing the substring on each edge as the interval on the original string. (D) Prefix directed word graph (prefix DAWG) created by collapsing nodes of the prefix trie with identical suffix array interval. (E) Constructing the suffix array and Burrows–Wheeler transform of AGGAGC.

String: AGGAGC

Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'.
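The idea of backtracking to an inexact alignment can be sketched by letting the backward search also try substituted characters, up to a mismatch budget. This is an exhaustive toy version under stated assumptions: Bowtie's actual aligner is quality-aware and uses further tricks to limit backtracking. All names and the example reference 'acggtgca' are hypothetical.

```python
def build_fm_index(text, end="$"):
    """Return (suffix_array, bwt, first_occurrence) for text plus an end marker."""
    text = text + end
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first_occurrence = {}
    for row, i in enumerate(sa):
        first_occurrence.setdefault(text[i], row)
    return sa, bwt, first_occurrence


def match_with_mismatches(query, sa, bwt, first_occurrence, max_mismatches=1, end="$"):
    """Return sorted 0-based positions matching query with at most max_mismatches substitutions."""
    hits = set()

    def extend(i, top, bot, used):
        if top >= bot:
            return  # no reference suffix supports this branch
        if i < 0:
            hits.update(sa[top:bot])  # whole query consumed
            return
        for ch in first_occurrence:
            if ch == end:
                continue
            cost = 0 if ch == query[i] else 1  # substituting ch costs one mismatch
            if used + cost > max_mismatches:
                continue
            extend(i - 1,
                   first_occurrence[ch] + bwt[:top].count(ch),
                   first_occurrence[ch] + bwt[:bot].count(ch),
                   used + cost)

    extend(len(query) - 1, 0, len(bwt), 0)
    return sorted(hits)


if __name__ == "__main__":
    sa, transform, first = build_fm_index("acggtgca")
    print(match_with_mismatches("ggta", sa, transform, first, max_mismatches=0))
    print(match_with_mismatches("ggta", sa, transform, first, max_mismatches=1))
```

With a budget of zero the query finds nothing; with one allowed mismatch the search succeeds after replacing the final 'a' by 'g'.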

Role of paired-end and mate-pair mapping

Some sequencing technologies produce read pairs such that the two reads are known to be close to each other in physical chromosomal distance.

These reads are called paired-end or mate-pair reads.

- With this mate-pair information, a repetitive read will be reliably placed if its mate can be placed unambiguously.

- Alignment errors may be detected and fixed when wrong alignments break the mate-pair requirement.
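As a rough sketch of the first point, a repetitive read with several candidate positions can be placed by keeping only the candidate that lies at a plausible distance from its uniquely mapped mate. The function name and the distance bounds are assumptions for illustration; the sketch ignores strand orientation and assumes all positions are on the same chromosome.

```python
def place_with_mate(candidates, mate_position, min_distance=200, max_distance=500):
    """Pick the candidate position consistent with the mate, or None if ambiguous.

    candidates: possible positions of a repetitive read.
    mate_position: the unambiguous position of its mate.
    The distance bounds are illustrative and depend on the sequencing library.
    """
    consistent = [p for p in candidates
                  if min_distance <= abs(mate_position - p) <= max_distance]
    return consistent[0] if len(consistent) == 1 else None


if __name__ == "__main__":
    print(place_with_mate([1000, 50000, 90000], mate_position=50300))
```

If zero or several candidates satisfy the constraint, the read stays unplaced rather than being assigned arbitrarily.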

Effect of paired end alignment

Effect of quality values

Aligning bisulfite-treated reads

Bisulfite sequencing is a technology to identify methylation patterns.

- Cytosines with underlines are not methylated.
- Denaturation and bisulfite treatment will convert these cytosines to uracils.
- After amplification, four different sequences result from the original double-strand DNA.

Aligning bisulfite reads

1) Increased search space due to the cytosine-thymine conversion in the bisulfite treatment.

2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.

Xi and Li BMC Bioinformatics 2009 10:232 doi:10.1186/1471-2105-10-232

Aligning bisulfite-treated reads

- Two reference sequences are prepared: one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference), the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference).

- Alignment: ‘C’ bases in the reads are converted to ‘T’ bases and the reads are mapped to the C-to-T reference (a C–T mismatch is then effectively regarded as a match); a similar procedure is performed for the G-to-A conversion in the next round of alignment.

- The results from the two rounds of alignment are combined to generate the final report. If there are no mutations or sequencing errors, a bisulfite-treated read can always be mapped exactly in one of the two rounds.
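The two-round procedure can be sketched with plain string search standing in for a real aligner. The function names are hypothetical, and the sketch assumes exact matching on the forward strand only, with no reverse-complement handling.

```python
def c_to_t(seq):
    """In-silico conversion of every C to T."""
    return seq.replace("C", "T")


def g_to_a(seq):
    """In-silico conversion of every G to A."""
    return seq.replace("G", "A")


def bisulfite_align(read, reference):
    """Return (round_name, position) for every exact hit in the two rounds."""
    hits = []
    for name, convert in (("C-to-T", c_to_t), ("G-to-A", g_to_a)):
        converted_reference = convert(reference)
        converted_read = convert(read)
        pos = converted_reference.find(converted_read)
        while pos != -1:
            hits.append((name, pos))
            pos = converted_reference.find(converted_read, pos + 1)
    return hits


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = "CGTTT"  # from reference[1:6] = CGTCC; first C methylated, the other two converted
    print(bisulfite_align(read, reference))
```

The unconverted read does not occur in the unconverted reference, but after both are converted the C-to-T round finds it. The price is a reduced alphabet, which is the increased search space mentioned earlier.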

Aligning spliced reads

RNA-seq produces reads from transcribed sequences with introns and intergenic regions excluded.

When RNA-seq reads are aligned against the genomic sequence, a read may be mapped to a splicing junction.

This will fail with a standard alignment algorithm.

-> Special alignment e.g. TopHat
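Why a junction read defeats a contiguous aligner, and how a split alignment recovers it, can be shown with a deliberately naive sketch. This is not TopHat's pipeline; the function name, the minimum anchor length and the toy sequences are assumptions made for the example.

```python
def spliced_align(read, genome, min_anchor=4):
    """Return alignment blocks as (genome_start, length), or None if unmapped.

    First tries a contiguous exact match, then every split of the read into two
    exact pieces separated by at least one skipped genome base (the intron).
    """
    pos = genome.find(read)
    if pos != -1:
        return [(pos, len(read))]
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right piece must start downstream of the left piece.
            right_pos = genome.find(right, left_pos + cut + 1)
            if right_pos != -1:
                return [(left_pos, cut), (right_pos, len(right))]
            left_pos = genome.find(left, left_pos + 1)
    return None


if __name__ == "__main__":
    exon1, intron, exon2 = "AACCGGTTAC", "GTAAGTTTTTCAG", "TTGACCATGG"
    genome = exon1 + intron + exon2
    read = exon1[-6:] + exon2[:6]  # read spanning the exon-exon junction
    print(spliced_align(read, genome))
```

Trying every split against the whole genome is far too slow and ambiguous for real data, which is one reason dedicated spliced aligners exist.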


Trapnell, C. et al. Bioinformatics 2009 25:1105-1111; doi:10.1093/bioinformatics/btp120

TOPHAT Pipeline

Alternative splicing: One gene, several proteins!

[Figure: a eukaryotic gene (exons & introns) is spliced in two alternative ways into mature splice variants I and II, and each variant is translated.]

Types of alternative splicing

TopHat and Cufflinks

- Use next generation sequence data for alternative splicing

Comparison of some mapping programs

Table 1: Popular short-read alignment software

Program | Algorithm | SOLiD | Long (a) | Gapped | PE (b) | Q (c)
Bfast | hashing ref. | Yes | No | Yes | Yes | No
Bowtie | FM-index | Yes | No | No | Yes | Yes
BWA | FM-index | Yes (d) | Yes (e) | Yes | Yes | No
MAQ | hashing reads | Yes | No | Yes (f) | Yes | Yes
Mosaik | hashing ref. | Yes | Yes | Yes | Yes | No
Novoalign (g) | hashing ref. | No | No | Yes | Yes | Yes

(a) Work well for Sanger and 454 reads, allowing gaps and clipping.
(b) Paired end mapping.
(c) Make use of base quality in alignment.
(d) BWA trims the primer base and the first color for a color read.
(e) Long-read alignment implemented in the BWA-SW module.
(f) MAQ only does gapped alignment for Illumina paired-end reads.
(g) Free executable for non-profit projects only.
