MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads 2009-09-10 Hua Bao Sun Yat-sen University, Guangzhou, China Evolution.sysu.edu.cn InCoB 2009


MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads

2009-09-10

Hua Bao

Sun Yat-sen University, Guangzhou, China

Evolution.sysu.edu.cn

InCoB 2009

Next-generation sequencing

• High throughput (tens of millions of reads per lane)

• Short read length (25-50 bp)

• Sequencing error rate is higher than that of Sanger sequencing

• Applications: genome sequencing, transcriptome sequencing, pooled population sequencing

The objective

1. Unspliced alignment of reads onto the genome

2. Spliced alignment of transcript reads over exon-intron boundaries

3. SNP detection from population sequences

Seed hash table

Read 1 TACACCACGGTCAGACTTGCATCACAACTGTTAAGC

Read 2 AGACTTGCATCACAACTGTTAAGCTACACCACGGTC

Read n … …

Seed hash table

TACACCACGGTC

Position 1, Read 1, + ; Position 25, Read 2, + ;…

AGACTTGCATCA

Position 13, Read 1, + ; Position 1, Read 2, +; …

TGATGCAAGTCT

Position 25, Read 1, - ; Position 13, Read 2, -; …

Other seed (K-mer) … …

GACCGTGGTGTA

Position 1, Read 1, - ; Position 25, Read 2, - ;…

Seed hash table

Coding: A: 0, T: 1, G: 2, C: 3

k-mer: CCGATT

key = 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0
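The key computation above can be sketched in a few lines of Python. This is a minimal illustration, not MapNext's actual code: the function name `kmer_key` and the table name `BASE_CODE` are hypothetical, and only the base coding (A: 0, T: 1, G: 2, C: 3) comes from the slide.

```python
# Base coding as given on the slide: A=0, T=1, G=2, C=3.
BASE_CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def kmer_key(kmer: str) -> int:
    """Encode a k-mer as a base-4 integer, leftmost base most significant."""
    key = 0
    for base in kmer:
        # Shift the running key by one base-4 digit and add the new base code.
        key = key * 4 + BASE_CODE[base]
    return key


# CCGATT -> 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0
print(kmer_key("CCGATT"))  # 3973
```

Because every k-mer of a fixed length maps to a distinct integer between 0 and 4^k - 1, the key can index directly into an array of seed lists.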

Seed hash table

[0] (read id, position, strand)

[1]

[2]

[..]

[n] (1,1,+) (2,13,-) …

Reads

[0] Read sequence

[1] CCGATTGGCTAAA …

[2]

[..]

[n]

Key computation of the seed: Key = n
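A seed hash table of this shape could be built as in the sketch below. Several details are assumptions rather than facts from the slides: seeds are taken as non-overlapping k-mers (suggested by positions 1, 13 and 25 in the 12-mer example), positions are 1-based, and a minus-strand position is counted along the reverse-complemented read. The names `build_seed_table`, `kmer_key` and `reverse_complement` are hypothetical.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "T": 1, "G": 2, "C": 3}   # coding from the slide
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def kmer_key(kmer):
    """Base-4 integer key of a k-mer, leftmost base most significant."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_CODE[base]
    return key


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_seed_table(reads, k):
    """Map seed key -> list of (read id, position, strand) entries."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for strand, seq in (("+", read), ("-", reverse_complement(read))):
            # Assumption: non-overlapping seeds, 1-based positions.
            for start in range(0, len(seq) - k + 1, k):
                key = kmer_key(seq[start:start + k])
                table[key].append((read_id, start + 1, strand))
    return table


reads = [
    "TACACCACGGTCAGACTTGCATCACAACTGTTAAGC",  # Read 1 from the slide
    "AGACTTGCATCACAACTGTTAAGCTACACCACGGTC",  # Read 2 from the slide
]
table = build_seed_table(reads, k=12)
print(table[kmer_key("TACACCACGGTC")])  # [(1, 1, '+'), (2, 25, '+')]
```

Looking up a seed is then a single dictionary access, which is what makes the per-seed lookup constant time.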

Unspliced alignment

Genome TACACCACGGTCAGACTTGCATCA …

Seed hash table

[0] (read id, position, strand)

[1]

[2]

[3] (1,1,+) (2,13,-) …

[n]

Key=3

Reads

[0] Read sequence

[1]

[2]

[3]

[n]

Extension

O(1)

K-mer: 8-12 bp; step size: 1 bp
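The unspliced procedure (slide a k-mer window along the genome 1 bp at a time, look each k-mer up in the read seed table, then extend the hit) might look like the sketch below. It is an illustration under stated assumptions, not the MapNext implementation: k-mer strings stand in for the integer keys, seeds are non-overlapping, extension is ungapped, and the mismatch limit `max_mismatches` is a hypothetical parameter. All function names are made up for the example.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def index_reads(reads, k):
    """Seed table: k-mer -> list of (read index, 0-based offset, strand)."""
    table = defaultdict(list)
    for rid, read in enumerate(reads):
        for strand, seq in (("+", read), ("-", reverse_complement(read))):
            for off in range(0, len(seq) - k + 1, k):
                table[seq[off:off + k]].append((rid, off, strand))
    return table


def align_unspliced(genome, reads, k=8, max_mismatches=2):
    """Return sorted (read index, 0-based genome start, strand, mismatches)."""
    table = index_reads(reads, k)
    oriented = {"+": reads, "-": [reverse_complement(r) for r in reads]}
    hits = set()
    for gpos in range(len(genome) - k + 1):            # step size of 1 bp
        for rid, off, strand in table.get(genome[gpos:gpos + k], ()):
            seq = oriented[strand][rid]
            start = gpos - off                         # implied start of the read
            if start < 0 or start + len(seq) > len(genome):
                continue
            window = genome[start:start + len(seq)]
            # Ungapped extension: compare the whole read with the genome window.
            mismatches = sum(a != b for a, b in zip(seq, window))
            if mismatches <= max_mismatches:
                hits.add((rid, start, strand, mismatches))
    return sorted(hits)
```

Hashing the reads rather than the genome keeps the table size proportional to the read set, and the genome is only streamed once.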

Spliced alignment

Genome TACACCACGGTCAGACTTGCATCA …

Hash table

[0] (read id, position, strand)

[1]

[2] (1,H,+) (2,T,-) …

[n]

Key=2

Seed hit list

[0] (genome position, read position, strand)

[1] (1,H,+) (780,T,+) …

[2] (1,T,-) …

O(1)

Reads

[0] Read sequence

[1] TACACCACG …

[2]

[n]

K-mer: 6-10 bp; step size: 1 bp

TACACCACGGTCAGA GTGCCATGGCTAGT

TACACCACGGTCAGAgtac … ccagGTGCCATGGCTAGT

1 780
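The spliced procedure shown here (hash the head and tail seed of each read, collect seed hits along the genome, then join a head hit and a tail hit across an intron) can be sketched as follows. This is a simplified illustration, not MapNext's algorithm as implemented: it handles the forward strand only, requires exact matches, assumes reads are longer than k, and prefers a GT...AG intron when several split points fit. The `max_intron` limit and the GT...AG preference are assumptions, and `align_spliced` is a hypothetical name.

```python
from collections import defaultdict


def align_spliced(genome, reads, k=6, max_intron=1000):
    """Return {read index: [(left start, split, right start), ...]}, 0-based.

    read[:split] matches the genome at left start, read[split:] at right start.
    """
    # Hash table of head (H) and tail (T) seeds, forward strand only.
    table = defaultdict(list)
    for rid, read in enumerate(reads):
        table[read[:k]].append((rid, "H"))
        table[read[-k:]].append((rid, "T"))

    # Seed hit list per read, filled by sliding along the genome 1 bp at a time.
    hits = defaultdict(lambda: {"H": [], "T": []})
    for gpos in range(len(genome) - k + 1):
        for rid, end in table.get(genome[gpos:gpos + k], ()):
            hits[rid][end].append(gpos)

    results = defaultdict(list)
    for rid, ends in hits.items():
        read, n = reads[rid], len(reads[rid])
        for h in ends["H"]:
            for t in ends["T"]:
                intron = (t + k) - n - h               # gap implied by this H/T pair
                if not 0 < intron <= max_intron:
                    continue
                left = 0                               # extend the head block rightwards
                while left < n and genome[h + left] == read[left]:
                    left += 1
                right = 0                              # extend the tail block leftwards
                while right < n and genome[t + k - 1 - right] == read[n - 1 - right]:
                    right += 1
                lo, hi = max(n - right, 1), min(left, n - 1)
                if lo > hi:
                    continue                           # the two blocks do not cover the read

                def canonical(s):
                    a, b = h + s, h + s + intron       # the intron is genome[a:b]
                    return genome[a:a + 2] == "GT" and genome[b - 2:b] == "AG"

                split = next((s for s in range(lo, hi + 1) if canonical(s)), lo)
                results[rid].append((h, split, h + split + intron))
    return dict(results)


# Exons from the slide; the intron filler between "GTAC" and "CCAG" is made up.
exon1, exon2 = "TACACCACGGTCAGA", "GTGCCATGGCTAGT"
genome = exon1 + "GTAC" + "T" * 50 + "CCAG" + exon2
print(align_spliced(genome, [exon1 + exon2]))  # {0: [(0, 15, 73)]}
```

Because the junction is located by extending from both ends of the read, no splice-site model has to be trained beforehand; the GT...AG check only breaks ties between equally good split points.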

Accuracy of alignment

The query dataset consisted of 1,893,118 simulated 35 bp reads (134,274 spliced and 1,758,844 unspliced) generated from 5,796 coding DNA sequences on chromosome I of Arabidopsis thaliana.

Program  Unspliced alignment                                      Spliced alignment
         True positive (%)  False positive (%)  Running time (s)  True positive (%)  False positive (%)  Running time (min)
SHRiMP   94.79              8.97                809               N/A                N/A                 N/A
SeqMap   96.50              6.71                447               N/A                N/A                 N/A
SOAP     96.41              6.72                101               N/A                N/A                 N/A
MAQ      96.53              6.73                138               N/A                N/A                 N/A
Qpalma   N/A                N/A                 N/A               84.17              4.45                557
MapNext  96.51              6.71                209               86.89              4.31                231

SNP detection from population sequences

… TACACACGGTCAGACTAGCATCAGTCCGTAATGCT …
      CACGGTCAGACGAGCATCAGTCC
    CACACGGTCAGACGAGCATCAGT
         GGTCAGACGAGCATCAGTCCGTA
            CAGACTAGCATCAGTCCGTAATG
    CACACGGTCAGACTAGCATCAGT
         GGTCAGACTAGCATCAGACCGTA
         GGTCAGACTAGCATCAGTCCGTA
        CGGTCAGACTAGCATCAGTCCG

Quality control: minimum quality score (MQS), minimum neighbour quality score (MNQS)

Significance control: minimum coverage (MC), minimum minor allele frequency (MMAF)

SNP detection from population sequences

Clustered short reads

1. Reads that passed QC? (Y: continue; N: excluded)

2. Polymorphic sites covered by at least MC reads? (Y: continue; N: excluded)

3. Frequency of the minor allele higher than MMAF? (Y: continue; N: excluded)

Candidate SNPs
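The decision steps of this flowchart can be written as one small filter per site. The sketch below is an assumption-laden illustration, not the MapNext code: each observation is taken to be a (base, base quality, lowest neighbouring-base quality) tuple, the default thresholds are the values quoted for the simulation (25, 20, 50 and 0.01), and the function name `call_snp` is hypothetical.

```python
from collections import Counter


def call_snp(observations, mqs=25, mnqs=20, mc=50, mmaf=0.01):
    """Return a candidate SNP record for one site, or None if a check fails.

    observations: iterable of (base, base quality, lowest neighbour quality).
    """
    # Quality control: keep bases that pass both MQS and MNQS.
    bases = [b for b, q, nq in observations if q >= mqs and nq >= mnqs]

    # Significance control 1: the site must be covered by at least MC reads.
    if len(bases) < mc:
        return None

    counts = Counter(bases).most_common()
    if len(counts) < 2:
        return None                      # only one allele observed

    (major, _), (minor, minor_count) = counts[0], counts[1]
    maf = minor_count / len(bases)       # estimated minor allele frequency

    # Significance control 2: minor allele frequency must be higher than MMAF.
    if maf <= mmaf:
        return None
    return {"major": major, "minor": minor, "maf": maf, "coverage": len(bases)}
```

For example, a site with 45 high-quality T observations and 10 high-quality G observations would be reported with an estimated minor allele frequency of 10/55, while G observations with a base quality below MQS would be dropped before counting.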

Accuracy of SNP detection from population sequencing

Coverage   True positive    False positive
4X         1961 (90.70%)    690 (29.51%)
6X         1998 (92.41%)    23 (1.06%)
8X         2015 (93.20%)    8 (0.37%)
10X        2043 (94.50%)    0 (0.00%)
12X        2068 (95.65%)    0 (0.00%)

The simulation contained 2162 true SNPs across 50 haploid individuals; the true-positive percentages are relative to these 2162 SNPs. Coverage is the sequencing depth per individual. MQS, MNQS, MMAF and MC were set to 25, 20, 0.01 and 50 (1X per individual), respectively.

Accuracy of MAF estimation from population sequencing

[Figure: estimated minor allele frequency (y-axis, 0.00 to 0.48) plotted against real minor allele frequency (x-axis, 0.00 to 0.48)]

Summary

1. MapNext supports both spliced and unspliced alignment of short reads. Spliced alignment requires no training process.

2. MapNext can detect SNPs and estimate minor allele frequency from population sequences.


Thank you!
