MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads
2009-09-10
Hua Bao
Sun Yat-sen University, Guangzhou, China
Evolution.sysu.edu.cn
InCoB 2009
Next-generation sequencing
• High throughput (tens of millions of reads per lane)
• Short read length (25-50 bp)
• Higher sequencing error rate than Sanger sequencing
• Applications: genome sequencing, transcriptome sequencing, pooled population sequencing
The objective
1. Unspliced alignment of reads onto the genome
2. Spliced alignment of transcript reads over exon-intron boundaries
3. SNP detection from population sequences
Seed hash table
Read 1 TACACCACGGTCAGACTTGCATCACAACTGTTAAGC
Read 2 AGACTTGCATCACAACTGTTAAGCTACACCACGGTC
Read n … …
Seed hash table
TACACCACGGTC
Position 1, Read 1, + ; Position 25, Read 2, + ;…
AGACTTGCATCA
Position 13, Read 1, + ; Position 1, Read 2, +; …
TGATGCAAGTCT
Position 13, Read 1, - ; Position 1, Read 2, -; …
Other seed (K-mer) … …
GACCGTGGTGTA
Position 1, Read 1, - ; Position 25, Read 2, - ;…
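The table on this slide can be sketched in a few lines of Python. This is only an illustration, not MapNext's implementation: the function names, the non-overlapping 12-bp seeds, and the convention that a minus-strand entry stores the position of the forward seed it was derived from are all assumptions made for the example.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_seed_table(reads, k, step):
    """Map each seed (k-mer) to a list of (position, read id, strand) entries.

    Positions and read ids are 1-based, as on the slide. The minus-strand
    entry is filed under the reverse complement of the seed but keeps the
    forward position (an assumption of this sketch).
    """
    table = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for pos in range(0, len(read) - k + 1, step):
            seed = read[pos:pos + k]
            table[seed].append((pos + 1, read_id, "+"))
            table[reverse_complement(seed)].append((pos + 1, read_id, "-"))
    return table

reads = [
    "TACACCACGGTCAGACTTGCATCACAACTGTTAAGC",  # Read 1
    "AGACTTGCATCACAACTGTTAAGCTACACCACGGTC",  # Read 2
]
table = build_seed_table(reads, k=12, step=12)
print(table["TACACCACGGTC"])
print(table["GACCGTGGTGTA"])
```

Looking up a seed is then a single dictionary access, which is what makes the later genome scan cheap.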
Seed hash table
Coding: A = 0, T = 1, G = 2, C = 3
k-mer CCGATT: key = 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0 = 3973
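The key is simply the k-mer read as a base-4 number under the coding above. A minimal sketch (the names `CODE` and `kmer_key` are hypothetical, chosen for this example):

```python
# Base coding taken from the slide: A = 0, T = 1, G = 2, C = 3.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}

def kmer_key(kmer):
    """Interpret a k-mer as a base-4 integer, most significant base first."""
    key = 0
    for base in kmer:
        key = key * 4 + CODE[base]
    return key

print(kmer_key("CCGATT"))  # 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0 = 3973
```

Because every k-mer maps to a distinct integer in the range 0 to 4^k - 1, the key can index directly into an array of 4^k slots.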
Seed hash table
[0] (read id, position, strand)
[1]
[2]
[..]
[n] (1,1,+) (2,13,-) …
Reads
[0] Read sequence
[1] CCGATTGGCTAAA …
[2]
[..]
[n]
Key computation of the seed Key=n
Unspliced alignment
Genome TACACCACGGTCAGACTTGCATCA …
Seed hash table
[0] (read id, position,strand)
[1]
[2]
[3] (1,1,+) (2,13,-) …
[n]
Key=3
Reads
[0] Read sequence
[1]
[2]
[3]
[n]
Extension
O(1)
K-mer: 8-12 bp; step size: 1 bp
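The unspliced scheme (slide a k-mer window along the genome in 1-bp steps, look each k-mer up in the read table, then extend) might look roughly like the sketch below. It is deliberately simplified and is not MapNext's code: only the first k bases of each read are indexed, the extension is ungapped, and the names and the `max_mismatches` threshold are assumptions.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def index_reads(reads, k):
    """Index the leading k-mer of each read on both strands."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:k]].append((read_id, "+"))
        table[reverse_complement(read)[:k]].append((read_id, "-"))
    return table

def align_unspliced(genome, reads, k=8, max_mismatches=2):
    """Return (read id, 1-based genome position, strand, mismatches) hits."""
    table = index_reads(reads, k)
    hits = []
    for g_pos in range(len(genome) - k + 1):  # step size: 1 bp
        # Constant-time lookup of the genome k-mer in the read table.
        for read_id, strand in table.get(genome[g_pos:g_pos + k], ()):
            read = reads[read_id]
            query = read if strand == "+" else reverse_complement(read)
            target = genome[g_pos:g_pos + len(query)]
            if len(target) < len(query):
                continue  # read would run off the end of the genome
            # Ungapped extension: compare the whole read against the genome.
            mismatches = sum(a != b for a, b in zip(query, target))
            if mismatches <= max_mismatches:
                hits.append((read_id, g_pos + 1, strand, mismatches))
    return hits

genome = "TTTACACCACGGTCAGACTTGCATCAGG"
reads = [genome[3:19], reverse_complement(genome[9:25])]
hits = align_unspliced(genome, reads)
print(hits)
```

Indexing the reads rather than the genome keeps memory proportional to the read set, at the cost of one full pass over the genome per run.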
Spliced alignment
Genome TACACCACGGTCAGACTTGCATCA …
Hash table
[0] (read id, posi,strand)
[1]
[2] (1,H,+) (2,T,-) …
[n]
Key=2
Seed hit list
[0] (Genome posi, read posi, strand)
[1] (1,H,+) (780,T,+) …
[2] (1,T,-) …
O(1)
Reads
[0] Read sequence
[1] TACACCACG …
[2]
[n]
K-mer: 6-10 bp; step size: 1 bp
TACACCACGGTCAGA GTGCCATGGCTAGT
TACACCACGGTCAGAgtac … ccagGTGCCATGGCTAGT
1 780
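One way to read this slide: the head (H) and tail (T) seeds of each read are hashed, genome hits are collected per read, and a head hit followed by a distant tail hit is extended inward until the two halves meet at a junction. The sketch below follows that reading under strong simplifying assumptions that are not from the talk: forward strand only, exact matching, and a preference for GT...AG intron boundaries when several split points fit. All names are hypothetical.

```python
from collections import defaultdict

def collect_seed_hits(genome, reads, k):
    """Per read, list the genome positions hit by its head (H) or tail (T) seed."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:k]].append((read_id, "H"))
        table[read[-k:]].append((read_id, "T"))
    hits = defaultdict(list)
    for g_pos in range(len(genome) - k + 1):
        for read_id, end in table.get(genome[g_pos:g_pos + k], ()):
            hits[read_id].append((g_pos, end))
    return hits

def find_junction(genome, read, head_pos, tail_pos, k):
    """Return 0-based (intron start, downstream exon start), or None."""
    n, tail_end = len(read), tail_pos + k
    if n < 2 * k or tail_end - head_pos <= n:
        return None  # read too short, or no room for an intron
    left = k  # extend rightwards from the head seed
    while left < n and read[left] == genome[head_pos + left]:
        left += 1
    right = k  # extend leftwards from the tail seed
    while right < n and read[n - 1 - right] == genome[tail_end - 1 - right]:
        right += 1
    if left + right < n:
        return None  # the two extensions do not cover the whole read
    lo, hi = max(k, n - right), min(left, n - k)
    for split in range(lo, hi + 1):  # prefer a GT...AG intron (assumption)
        donor, acceptor = head_pos + split, tail_end - (n - split)
        if genome[donor:donor + 2] == "GT" and genome[acceptor - 2:acceptor] == "AG":
            return donor, acceptor
    return head_pos + hi, tail_end - (n - hi)

def spliced_align(genome, reads, k=6, max_intron=5000):
    """Map read id to (exon 1 start, intron start, exon 2 start, exon 2 end)."""
    results = {}
    for read_id, read_hits in collect_seed_hits(genome, reads, k).items():
        heads = [p for p, end in read_hits if end == "H"]
        tails = [p for p, end in read_hits if end == "T"]
        for h in heads:
            for t in tails:
                if t + k - h - len(reads[read_id]) > max_intron:
                    continue
                junction = find_junction(genome, reads[read_id], h, t, k)
                if junction:
                    results[read_id] = (h, junction[0], junction[1], t + k)
    return results

exon1, exon2 = "TACACCACGGTCAGA", "GTGCCATGGCTAGT"
genome = exon1 + "GTAC" + "T" * 30 + "CCAG" + exon2
alignments = spliced_align(genome, [exon1 + exon2])
print(alignments)
```

The coordinates are half-open, so `genome[a:b] + genome[c:d]` for a result `(a, b, c, d)` reproduces the read. No trained splice-site model is involved, only seed lookups and extension.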
Accuracy of alignment
A query dataset of 1893118 reads (35 bp; 134274 spliced and 1758844 unspliced) was simulated from 5796 coding DNA sequences on chromosome I of Arabidopsis thaliana.
Program   Unspliced alignment             Spliced alignment
          TP (%)   FP (%)   Time (s)      TP (%)   FP (%)   Time (m)
SHRiMP    94.79    8.97     809           N/A      N/A      N/A
SeqMap    96.50    6.71     447           N/A      N/A      N/A
SOAP      96.41    6.72     101           N/A      N/A      N/A
MAQ       96.53    6.73     138           N/A      N/A      N/A
Qpalma    N/A      N/A      N/A           84.17    4.45     557
MapNext   96.51    6.71     209           86.89    4.31     231
TP: true positive; FP: false positive; Time: running time.
SNP detection from population sequences
… TACACACGGTCAGACTAGCATCAGTCCGTAATGCT …
      CACGGTCAGACGAGCATCAGTCC
    CACACGGTCAGACGAGCATCAGT
         GGTCAGACGAGCATCAGTCCGTA
            CAGACTAGCATCAGTCCGTAATG
    CACACGGTCAGACTAGCATCAGT
         GGTCAGACTAGCATCAGACCGTA
         GGTCAGACTAGCATCAGTCCGTA
        CGGTCAGACTAGCATCAGTCCG
Quality control : minimum quality score (MQS), minimum neighbour quality score (MNQS)
Significance control : minimum coverage (MC) , minimum minor allele frequency (MMAF)
SNP detection from population sequences
Input: clustered short reads
1. Did the reads pass QC? (Y: continue; N: reject)
2. Are the polymorphic sites covered by at least MC reads? (Y: continue; N: reject)
3. Is the minor allele frequency higher than MMAF? (Y: continue; N: reject)
Output: candidate SNPs
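The decision flow can be sketched for a single alignment column as follows. The default thresholds are the values quoted for the simulation in this talk (25, 20, 50 and 0.01); everything else is an assumption of the example: the function name, the input layout of (base, quality, lowest neighbouring quality) tuples, treating the second most common base as the minor allele, and whether each threshold is inclusive.

```python
from collections import Counter

def call_snp(column, mqs=25, mnqs=20, mc=50, mmaf=0.01):
    """Apply the QC, coverage and frequency filters to one alignment column.

    column: list of (base, quality, min_neighbour_quality) tuples, one per
    read covering the site. Returns (major allele, minor allele, estimated
    minor allele frequency), or None if the site is rejected.
    """
    # Quality control: drop bases failing MQS or MNQS.
    passed = [base for base, q, nq in column if q >= mqs and nq >= mnqs]
    # Significance control 1: minimum coverage (MC).
    if len(passed) < mc:
        return None
    counts = Counter(passed).most_common()
    if len(counts) < 2:
        return None  # monomorphic site
    (major, _), (minor, minor_count) = counts[0], counts[1]
    # Significance control 2: minimum minor allele frequency (MMAF).
    maf = minor_count / len(passed)
    if maf <= mmaf:
        return None
    return major, minor, maf

column = [("T", 30, 30)] * 45 + [("G", 30, 30)] * 15 + [("A", 10, 30)] * 5
print(call_snp(column))
```

In this toy column the five low-quality A calls are removed by the QC step before coverage and allele frequency are evaluated, so they cannot create a spurious third allele.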
Accuracy of SNP detection from population sequencing
Coverage True positive False positive
4X 1961 (90.70%) 690 (29.51%)
6X 1998 (92.41%) 23 (1.06%)
8X 2015 (93.20%) 8 (0.37%)
10X 2043 (94.50%) 0 (0.00%)
12X 2068 (95.65%) 0 (0.00%)
The simulation contained 2162 true SNPs among 50 haploid individuals. Coverage is the sequencing depth per individual. MQS, MNQS, MMAF and MC were set to 25, 20, 0.01 and 50 (1X per individual), respectively.
Accuracy of MAF estimation from population sequencing
[Scatter plot: estimated minor allele frequency (y-axis, 0.00 to 0.48) against real minor allele frequency (x-axis, 0.00 to 0.48)]
Summary
1. MapNext supports both spliced and unspliced alignment of short reads; spliced alignment requires no training process.
2. MapNext can detect SNPs and estimate minor allele frequencies from population sequences.