snp calling from next generation sequencing data

SNP calling from Next Generation Sequencing data

Hugo WillyFeb 1, 2012

Overview

Sequencing of samples

Mapping of the reads to Ref

Genome

Genotype determination for each locus

Comparison against

reference/normalCall SNPs

MAQ Li Heng et al. Genome Res. 2008

• Main concept:– Mapping quality

where u is the mapping position, x is the reference genome and z is the read. ps(u|x,z) is the correctness probability of the mapping location u given x and the read z.

p(z|x,u) = product of error probabilities of the mismatches of the read z mapped to pos u. The error is defined by the base quality e.g. base quality 20 denotes a probability of 10-(20/10) of miscalling the base.

Assume p(u|x) is uniform

MAQ (1)

• In practice, instead of summing over the genome, they use

• which effectively says

“read the supplementary for details”

MAQ (2)

• Uses paired end information to help mapping– The mapping quality of both reads are set to the sum

of their S score when there is a unique hit on both ends.

– Otherwise, just use the original S score for each read.– The paired ends is also used to identify short indel

• when there is a perfect mapping on one side and a very poor mapping on the other, use (local) Smith Waterman to find a gapped alignment within the paired end gap distance.

MAQ (3)• MAQ call SNPs separately for each library. Somatic SNPs are

found by checking the genotype of Normal and Tumor after SNPs are called.

• No paired end feature is considered for SNP calling.• At any genomic position, at most two different nucleotides can

be legitimately seen (assuming biallelic SNPs). Therefore, we can consider only the two most frequent nucleotides at any position and ignore others as errors.

• Assume we are observing data D which consist of k nucleotides b and n-k nucleotides b’. with (b,b’) {A,C,G,T}∈ and b ≠ b’. Then the three possible reference genotypes are <b,b>, <b,b’>, and <b’,b’>.

MAQ (4)

• If the true genotype is <b,b>, we have n-k errors from n bases. Denote it using P(D|<b,b>) = αn,n-k

• If the true genotype is <b’,b’>, we have k errors from n bases. Denote it using P(D|<b’,b’>) = αnk

• If the true genotype is <b,b’>, the probability can be approximated with a binomial distribution:

MAQ (5)• Assuming that the errors are not independent,

– where εi is the ith smallest base error probability.– c’nk is a function of εi but varies little with εi

– θ is empirically set to 0.85

• Prior probabilitiesLet P(<b,b’>) = r, then P(<b,b>) = P(<b’,b’>) = (1 - r)/2– r is the probability of observing a heterozygote– if there is a known SNP in the position (dbSNP or OMIM), r = 0.2.

Otherwise, r=0.001. One can also use site specific allele frequency.

MAQ (6)

• Then the genotype is selected by

with a quality ofwhere P(g|D) is the posterior probability of genotype g given the observation D.

MAQ (7)

• Rule based filtering– Discard SNPs within the 3-bp flanking region around

a potential indel; – Discard SNPs covered by three or fewer reads; – Discard SNPs covered by no read with a mapping

quality higher than 60; – In any 10-bp window, if there are three or more

SNPs, discard them all; and – Discard SNPs with consensus quality smaller than 10

MAQ (8)

• Validation– Simulation data on Human Chr X with 45X reads.• The reference genome is duplicated and a set of variant

positions are randomly picked based on a predetermined mutation rate of 0.001• For each position, 2/3 of the time it will be mutated

into a <b,b’> genotype and 1/3 of the time it will be mutated in to <b’,b’>.• Indels are created similarly.

MAQ (9)– On bacterial genome (S. paratyphi)• The dataset is single end reads on S. paratyphi strain

AKU12601 (haploid genome).• They checked the SNP calls against the strain’s

reference genome. They found 6 SNPs that looks highly convincing – 2 homozygotes variant – one confirmed using

capillary sequencing, one have 19 read cover, all having the variant allele)

– 4 heterozygotes variant-they suspected there exist a repeat region that is not included in current reference.

MAQ (10)– They use another strain’s genome (ATCC9150) as the

reference genome for False Negative estimation.– The SNPs and indels are determined from the

differences between the two genomes.– Recovered 175/211 SNPs. Indels are not identified

because of single end read data.

GATK

• Aaron McKenna, Matthew Hanna, Eric Banks• Genome Res. 2010.• Main selling points:– MapReduce framework: allow users to code own

functions. By following the specifications, the function will be parallelizable and also able to use shared memory for performance.

– Implemented using JAVA 1.6– Is a general / platform independent tool for processing

Next Generation Sequencing (NGS) data.

GATK (1)

• Implements the traversals in Samtools– TraverseLoci: each single base loci with its

associated read, reference ordered data and ref. base are presented

– Traverse reads: each reads is traversed– TraverseDuplicates: returns list of unique and

duplicate reads on the position– TraverseLocusWindow: just like traverseLoci for

intervals

GATK (2)

• Simple Bayesian Genotyper

G is the genotype, D is the observed data

• p(D) is constant while

b is the base observed on the reads

• No mention on the P(G) in the paper. These can be estimated from known SNP data (?)

GATK (3)

• Where the probability of seeing a base b given a diploid genotype G is the average of the probabilities of seeing b given the haploid genotype {A1,A2} constituting G i.e.

• These probabilities are computed based on the base calling quality e (converted back to probability value)

GATK (4)

• The model was validated using the chr1 data of one of the 1000 genome sequence (NA12878).

• Out of the 315,202 SNPs called (variant when compared to the ref. genome), 81.7% are listed in dbSNP (concordance 99.76%).

• The more rigorous implementation is done in the newest samtools paper (Li Heng, 2011, Bioinformatics).

SOAPsnp

• Ruiqiang, Li. Genome Research, 2009• BGI SNP caller• Same Bayesian based for genotype calling

• The priors for each genotype Ti ϵ {AA,AC, …,TT} is defined as follows

SOAPsnp (2)

I don’t understand this…

SOAPsnp (3)• Transitions (A<->C and G<->T) are four times

more likely as compared to transversions among substitution mutations. Hence, assuming the ref is G,

I don’t understand this either…

SOAPsnp (4)

• For a total of n observed alleles/reads on a locus

• where,

for a diploid genotype T = {Hm,Hn}

SOAPsnp (5)

• How to compute P(dk|H)?– This is the probability of observing an allele in the

set of reads covering the position with a Haploid genotype H

– They use the following equations

– where• ok = observed nucleotide• qk = the quality of the read• ck= the position on the read (computed from 5’)

SOAPsnp (6)

SOAPsnp (7)

• They assume that P(qk|H) are independent from the underlying genotype H

• That is, it is purely a function of qk

• When there are t reads mapping to the same genomic location, they are ordered by their base quality on the locus in question.

• The tth observation will have its quality score recalibrated by

SOAPsnp (8)

• “Real” reads mapping locations should follow a Poisson distribution

• The penalty for duplicate reads mapping to the same location is set such that the “sum” of the read’s score contribution mimics the contribution of reads whose mapping location follows such distribution

SOAPsnp (9)

SOAPsnp (10)

• For predicted heterozygous sites, the genotype probability is corrected by a ranksum p-value on the base qualities of the two most frequent alleles.

• Suppose the predicted genotype is AG where the reads with A have base call qualities {70,60,60,50,30,20,10,10} and reads with G have {50,20,10,10}.

• If the distributions of base qualities are deemed to be significantly different, this means that one of them has worse overall qualities than the other.

• This may arise from errors instead of real heterozygosity

SOAPsnp (11)

• They did a pretty impressive analysis to further improve their results. They use two tests– Simulation: • Use Chromosome 12 as template, • Generate SNPs and indels with a rate of 0.001• Generate single end and paired end reads, 36X• Base qualities from real sequencing are selected at

random and read errors are introduced correspondingly.– Real Genome of an unknown Han Chinese. • They use microarray to validate 1,038,923 SNPs as

references (ground-truth).

SOAPsnp (12)

• They find that 95% SNP arising from read errors occurs only in one read (simulation).– They set a minimum read support >= 4

• When the error has also high frequency, its frequency discrepancy (w.r.t the most frequent allele) must also have P<=0.0001 according to the binomial distribution (what about the ranksum?)

• They also found that indels may generate a small amount of SNP call error (0.03%, simulation)

SOAPsnp (13)

• Rule based filtering– We required at least two reads for haploid chromosome X/Y

and four reads for diploid autosomes. The reads need to have base quality scores of 20 (Q20) at the position.

– The overall depth, including randomly placed repetitive hits, had to be less than 100.

– The approximate copy number of flanking sequences had to be less than two. This was done in order to avoid misreading SNPs as heterozygotes caused by the alignment of similar reads from repeat units or by copy number variations (CNVs).

– There had to be at least one paired end read. – The SNPs had to be at least 5 bp away from each other.

SOAPsnp (14)

• For positions with known SNP (listed in dbSNP) they use a prior SNP rate of 0.1 (compared to 0.2 in MAQ).

• For regions with low depth, they use Q10 filtering instead of Q20.

• These steps are shown to improve coverage on the genotyped ground truth SNPs.

VarScan

• Koboldt et al. Bioinformatics App Note, 2009• Rule based filtering

– For each predicted variant, VarScan determines the overall coverage, as well as the number of supporting reads, average base quality, and number of strands observed for each allele.

– Thresholds for coverage, quality, variant frequency, and/or number of reads required to call a variant are set automatically with the easyrun command, but can be manually adjusted by the user.

• Selling point: can use output from different sequencing technology (454, Solid, Illumina) and different mapper (BLAT, Newbler, BOWTIE, cross_match and Novoalign).

SNVMix

• Goya et al. Bioinformatics 2010.• Shah SP group, Univ. British Columbia.• This is the first program that takes into

account both the Tumor and Normal reads simultaneously

• Selling point: remove hard thresholds that they argue will be different for Tumor and Normals. Instead, learn it using EM.

SNVMix (1)

• The EM details is NOT going to be covered here• They showed that (using simulation) the model

performs equally well for datasets where the mean certainty of the base calls is 80% and ∼higher.

• This suggests that thresholding base calls at Phred Q20 [99% certainty (Morin et al., 2008)] or Q30 [99.9% certainty (Ley et al., 2008)] may be overly stringent.

SNVMix (2)• Validation

– We evaluate the model based on real data derived from 16 ovarian cancer transcriptomes sequenced using NGS (RNAseq), and a lobular breast cancer genome sequenced to >40x coverage (Shah et al., 2009b).

– For all cases, we obtained high density genotyping array data for orthogonal comparisons.

– Finally, we demonstrate results on 497 positions from the breast cancer genome that were subjected to Sanger sequencing and thus constitute a ‘ground truth’ dataset for benchmarking.

– The 497 positions are predictions of SNVMix1 (a more basic model) that are resequenced using Sanger sequencing. 305 is confirmed as SNPs while 192 is not.

SNVMix (3)

• The RNAseq data on ovarian cancer is used for cross validation.

• The result of the CV indicated that sequencing depth thresholding reduces the AUC score – the best performance comes from the model without the thresholding

• Overfitting?

SNVMix (3)

• They further obtained 14649 SNP positions on the breast cancer genome using SNP array

• They tested the model on the 497 Sanger-sequenced SNPs.

MutationSeq

• Ding et al, Bioinformatics 2012.• Shah SP group, Univ. British Columbia.• Selling points: machine learning on Samtools

and GATK features is better than hard thresholding

• They (rather, the machine learning) found that thresholds for Normal and Tumor genomes are different (sound familiar? SNVMix?)

MutationSeq (1)• They use 106 features, 20 from Samtools, 20 from GATK

and 26 of their own features.• They use:

– Random forest, Bayesian additive regression tree, SVM and LI regularized Logistic Regression

• Datasets:– Exome capture of 48 triple negative breast cancer, 76 bp pair

end reads– 3369 are selected using hard thresholds and resequenced up to

6000X with 1015 somatic, 471 germline and 1883 false positives.– Another set of SOLiD 25-50bp pair end reads are prepared on

the whole genome. Using orthogonal exome capture, they get 113 somatic, 57 germline and 337 FP sites.

MutationSeq (2)

• Best Classifier = BART & SVM

MutationSeq (3)

• Using the features, they identify the source of FPs– Low mapping quality on Normal’s variant while

High mapping quality for tumor – Strand bias by PCR– GGT->GGG misreading

snp calling from next generation sequencing data

Documents