SNP Scores
Overall Score
• Coverage Score * 4 optional scores
▫ Read Balance Score = 1 if reads are balanced in each direction
▫ Allele Balance Score = 1 if SNP count is balanced in relation to the read count in each direction
▫ Homopolymer Score = 1 if the SNP is not an indel in a homopolymer
▫ Mismatch Score = 1 if there are fewer than 3 SNPs present within 10 bp on either side of the SNP that occur in a minimum number of reads
• Maximum score = 30*1*1*1*1 = 30
Interpreting the Score
• Scores are an empirical estimation of how likely it is that a given SNP is real and not an artifact of sequencing or alignment.
• The score is based on Phred scores
▫ 30 = 1 in 1000 are not real
▫ 20 = 1 in 100 are not real
▫ 10 = 1 in 10 are not real
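The three Phred values listed above follow the standard Phred relationship P = 10^(-Q/10), where Q is the score and P is the fraction of calls expected not to be real. A minimal sketch of that conversion; the function names are illustrative and not taken from the software being described:

```python
def phred_to_false_fraction(score):
    """Fraction of SNPs at this score expected not to be real (standard Phred scaling)."""
    return 10 ** (-score / 10)


def phred_to_one_in(score):
    """Express the same value as '1 in X are not real'."""
    return round(10 ** (score / 10))


for q in (30, 20, 10):
    print(f"score {q}: 1 in {phred_to_one_in(q)} are not real")
```

Intermediate scores interpolate on the same scale, so a score of 15 sits between the 1 in 10 and 1 in 100 levels.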
Interpreting the Score
• A low score does not mean the mutation is more likely to be false; it only means the mutation cannot be confidently called as a true mutation.
• Even real SNPs will have low scores if the coverage is low.

Coverage   Max Score (100% SNP)   Max Score (50% SNP)
1          5.83                   -
2          7.85                   2.18
3          10.01                  2.79
4          12.21                  3.50
5          14.37                  4.30
6          16.43                  5.17
7          18.32                  6.11
8          20.03                  7.11
9          21.55                  8.15
10         22.89                  9.23
Optional Scores
• The optional scores can be ignored (set equal to 1) in the final score calculation by adjusting the settings for the mutation report. The homopolymer score is always ignored unless the data is Roche data.
Optional Scores
• You may want to ignore certain optional scores depending on your data.
• For example: if your data is all (or nearly all) one-directional, you can choose to ignore the Read Balance score, because even real SNPs will not be balanced.
• Homopolymer scores are automatically ignored unless Roche data is being analyzed.
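The way the overall score combines the coverage score with the optional scores, and what "ignoring" an optional score means, can be sketched as follows. This is only an illustration: the function name, parameter names, and the `ignore` argument are hypothetical, not the software's actual settings.

```python
def overall_score(coverage, read_balance=1.0, allele_balance=1.0,
                  homopolymer=1.0, mismatch=1.0, ignore=()):
    """Multiply the coverage score by each optional score that is not ignored.

    Ignoring an optional score is the same as setting it equal to 1.
    """
    optional = {
        "read_balance": read_balance,
        "allele_balance": allele_balance,
        "homopolymer": homopolymer,
        "mismatch": mismatch,
    }
    score = coverage
    for name, value in optional.items():
        if name not in ignore:
            score *= value
    return score


# One-directional data: the read balance penalty is dropped.
print(overall_score(30.0, read_balance=0.5, ignore=("read_balance",)))
```

For non-Roche data, `"homopolymer"` would always be in the ignored set, matching the behavior described above.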
Coverage Score
• If the SNP count is greater than 50 (example: 50% SNP at 100 coverage) then the score is 30.
• Otherwise the score is calculated according to this formula:
$$\text{Score} = 30 \cdot e^{-\left(2 - 4\log_{10}(\%)\right)\, e^{-0.2\,\#}}$$
% = SNP Allele Percentage (as a fraction, so 100% = 1)
# = SNP Count
Coverage Score
• This score is based on the Gompertz function, $a\,e^{-b\,e^{-cx}}$, where a, b, and c have been adjusted to achieve the desired distribution.
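As a check on the coverage score, the sketch below assumes the Gompertz form Score = 30 · exp(−(2 − 4·log10(p)) · exp(−0.2·n)), with p the SNP allele percentage as a fraction and n the SNP count. That reading is a reconstruction, but it reproduces the tabulated maximum scores for 100% and 50% SNPs. The function name is illustrative.

```python
import math

MAX_SCORE = 30.0


def coverage_score(snp_count, snp_fraction):
    """Coverage score under the assumed Gompertz form (a=30, c=0.2, b depends on SNP %)."""
    if snp_count > 50:
        return MAX_SCORE
    b = 2.0 - 4.0 * math.log10(snp_fraction)
    return MAX_SCORE * math.exp(-b * math.exp(-0.2 * snp_count))


# Reproduce the table: SNP count = coverage * SNP fraction.
for coverage in range(1, 11):
    full = coverage_score(coverage * 1.0, 1.0)
    half = coverage_score(coverage * 0.5, 0.5)
    print(f"{coverage:>2}  {full:6.2f}  {half:6.2f}")
```

The 50% column has no entry at coverage 1 in the table, presumably because half a read is not a meaningful SNP count; the loop prints a value there only because it does not special-case it.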
Coverage Score Distribution
• Different SNP Percentages vs Read Coverage:
[Figure: Score (10, 20, 30), also shown as "1 in X SNPs are False Positives" on a log scale, vs Coverage (log scale, 1 to 10000). One curve per SNP percentage: 1%, 5%, 10%, 20%, 50%, 75%, 100%.]
Higher % SNPs are more reliable at low coverage
Coverage Score Distribution
• Different Levels of Coverage vs SNP %
[Figure: Score (0 to 35) vs SNP Percentage (0 to 100). One curve per coverage level: 5, 20, 50, 100, 200, 500.]
Low coverage will limit the score even if a SNP occurs in a high percentage of reads
Read Balance Score
• If the number of forward and reverse reads is within 1, then the score is 1.
• If not, the score is calculated according to a formula in #F, C, 0.5, and log10(C) [formula not legible in this transcript].
#F = number of forward reads
C = Coverage
Read Balance Score
• When sequence data has reads present in both directions it is more reliable, because the base quality is averaged out between the high quality 5’ end and the low quality 3’ end.
• A score of 1 means there is no penalty. A score below 1 reduces the score from the Coverage Score.
Read Balance Score Distribution
[Figure, panel 1: Levels of Coverage vs Percent of Reads in the Forward Direction. Score (0 to 1) vs % Reads in Forward Direction (0 to 100), one curve per coverage level.]
[Figure, panel 2: Percent of Reads in One Direction vs Coverage. Score vs Coverage (log scale), one curve per % in one direction: 10, 25, 40, 50.]
Lower coverage results in a higher penalty because the balance is more likely to be random
Allele Balance
• The Allele Balance score penalizes SNPs that occur at different frequencies in the forward and reverse directions, because they are more likely to be sequencing or alignment errors.
• The score is based on Yates's chi-square test, which is less likely than normal chi-square tests to reject the null hypothesis due to a lack of data (low coverage in this case).
Allele Balance
• First a variable is calculated:
▫ W = |(#F SNP)*(#R non-SNP) – (#R SNP)*(#F non-SNP)| – C/2
• If this variable is negative then the score is 1.
• Otherwise, the score is calculated according to the equation:
$$\text{Score} = 1 - \frac{W^2}{(\#F)\,(\#R)\,(\#\text{SNP})\,(\#\text{non-SNP})}$$
#F = number of forward reads
#R = number of reverse reads
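The allele balance calculation can be sketched from the four strand-by-allele read counts. W follows the definition given above. The final score expression, 1 − W² / (#F · #R · #SNP · #non-SNP), is an assumption reconstructed from the garbled equation in this transcript, and should be checked against the original slide. The function and argument names are illustrative.

```python
def allele_balance_score(f_snp, f_non, r_snp, r_non):
    """Allele balance from forward/reverse SNP and non-SNP read counts.

    ASSUMPTION: score = 1 - W^2 / (#F * #R * #SNP * #non-SNP), a reconstruction.
    """
    coverage = f_snp + f_non + r_snp + r_non
    # Yates-corrected cross-product difference, as defined in the text.
    w = abs(f_snp * r_non - r_snp * f_non) - coverage / 2
    forward = f_snp + f_non
    reverse = r_snp + r_non
    snp = f_snp + r_snp
    non_snp = f_non + r_non
    denominator = forward * reverse * snp * non_snp
    if w < 0 or denominator == 0:
        return 1.0  # no penalty
    return 1.0 - (w * w) / denominator


# 100 reads in each direction, 50% SNPs overall.
print(allele_balance_score(50, 50, 50, 50))   # SNPs evenly split between strands
print(allele_balance_score(100, 0, 0, 100))   # all SNPs on the forward strand
```

Under this reconstruction the score stays between 0 and 1, and the −C/2 correction makes small imbalances at low coverage return 1 rather than a penalty.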
Allele Balance Distribution
[Figure, Vary Imbalance: Score vs Number of Forward SNPs. 100 reads in each direction, 50% SNPs.]
[Figure, Vary Coverage: Score vs Coverage. Balanced reads, 2:1 SNP Balance, 30% SNPs.]
[Figure, Vary SNP Percentage: Score vs percent of reads with a SNP allele. 300 reads in each direction, 2:1 SNP Balance.]
Homopolymer Score
• The homopolymer score penalizes indels in homopolymer regions when analyzing Roche pyrosequencing data, because they are usually sequencing errors.
• The penalty is higher for longer homopolymer regions because errors are more likely there.
Homopolymer Score
• The program first determines which length of homopolymer region is present more often (A) and less often (B).
• If A or B is not ≥ 3 then the score is 1.
• Otherwise the score is calculated according to a formula in A and B [formula not legible in this transcript].
Example: A deletion from 4 bases to 3 bases that occurs less than half of the time: A = 4, B = 3, Score = 0.5
Mismatch Score
• Several SNPs occurring very close together are usually the result of an alignment error. This score penalizes a SNP if there are other SNPs nearby.
• The program first looks for SNPs that occur in a minimum percentage of reads in the 10 bp on either side of the SNP being scored. The number of SNPs is used to calculate the score.
• If the number of nearby SNPs is less than 3 there is no penalty.
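The counting step described above can be sketched as a simple window scan. The minimum read percentage is not stated in the slides, so `min_read_fraction` is a hypothetical parameter the caller must supply. The data layout (a list of position and read-fraction pairs) and the function names are also assumptions.

```python
def count_nearby_snps(target_pos, candidates, min_read_fraction, window=10):
    """Count other SNPs within `window` bp on either side of `target_pos`.

    `candidates` is an iterable of (position, fraction_of_reads) pairs.
    Only SNPs seen in at least `min_read_fraction` of reads are counted.
    """
    return sum(
        1
        for pos, fraction in candidates
        if pos != target_pos
        and abs(pos - target_pos) <= window
        and fraction >= min_read_fraction
    )


def mismatch_penalty_applies(n_nearby):
    """Fewer than 3 nearby SNPs means no penalty (score = 1)."""
    return n_nearby >= 3


snps = [(95, 0.4), (100, 0.5), (103, 0.3), (108, 0.05), (111, 0.5)]
n = count_nearby_snps(100, snps, min_read_fraction=0.1)
print(n, mismatch_penalty_applies(n))
```

In this example the SNP at 108 is too rare and the one at 111 falls outside the 10 bp window, so only two neighbors count and no penalty applies.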
Mismatch Score Distribution
• After the number of nearby SNPs (N) is determined, the score is calculated according to a formula in N [formula not legible in this transcript].
• This results in the following distribution:
[Figure: Score (0 to 1.2) vs Number of Nearby SNPs (0 to 11).]