SNP Scores

Upload: samuel-phillips

Post on 19-Jan-2018


Page 1: SNP Scores. Overall Score Coverage Score * 4 optional scores ▫Read Balance Score  = 1 if reads are balanced in each direction ▫Allele Balance Score

SNP Scores

Overall Score

• Coverage Score * 4 optional scores
▫ Read Balance Score = 1 if reads are balanced in each direction
▫ Allele Balance Score = 1 if the SNP count is balanced in relation to the read count in each direction
▫ Homopolymer Score = 1 if the SNP is not an indel in a homopolymer
▫ Mismatch Score = 1 if there are fewer than 3 SNPs present within 10 bp on either side of the SNP that occur in a minimum number of reads
• Maximum score = 30*1*1*1*1 = 30
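The product above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical and do not come from the program being described.

```python
def overall_score(coverage, read_balance=1.0, allele_balance=1.0,
                  homopolymer=1.0, mismatch=1.0):
    """Multiply the coverage score by the four optional scores.

    Each optional score defaults to 1, which means "no penalty".
    All names here are illustrative, not the program's own.
    """
    return coverage * read_balance * allele_balance * homopolymer * mismatch


# Maximum possible score: 30 * 1 * 1 * 1 * 1
print(overall_score(30))

# A coverage score of 20 with two optional scores of 0.5 each
print(overall_score(20, read_balance=0.5, mismatch=0.5))
```

Because the scores are multiplied, any single optional score below 1 lowers the overall score, and a score of 1 leaves it unchanged.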

Interpreting the Score

• Scores are an empirical estimation of how likely it is that a given SNP is real and not an artifact of sequencing or alignment
• The score is based on Phred scores
▫ 30 = 1 in 1000 are not real
▫ 20 = 1 in 100 are not real
▫ 10 = 1 in 10 are not real
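The three examples follow the standard Phred relationship, in which a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch, with hypothetical function names:

```python
def phred_to_error_probability(score):
    """Probability that a call with this Phred-style score is not real."""
    return 10 ** (-score / 10)


def one_in_x(score):
    """Express the same probability as '1 in X are not real'."""
    return 10 ** (score / 10)


for score in (30, 20, 10):
    print(score, phred_to_error_probability(score), one_in_x(score))
```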

Interpreting the Score

• A low score does not mean the mutation is more likely to be false; it only means the mutation cannot be confidently called as a true mutation.
• Even real SNPs will have low scores if the coverage is low.

Coverage   Max Score (100% SNP)   Max Score (50% SNP)
1          5.83                   -
2          7.85                   2.18
3          10.01                  2.79
4          12.21                  3.50
5          14.37                  4.30
6          16.43                  5.17
7          18.32                  6.11
8          20.03                  7.11
9          21.55                  8.15
10         22.89                  9.23

Optional Scores

• The optional scores can be ignored (set equal to 1) in the final score calculation by adjusting the settings for the mutation report. The homopolymer score is always ignored unless the data is Roche data.

Optional Score

• You may want to ignore certain optional scores depending on your data
• For example: if your data is all (or nearly all) one-directional, you can choose to ignore the Read Balance score because even real SNPs will not be balanced
• Homopolymer scores are automatically ignored unless Roche data is being analyzed
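Ignoring an optional score amounts to replacing it with 1 before the scores are multiplied. The sketch below illustrates that idea only; the setting names, the dictionary layout, and the `is_roche_data` flag are assumptions and do not reflect the program's actual settings.

```python
def apply_settings(optional_scores, ignored, is_roche_data):
    """Return the optional scores with every ignored score set equal to 1.

    optional_scores: dict mapping a score name to its value (hypothetical names)
    ignored: names the user chose to ignore in the report settings
    is_roche_data: the homopolymer score is only kept for Roche data
    """
    ignored = set(ignored)
    if not is_roche_data:
        ignored.add("homopolymer")
    return {name: 1.0 if name in ignored else value
            for name, value in optional_scores.items()}


scores = {"read_balance": 0.4, "allele_balance": 0.9,
          "homopolymer": 0.5, "mismatch": 1.0}

# One-directional, non-Roche data: ignore the Read Balance score
adjusted = apply_settings(scores, ignored={"read_balance"}, is_roche_data=False)
print(adjusted)
```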

Coverage Score

• If the SNP count is greater than 50 (example: 50% SNP at 100 coverage) then the score is 30.
• Otherwise the score is calculated according to this formula:

$\mathrm{Score} = 30\, e^{(4\log_{10}(\%) - 2)\, e^{-0.2\,\#}}$

% = SNP Allele Percentage (expressed as a fraction, so 100% = 1)
# = SNP Count
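A sketch of the coverage score follows. It assumes the formula reads Score = 30·e^((4·log10(%) − 2)·e^(−0.2·#)), with % taken as a fraction (100% = 1). That reading reproduces the maximum scores tabulated earlier, for example 5.83 at coverage 1 and 22.89 at coverage 10 for a 100% SNP. The function name is hypothetical, and treating the SNP count at 50% as half the coverage (including non-integer values) is an assumption.

```python
import math


def coverage_score(snp_fraction, snp_count):
    """Coverage score under the assumed reading of the slide formula.

    snp_fraction: SNP allele percentage as a fraction (1.0 means 100%)
    snp_count: number of reads carrying the SNP
    """
    if snp_count > 50:
        return 30.0
    b = 2 - 4 * math.log10(snp_fraction)
    return 30 * math.exp(-b * math.exp(-0.2 * snp_count))


# Maximum scores for coverage 1..10 at 100% and 50% SNP
for coverage in range(1, 11):
    full = coverage_score(1.0, coverage)
    half = coverage_score(0.5, coverage * 0.5)
    print(coverage, round(full, 2), round(half, 2))
```

The 50% value printed for coverage 1 has no counterpart in the table, since half a read cannot carry the SNP.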

Coverage Score

• This score is based on the Gompertz function where a, b, and c have been adjusted to achieve the desired distribution.
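For reference, the Gompertz function is usually written in the form below. If the coverage formula is read as Score = 30·e^((4·log10(%) − 2)·e^(−0.2·#)), with % as a fraction, the parameters would map as shown. This mapping is an interpretation of the slides, not something they state.

```latex
f(t) = a\, e^{-b\, e^{-c t}},
\qquad a = 30,\quad b = 2 - 4\log_{10}(\%),\quad c = 0.2,\quad t = \#
```

Under this reading, a sets the maximum score of 30, b grows as the SNP percentage falls, and c controls how quickly the score rises with the SNP count.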

Coverage Score Distribution

• Different SNP Percentages vs Read Coverage:

[Figure: Score (10, 20, 30, with the equivalent "1 in X SNPs are False Positives" scale from 1 to 1000) plotted against Coverage (log scale, 1 to 10000), one curve for each SNP percentage: 1%, 5%, 10%, 20%, 50%, 75%, 100%]

Higher % SNPs are more reliable at low coverage

Coverage Score Distribution

• Different Levels of Coverage vs SNP %

[Figure: Score (0 to 35) plotted against SNP Percentage (0 to 100), one curve for each coverage level: 5, 20, 50, 100, 200, 500]

Low coverage will limit the score even if a SNP occurs in a high percentage of reads

Read Balance Score

• If the number of forward and reverse reads is within 1, then the score is 1.
• If not, the score is calculated according to this formula:

$\mathrm{Score} = 1 - \dfrac{\left|\#F/C - 0.5\right|}{\log_{10}(C)}$

#F = number of forward reads
C = Coverage
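The sketch below assumes the formula reads Score = 1 − |#F/C − 0.5| / log10(C). The operators are not legible in the transcript, so this is one plausible reading rather than a confirmed one. It also assumes that coverage equals forward plus reverse reads; the function name is hypothetical.

```python
import math


def read_balance_score(forward_reads, reverse_reads):
    """Read balance score under an assumed reading of the slide formula."""
    # Within one read of perfectly balanced: no penalty
    if abs(forward_reads - reverse_reads) <= 1:
        return 1.0
    coverage = forward_reads + reverse_reads  # assumption: C = #F + #R
    imbalance = abs(forward_reads / coverage - 0.5)
    return 1 - imbalance / math.log10(coverage)


print(read_balance_score(50, 50))    # balanced reads
print(read_balance_score(100, 0))    # all reads in one direction, coverage 100
print(read_balance_score(10, 0))     # all reads in one direction, coverage 10
```

Under this reading the same imbalance is penalized more heavily at low coverage, because the penalty is divided by log10(C).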

Read Balance Score

• When sequence data has reads present in both directions it is more reliable because the base quality is averaged out between the high quality 5’ end and the low quality 3’ end.
• A score of 1 means there is no penalty. A score below 1 reduces the score from the Coverage Score.

Read Balance Score Distribution

[Figure, first panel: Levels of Coverage vs Percent of Reads in the Forward Direction. Score (0 to 1) plotted against % Reads in Forward Direction (0 to 100), one curve per coverage level]

[Figure, second panel: Percent of Reads in One Direction vs Coverage. Score plotted against Coverage (log scale, 10 to 10000), one curve for each % in One Direction: 10, 25, 40, 50]

Lower coverage results in a higher penalty because the balance is more likely to be random

Allele Balance

• The Allele Balance score penalizes SNPs that occur at different frequencies in the forward and reverse directions because they are more likely to be sequencing or alignment errors.
• The score is based on Yates's chi-square test, which is less likely than the normal chi-square test to reject the null hypothesis due to a lack of data (low coverage in this case).
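For reference, Yates's continuity-corrected chi-square statistic for a 2x2 table is usually written as below. The mapping of a, b, c, d onto read counts is an assumption made here for illustration; the slides do not spell it out.

```latex
\chi^2_{\text{Yates}} = \frac{N\,\left(|ad - bc| - N/2\right)^2}{(a+b)(c+d)(a+c)(b+d)},
\qquad N = a + b + c + d
```

Here a and b would be the forward SNP and non-SNP read counts, c and d the reverse SNP and non-SNP read counts, and N the coverage. The N/2 correction shrinks the statistic, which is what makes the test more conservative when counts are small.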

Allele Balance

• First a variable is calculated:
▫ W = |(#F SNP)*(#R non-SNP) - (#R SNP)*(#F non-SNP)| - C/2
• If this variable is negative then the score is 1.
• Otherwise, the score is calculated according to the equation:

$\mathrm{Score} = 1 - \dfrac{W^{2}}{(\#F)\,(\#R)\,(\#\mathrm{SNP})\,(\#\text{non-SNP})}$

#F = number of forward reads
#R = number of reverse reads
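A sketch of the two steps above, assuming the score equation reads Score = 1 − W² / (#F · #R · #SNP · #non-SNP). It further assumes that #SNP and #non-SNP are the total SNP and non-SNP read counts over both directions and that C is the total coverage. The function name and the zero-denominator guard are additions for illustration.

```python
def allele_balance_score(f_snp, f_non_snp, r_snp, r_non_snp):
    """Allele balance score under an assumed reading of the slide formula."""
    coverage = f_snp + f_non_snp + r_snp + r_non_snp
    w = abs(f_snp * r_non_snp - r_snp * f_non_snp) - coverage / 2
    if w < 0:
        return 1.0  # no penalty when W is negative

    forward = f_snp + f_non_snp
    reverse = r_snp + r_non_snp
    snp = f_snp + r_snp
    non_snp = f_non_snp + r_non_snp
    denominator = forward * reverse * snp * non_snp
    if denominator == 0:
        return 1.0  # assumption: nothing to compare, so no penalty
    return 1 - w ** 2 / denominator


# 100 reads in each direction, 50% SNPs
print(allele_balance_score(50, 50, 50, 50))    # SNPs evenly split
print(allele_balance_score(60, 40, 40, 60))    # mild imbalance
print(allele_balance_score(100, 0, 0, 100))    # all SNPs on forward reads
```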

Allele Balance Distribution

[Figure, three panels of Score (0 to 1):]
▫ Vary Imbalance: Score vs Number of Forward SNPs (100 reads in each direction, 50% SNPs)
▫ Vary Coverage: Score vs Coverage (balanced reads, 2:1 SNP balance, 30% SNPs)
▫ Vary SNP Percentage: Score vs percent of reads with a SNP allele (300 reads in each direction, 2:1 SNP balance)

Homopolymer Score

• The homopolymer score penalizes indels in homopolymer regions when analyzing Roche pyrosequencing data because they are usually sequencing errors.
• The penalty is higher for longer homopolymer regions because an error is more likely.

Homopolymer Score

• The program first determines which length of homopolymer region is present more often (A) and less often (B)
• If A or B is not ≥ 3 then the score is 1
• Otherwise the score is calculated according to a formula in A and B [formula not legible in this transcript]

Example: A deletion from 4 bases to 3 bases that occurs less than half of the time: A = 4, B = 3, Score = 0.5

Mismatch Score

• Several SNPs occurring very close together are usually the result of an alignment error. This score penalizes a SNP if there are other SNPs nearby.
• The program first looks for SNPs that occur in a minimum percentage of reads in the 10 bp on either side of the SNP being scored. The number of SNPs is used to calculate the score.
• If the number of nearby SNPs is less than 3 there is no penalty.

Mismatch Score Distribution

• After the number of nearby SNPs is determined the score is calculated according to the formula:

$\mathrm{Score} = \dfrac{0.95^{N^{2}}}{0.8145}$

N = number of nearby SNPs

• This results in the following distribution:

[Figure: Score (0 to 1) plotted against Number of Nearby SNPs (0 to 11)]
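A sketch of the mismatch score, assuming the formula reads Score = 0.95^(N²) / 0.8145. Since 0.95^4 ≈ 0.8145, this expression equals 1 at N = 2, which fits the rule that fewer than 3 nearby SNPs carry no penalty. The function name is hypothetical.

```python
def mismatch_score(nearby_snps):
    """Mismatch score under an assumed reading of the slide formula."""
    if nearby_snps < 3:
        return 1.0  # fewer than 3 nearby SNPs: no penalty
    return 0.95 ** (nearby_snps ** 2) / 0.8145


for n in range(0, 12):
    print(n, round(mismatch_score(n), 4))
```

The score falls quickly once N reaches 3 because N is squared in the exponent.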