01/$&)234& - ut

39
!"#$% '""( )*+ ,-./01/$ )234 '/5#/67%6( 89":-$0 1%";

Upload: others

Post on 07-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 01/$&)234& - ut

!"#$%&'""(&

)*+&,-./01/$&)234&

'/5#/67%6(&89":-$0&1%";&

Page 2: 01/$&)234& - ut

<&6-=/&-6&>%(>&7-./$"(/&,?'&',@&$/8$-A#7%1%9%=B&

CDE,FGD&'@D<CDH&I&J$+&K%7>"/9&'6BA/$&

'="6L-$A&M6%./$;%=B&'7>--9&-L&K/A%7%6/&

Page 3: 01/$&)234& - ut

!"#$%"&'#($)%*+',-&N@/$;-6"9&F0%7;&@$-O9%6(P&K-6%=-$%6(&?/6-0/;Q&G$"6;7$%8=-0/;&"6A&-=>/$&F0/;&J#$%6(&R/"9=>B&"6A&J%;/";/&'="=/;S&&./0&%123456&-7895/&'="6L-$A&M6%./$;%=B&'7>--9&-L&K/A%7%6/&

•  '"0/&;"089/&;/5#/67/A&=T%7/&T%=>&;"0/&0/=>-A;&

•  I99#0%6"Q&:;<&2=>5/4?5&

•  H#6&3P&4+*UV&K&',@;&7"99/A&

•  H#6&)P&4+*)W&K&',@;&7"99/A&

•  XY&-L&=-="9&U&K&#6%5#/&',@;&A-6Z=&-./$9"8&A#/&=-&6-[7"99;&

•  FL&=>/&-./$9"88%6(&',@;Q&2+2WY&"$/&A%;7-$A"6=&

Page 4: 01/$&)234& - ut

!"#$%"&'#($)%*+',-&./0&%123456&-7895/&

Page 5: 01/$&)234& - ut

\-089/=/&?/6-0%7;&.;+&I99#0%6"&

@$/;/6="]-6&/08>";%;&-6&

-./$9"8&"6A&7-67-$A"67/&-L&',^;&&_;%6(9/&6#79/-]A/&."$%"6=;`&

Page 6: 01/$&)234& - ut

•  3&%6A%.%A#"9&;/5#/67/A&-6&I99#0%6"&"6A&\?&89":-$0&

•  )&J,<&;-#$7/;P&19--A&_0-6-6#79/"$&7/99;`&"6A&;"9%."&

•  I99#0%6"&R%'/5&)222P&323[18&8"%$/A[/6A&$/"A;&

•  \?P&4W[18&8"%$/A[/6A&$/"A;&

•  I99#0%6"&&

•  0"88/A&T%=>&ab<&

•  ',^;&7"99/A&T%=>&?<GC&

•  \?&0"88/A&"6A&7"99/A&T%=>&\?&8%8/9%6/&

!"0&/=&"9+&)23)&

Page 7: 01/$&)234& - ut

<./$"(/&7-./$"(/&cXd&

\?&9/;;&#6%L-$0&%6&7-./$"(/&

©20

12 N

atur

e A

mer

ica,

Inc.

All

righ

ts r

eser

ved.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 79

Extensive differences in variant callingWe sought to compare the sensitivity and accuracy of each platform for SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique SNVs were concordant—that is, either a homozygous or heterozygous SNV was detected at the same locus by the two platforms in at least one sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the other but not both, of which 345,100 were specific to Illumina (10.5% of the Illumina combined SNVs) and 99,578 were CG-specific (3.0% of the CG combined SNVs). Among the Illumina-specific SNVs, 67% were ‘no-calls’ (that is, not a reference or variant call), 11% were reference calls and 22% were other types of calls (that is, complex and substitution calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of no-calls in Illumina is likely because GATK does not make the complex and substitution calls as does the CG pipeline.

To assess the quality of the calls, we used four criteria: the transition/transversion ratio (ti/tv), quality scores, the heterozygous/homozygous call ratio and novel, platform-specific SNVs. The ti/tv ratio of 2.1 for SNVs in humans has been described in several previous studies, including the 1000 Genomes Project11. The ti/tv ratio for all of the SNVs detected in these genomes was 2.04, but in our data the ti/tv of SNVs concordant between the two platforms was 2.14. For all SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs specific to Illumina it was only 1.40. Similarly, for SNVs detected by CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the ti/tv of concordant SNVs was very close to that expected, whereas the platform-specific ti/tv was much lower, suggesting that the platform-specific calls were of lower accuracy. Inspection of the quality scores of the platform-specific SNVs showed that they were indeed lower than those for the concordant calls (Supplementary Fig. 1). Furthermore, the heterozygous/homozygous call ratio was 1.48 for the concordant calls, whereas the platform-specific ratios were indeed higher: 2.48 for Illumina-specific calls and 1.98 for CG-specific calls.

To examine the fraction of novel platform-specific SNVs, we noted that 3,160,905 (96.0%) of the concordant SNVs were present in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the CG-specific set were present in dbSNP131. Thus, the platform-specific call sets were enriched for novel SNVs, suggesting that they likely contain more errors. In addition, the overall genotype concordance rate (that is, the proportion of concordant calls having a consistent genotype—heterozygous or homozygous—across both platforms) for the concordant SNVs was 98.9%. The high genotype concordance rate and percentage of known SNVs indicate that the concordant SNVs were of high quality and accuracy.

To further assess the accuracy of the variant calling, we sought to vali-date our SNVs by using Omni Quad 1M Genotyping arrays, traditional Sanger sequencing and Agilent SureSelect target enrichment capture followed by sequencing on an Illumina HiSeq for both samples. Of the 260,112 heterozygous calls detected with the Omni array, 99.5% were present in the entire SNV data set, 99.34% were concordant calls and only 0.16% were platform-specific SNVs. This demonstrates that both plat-forms are sensitive to known SNVs and that few known single-nucleotide polymorphisms (SNPs) are detected by only one platform.

To directly determine accuracy, we sequenced randomly selected concordant and platform-specific regions for Sanger sequencing. We found that 20 of 20 concordant SNVs could be validated, whereas 2 of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific SNVs could be validated. This suggests CG has higher accuracy than Illumina and that almost all the concordant calls are correct.

To attempt to examine accuracy on a larger scale, we used Agilent SureSelect target enrichment capture technology to capture 33,084 (9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%) concordant SNVs for sequencing on an Illumina Hi-Seq instrument (Table 2). We found that the validation rate for the concordant SNVs was 92.7%, whereas the validation rate was 61.9% and 64.3% for the CG-specific and Illumina-specific SNVs. These results indicate that the platform-specific calls have a very high false-positive rate of at least 35%. We also found that 12.6–21.4% of the targeted SNVs were not called in the validation, possibly owing to nonunique regions that are difficult to map precisely. Because the capture validation was performed using Illumina DNA sequencing technology, it is diffi-cult to directly compare the Illumina versus CG SNV rates with this approach. Nonetheless, these overall results indicate that concordant SNVs have high accuracy and platform-specific SNVs have a high false-positive rate.

Association of genes with variant calling differencesTo better understand the platform-specific calls, we investigated the association of SNVs from each platform with different genomic elements. We annotated both the platform-specific SNVs and con-cordant SNVs with gene and repeat annotations using Annovar16. In general, we did not find a significant difference between the associa-tions of the platform-specific SNVs and the concordant SNVs with gene elements, such as exons and introns (Fig. 3a,b). For example, 1% and 32–38% of the platform-specific SNVs were associated with exonic and intronic regions, respectively, regardless of the platform. This correlates well with the portions of exons (~1.3%) and introns (~37%) in the whole human genome. Nonetheless, the CG-specific SNVs had a slightly stronger association (14%) with noncoding RNA than the Illumina-specific SNVs (12%) and concordant SNVs (11%). Overall, many platform-specific SNVs lie in RNA coding regions of the human genome, and thus deducing their accuracy is of high importance.

Per

cent

age

of g

enom

e

100a

100

9080

80

7060

60

5040

40Cumulative read depth

3020

20

100

0

CompleteGenomicsIllumina

CompleteGenomicsIllumina

Num

ber

of b

ases

(M

)

b100

100Read depth

150 200

9080706050

50

40302010

00

Figure 1 Genome coverage at different read depths. (a) Percentage of genome covered by different read depths in different platforms. (b) Histogram of genome coverage at different read depths.

Table 1 Whole-genome sequencing using CG and Illumina platformsCG Illumina

Sample Bases (Gb) Coverage (×) Bases (Gb) Coverage (×) Reads Mapped After duplicate removal

Blood 233.2 78 151.4 50 1,499,021,500 1,367,988,241 91% 1,233,937,084 82%Saliva 218.6 73 307.1 102 3,040,306,840 2,614,663,882 86% 2,354,594,740 77%Total 451.8 151 458.5 153 4,539,328,340 3,982,652,123 88% 3,588,531,824 79%

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 79

Extensive differences in variant callingWe sought to compare the sensitivity and accuracy of each platform for SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique SNVs were concordant—that is, either a homozygous or heterozygous SNV was detected at the same locus by the two platforms in at least one sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the other but not both, of which 345,100 were specific to Illumina (10.5% of the Illumina combined SNVs) and 99,578 were CG-specific (3.0% of the CG combined SNVs). Among the Illumina-specific SNVs, 67% were ‘no-calls’ (that is, not a reference or variant call), 11% were reference calls and 22% were other types of calls (that is, complex and substitution calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of no-calls in Illumina is likely because GATK does not make the complex and substitution calls as does the CG pipeline.

To assess the quality of the calls, we used four criteria: the transition/transversion ratio (ti/tv), quality scores, the heterozygous/homozygous call ratio and novel, platform-specific SNVs. The ti/tv ratio of 2.1 for SNVs in humans has been described in several previous studies, including the 1000 Genomes Project11. The ti/tv ratio for all of the SNVs detected in these genomes was 2.04, but in our data the ti/tv of SNVs concordant between the two platforms was 2.14. For all SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs specific to Illumina it was only 1.40. Similarly, for SNVs detected by CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the ti/tv of concordant SNVs was very close to that expected, whereas the platform-specific ti/tv was much lower, suggesting that the platform-specific calls were of lower accuracy. Inspection of the quality scores of the platform-specific SNVs showed that they were indeed lower than those for the concordant calls (Supplementary Fig. 1). Furthermore, the heterozygous/homozygous call ratio was 1.48 for the concordant calls, whereas the platform-specific ratios were indeed higher: 2.48 for Illumina-specific calls and 1.98 for CG-specific calls.

To examine the fraction of novel platform-specific SNVs, we noted that 3,160,905 (96.0%) of the concordant SNVs were present in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the CG-specific set were present in dbSNP131. Thus, the platform-specific call sets were enriched for novel SNVs, suggesting that they likely contain more errors. In addition, the overall genotype concordance rate (that is, the proportion of concordant calls having a consistent genotype—heterozygous or homozygous—across both platforms) for the concordant SNVs was 98.9%. The high genotype concordance rate and percentage of known SNVs indicate that the concordant SNVs were of high quality and accuracy.

To further assess the accuracy of the variant calling, we sought to vali-date our SNVs by using Omni Quad 1M Genotyping arrays, traditional Sanger sequencing and Agilent SureSelect target enrichment capture followed by sequencing on an Illumina HiSeq for both samples. Of the 260,112 heterozygous calls detected with the Omni array, 99.5% were present in the entire SNV data set, 99.34% were concordant calls and only 0.16% were platform-specific SNVs. This demonstrates that both plat-forms are sensitive to known SNVs and that few known single-nucleotide polymorphisms (SNPs) are detected by only one platform.

To directly determine accuracy, we sequenced randomly selected concordant and platform-specific regions for Sanger sequencing. We found that 20 of 20 concordant SNVs could be validated, whereas 2 of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific SNVs could be validated. This suggests CG has higher accuracy than Illumina and that almost all the concordant calls are correct.

To attempt to examine accuracy on a larger scale, we used Agilent SureSelect target enrichment capture technology to capture 33,084 (9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%) concordant SNVs for sequencing on an Illumina Hi-Seq instrument (Table 2). We found that the validation rate for the concordant SNVs was 92.7%, whereas the validation rate was 61.9% and 64.3% for the CG-specific and Illumina-specific SNVs. These results indicate that the platform-specific calls have a very high false-positive rate of at least 35%. We also found that 12.6–21.4% of the targeted SNVs were not called in the validation, possibly owing to nonunique regions that are difficult to map precisely. Because the capture validation was performed using Illumina DNA sequencing technology, it is diffi-cult to directly compare the Illumina versus CG SNV rates with this approach. Nonetheless, these overall results indicate that concordant SNVs have high accuracy and platform-specific SNVs have a high false-positive rate.

Association of genes with variant calling differencesTo better understand the platform-specific calls, we investigated the association of SNVs from each platform with different genomic elements. We annotated both the platform-specific SNVs and con-cordant SNVs with gene and repeat annotations using Annovar16. In general, we did not find a significant difference between the associa-tions of the platform-specific SNVs and the concordant SNVs with gene elements, such as exons and introns (Fig. 3a,b). For example, 1% and 32–38% of the platform-specific SNVs were associated with exonic and intronic regions, respectively, regardless of the platform. This correlates well with the portions of exons (~1.3%) and introns (~37%) in the whole human genome. Nonetheless, the CG-specific SNVs had a slightly stronger association (14%) with noncoding RNA than the Illumina-specific SNVs (12%) and concordant SNVs (11%). Overall, many platform-specific SNVs lie in RNA coding regions of the human genome, and thus deducing their accuracy is of high importance.

Pe

rce

nta

ge

of

ge

no

me

100a

100

9080

80

7060

60

5040

40Cumulative read depth

3020

20

100

0

CompleteGenomicsIllumina

CompleteGenomicsIllumina

Nu

mb

er

of

ba

ses

(M)

b100

100Read depth

150 200

9080706050

50

40302010

00

Figure 1 Genome coverage at different read depths. (a) Percentage of genome covered by different read depths in different platforms. (b) Histogram of genome coverage at different read depths.

Table 1 Whole-genome sequencing using CG and Illumina platformsCG Illumina

Sample Bases (Gb) Coverage (×) Bases (Gb) Coverage (×) Reads Mapped After duplicate removal

Blood 233.2 78 151.4 50 1,499,021,500 1,367,988,241 91% 1,233,937,084 82%Saliva 218.6 73 307.1 102 3,040,306,840 2,614,663,882 86% 2,354,594,740 77%Total 451.8 151 458.5 153 4,539,328,340 3,982,652,123 88% 3,588,531,824 79%

I!&

!"0&/=&"9+&)23)&

Page 8: 01/$&)234& - ut

•  J"="&L$-0&19--A&"6A&;"9%."&7-01%6/A&

•  ',^;&A/=/7=/A&L$-0&-69B&-6/&;-#$7/&A%;7"$A/A&

•  F69B&L/T&-L&=>/&];;#/[;8/7%O7&7"99;&7-#9A&1/&."9%A"=/A&1B&%6A/8/6A/6=&0/=>-A;&

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

80 VOLUME 30 NUMBER 1 JANUARY 2012 NATURE BIOTECHNOLOGY

To further ascertain whether the platform-specific SNVs might be located in functionally important regions, we examined whether the variant calls were present in the Varimed data-base2,17, which contains variants catalogued through genome-wide association studies and other genetic linkage studies. We found that 31 Illumina- and 3 CG-specific SNVs were present in Varimed, from which we were able to estimate associations between diseases and platform-specific SNPs (Supplementary Table 1). One of these, rs2672598, was called in both PBMCs and saliva by the Illumina platform, but not called in either PBMCs or saliva by the CG platform. This SNP is at the 5 end of HTRA1 and known to increase the risk of age-related macular degeneration by 4.89-fold (P = 3.39 × 10−11)18,19. Another example is the A202T allele in the TERT gene encoding telomerase. This allele has been associated with aplastic ane-mia20 and was only detected by the Illumina platform. Thus, some platform-specific calls are of high importance.

Association of repetitive regions with variant calling differencesIn contrast to coding SNVs, we found that overall the platform- specific SNVs had a substantially stronger association with repeti-tive elements such as Alu, telomere and simple repeat sequences (Fig. 3c,d). For example, only 0.3% of the concordant SNVs were associated with telomere or centromere sequences, but 4% and 2% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with telomeric or centromeric repeats (Fig. 3c,e). The enrichment of platform-specific SNVs with simple repeats and low-complexity repeats was particularly evident. We found that <1% of the concordant SNVs were associated with simple repeats, but 8% and 15% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with these sequences. Among the platform-specific SNVs, CG had a stronger association with the Alu element and centromere and telomere sequences, whereas Illumina

had a stronger association with L1, simple repeat and low-complexity repeat. Overall, these results indicate that many platform-specific SNVs lie in repetitive regions, suggesting that these calls may be due to mapping difficulties and errors.

We also measured GC content and read depth of the SNVs in the gene and repeat regions. The average GC content of the concordant, CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41, respectively. The average read depths were 48, 47 and 44, respectively. Thus, the Illumina-specific SNVs showed a lower GC content and read depth compared to the concordant SNVs. Analysis by gene and repeat regions did not reveal any strong correlation with GC content. However, we found that Illumina-specific SNVs had a strikingly higher read depth in centromeric and telomeric regions, whereas CG had higher read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Differences in indel callsWe also examined small indel calls from Illumina and CG platforms. Small indels ranged in size from −107 to +36 bp by Illumina and −190 to +48 bp by CG. Illumina calls were made using GATK with the

Dindel model21, and CG calls were obtained from their standard pipeline and converted to VCF format22 using the CG conversion tool. A stringent quality score cutoff of 30 was used for each platform. This resulted in a total of 811,903 indel calls with 611,110 for Illumina and 430,258 for CG. We found that only 215,382 (26.5%) indels were detected by both Illumina and CG, whereas

Complete Genomics specific99,578

Illumina specific345,100

CG no-call230,119; 67%

CGSub & other77,196; 22%

CG ref.37,785; 11%

IL no-call74,556; 75%

IL ref. 25,022;25%

CG+IL

Illumina

Union

Blood

3,570,658 3,528,194

Merge

Saliva

Complete Genomics

Blood

Merge

Union2.7% 9.2%

3,295,023Concordant SNPs

88.1%

Sensitivity: 99.34%

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Sensitivity

Concordant

Ti/Tv

Known

Novel

Sanger

Validated

3,394,601

2.13

99,578 (3.0%)

1.68

72,735 (73.0%)

26,843 (27.0%)

94.4% (17/18)

61.9%

3,640,123

2.05

345,100 (10.5%)

1.40

260,108 (75.4%)

84,992 (24.6%)

13.3% (2/15)

64.3%

3,739,701

2.04

99.5%

3,295,023

2.14

3,160,905 (95.9%)

134,118 (4.1%)

100% (20/20)

92.7%

Overall

Intersect Intersect

3,277,339 3,286,645

Saliva

a

b

Figure 2 SNV detection and intersection. (a) SNVs detected from the PBMC and saliva samples in each platform were combined. The unions of SNVs in each platform were then intersected. Sensitivity was measured against the Illumina Omni array. Ti/Tv is the transition-to-transversion ratio. The known and novel counts were based on dbSNP. ‘Sanger’ and ‘validated’ represent validation by Sanger sequencing and Illumina sequencing (with Agilent target enrichment capture), respectively. (b) Comparing platform-specific SNVs to non-SNV calls in another platform. IL, Illumina; CG, Complete Genomics.

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencingCG specific Illumina specific Concordant

Total 99,578 — 345,100 — 3,295,023 —Targeted 3,015 3.0% 33,084 9.6% 24,247 0.7%Not validated 388 12.9% 7,088 21.4% 3,053 12.6%Invalidated 1,001 33.2% 9,280 28.0% 1,543 6.4%Validated 1,626 53.9% 16,716 50.5% 19,651 81.0%Validation rate — 61.9% — 64.3% — 92.7%

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

80 VOLUME 30 NUMBER 1 JANUARY 2012 NATURE BIOTECHNOLOGY

To further ascertain whether the platform-specific SNVs might be located in functionally important regions, we examined whether the variant calls were present in the Varimed data-base2,17, which contains variants catalogued through genome-wide association studies and other genetic linkage studies. We found that 31 Illumina- and 3 CG-specific SNVs were present in Varimed, from which we were able to estimate associations between diseases and platform-specific SNPs (Supplementary Table 1). One of these, rs2672598, was called in both PBMCs and saliva by the Illumina platform, but not called in either PBMCs or saliva by the CG platform. This SNP is at the 5 end of HTRA1 and known to increase the risk of age-related macular degeneration by 4.89-fold (P = 3.39 × 10−11)18,19. Another example is the A202T allele in the TERT gene encoding telomerase. This allele has been associated with aplastic ane-mia20 and was only detected by the Illumina platform. Thus, some platform-specific calls are of high importance.

Association of repetitive regions with variant calling differencesIn contrast to coding SNVs, we found that overall the platform- specific SNVs had a substantially stronger association with repeti-tive elements such as Alu, telomere and simple repeat sequences (Fig. 3c,d). For example, only 0.3% of the concordant SNVs were associated with telomere or centromere sequences, but 4% and 2% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with telomeric or centromeric repeats (Fig. 3c,e). The enrichment of platform-specific SNVs with simple repeats and low-complexity repeats was particularly evident. We found that <1% of the concordant SNVs were associated with simple repeats, but 8% and 15% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with these sequences. Among the platform-specific SNVs, CG had a stronger association with the Alu element and centromere and telomere sequences, whereas Illumina

had a stronger association with L1, simple repeat and low-complexity repeat. Overall, these results indicate that many platform-specific SNVs lie in repetitive regions, suggesting that these calls may be due to mapping difficulties and errors.

We also measured GC content and read depth of the SNVs in the gene and repeat regions. The average GC content of the concordant, CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41, respectively. The average read depths were 48, 47 and 44, respectively. Thus, the Illumina-specific SNVs showed a lower GC content and read depth compared to the concordant SNVs. Analysis by gene and repeat regions did not reveal any strong correlation with GC content. However, we found that Illumina-specific SNVs had a strikingly higher read depth in centromeric and telomeric regions, whereas CG had higher read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Differences in indel callsWe also examined small indel calls from Illumina and CG platforms. Small indels ranged in size from −107 to +36 bp by Illumina and −190 to +48 bp by CG. Illumina calls were made using GATK with the

Dindel model21, and CG calls were obtained from their standard pipeline and converted to VCF format22 using the CG conversion tool. A stringent quality score cutoff of 30 was used for each platform. This resulted in a total of 811,903 indel calls with 611,110 for Illumina and 430,258 for CG. We found that only 215,382 (26.5%) indels were detected by both Illumina and CG, whereas

Complete Genomics specific99,578

Illumina specific345,100

CG no-call230,119; 67%

CGSub & other77,196; 22%

CG ref.37,785; 11%

IL no-call74,556; 75%

IL ref. 25,022;25%

CG+IL

Illumina

Union

Blood

3,570,658 3,528,194

Merge

Saliva

Complete Genomics

Blood

Merge

Union2.7% 9.2%

3,295,023Concordant SNPs

88.1%

Sensitivity: 99.34%

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Sensitivity

Concordant

Ti/Tv

Known

Novel

Sanger

Validated

3,394,601

2.13

99,578 (3.0%)

1.68

72,735 (73.0%)

26,843 (27.0%)

94.4% (17/18)

61.9%

3,640,123

2.05

345,100 (10.5%)

1.40

260,108 (75.4%)

84,992 (24.6%)

13.3% (2/15)

64.3%

3,739,701

2.04

99.5%

3,295,023

2.14

3,160,905 (95.9%)

134,118 (4.1%)

100% (20/20)

92.7%

Overall

Intersect Intersect

3,277,339 3,286,645

Saliva

a

b

Figure 2 SNV detection and intersection. (a) SNVs detected from the PBMC and saliva samples in each platform were combined. The unions of SNVs in each platform were then intersected. Sensitivity was measured against the Illumina Omni array. Ti/Tv is the transition-to-transversion ratio. The known and novel counts were based on dbSNP. ‘Sanger’ and ‘validated’ represent validation by Sanger sequencing and Illumina sequencing (with Agilent target enrichment capture), respectively. (b) Comparing platform-specific SNVs to non-SNV calls in another platform. IL, Illumina; CG, Complete Genomics.

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencingCG specific Illumina specific Concordant

Total 99,578 — 345,100 — 3,295,023 —Targeted 3,015 3.0% 33,084 9.6% 24,247 0.7%Not validated 388 12.9% 7,088 21.4% 3,053 12.6%Invalidated 1,001 33.2% 9,280 28.0% 1,543 6.4%Validated 1,626 53.9% 16,716 50.5% 19,651 81.0%Validation rate — 61.9% — 64.3% — 92.7%

!"0&/=&"9+&)23)&

Page 9: 01/$&)234& - ut

I6A/8/6A/6=&."9%A"]-6&

F06%&e#"A&3K&?/6-=B8%6(&"$$"B&

•  FL&)X2Q33)&7"99;&A/=/7=/A&T%=>&F06%&"$$"BQ&**+WY&8$/;/6=Q&

**+4UY&7-67-$A"6=Q&-69B&2+3XY&89":-$0[;8/7%O7&',^;+&&

!"0&/=&"9+&)23)&

Page 10: 01/$&)234& - ut

I6A/8/6A/6=&."9%A"]-6&

F06%&e#"A&3K&?/6-=B8%6(&"$$"B&

•  FL&)X2Q33)&7"99;&A/=/7=/A&T%=>&F06%&"$$"BQ&**+WY&8$/;/6=Q&

**+4UY&7-67-$A"6=Q&-69B&2+3XY&89":-$0[;8/7%O7&',^;+&&

'"6(/$&;/5#/67%6(P&$"6A-09B&;/9/7=/A&',^;Q&1-=>&7-67-$A"6=&"6A&

89":-$0&;8/7%O7&

•  ^"9%A"=/A&)2&-L&)2&7-67-$A"6=&',^;f&)&-L&3W&_34+4Y`&I99#0%6"[

;8/7%O7&"6A&3c&-L&3V&_*U+UY`&\?[;8/7%O7&

!"0&/=&"9+&)23)&

Page 11: 01/$&)234& - ut

I6A/8/6A/6=&."9%A"]-6&

F06%&e#"A&3K&?/6-=B8%6(&"$$"B&

•  FL&)X2Q33)&7"99;&A/=/7=/A&T%=>&F06%&"$$"BQ&**+WY&8$/;/6=Q&

**+4UY&7-67-$A"6=Q&-69B&2+3XY&89":-$0[;8/7%O7&',^;+&&

'"6(/$&;/5#/67%6(P&$"6A-09B&;/9/7=/A&',^;Q&1-=>&7-67-$A"6=&"6A&

89":-$0&;8/7%O7&

•  ^"9%A"=/A&)2&-L&)2&7-67-$A"6=&',^;f&)&-L&3W&_34+4Y`&I99#0%6"[

;8/7%O7&"6A&3c&-L&3V&_*U+UY`&\?[;8/7%O7&

<(%9/6=&'#$/'/9/7=&="$(/=&/6$%7>0/6=&L-$&=-="9&44Q2VU&',^;&

•  ^"9%A"=/A&*)+cY&&-L&7-67-$A"6=f&X3+*Y&\?[;8/7%O7f&XU+4Y&I99#0%6"[;8/7%O7&',^;&

!"0&/=&"9+&)23)&

Page 12: 01/$&)234& - ut

,!&@A521B2& 'C&@A521B2&

,!D'C&

E0FG& H0EG&

\"+&I0I&J1661=7&-./$9"88%6(&',^;&

KK0LY&-L&=-="9&#6%5#/&',^;&HK0HY&-L&=>/&-./$9"88%6(&"$/&

7-67-$A"6=&

'/6;%].%=B&HH0IMY&-L&I99#0%6"&F06%&"$$"B&

!"0&/=&"9+&)23)&

Page 13: 01/$&)234& - ut

,!&@A521B2& 'C&@A521B2&

,!D'C&

E0FG&

H0EG&

\"+&I0I&J1661=7&-./$9"88%6(&',^;&

KK0LY&-L&=-="9&#6%5#/&',^;&

HK0HY&-L&=>/&-./$9"88%6(&"$/&7-67-$A"6=&

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

80 VOLUME 30 NUMBER 1 JANUARY 2012 NATURE BIOTECHNOLOGY

To further ascertain whether the platform-specific SNVs might be located in functionally important regions, we examined whether the variant calls were present in the Varimed data-base2,17, which contains variants catalogued through genome-wide association studies and other genetic linkage studies. We found that 31 Illumina- and 3 CG-specific SNVs were present in Varimed, from which we were able to estimate associations between diseases and platform-specific SNPs (Supplementary Table 1). One of these, rs2672598, was called in both PBMCs and saliva by the Illumina platform, but not called in either PBMCs or saliva by the CG platform. This SNP is at the 5 end of HTRA1 and known to increase the risk of age-related macular degeneration by 4.89-fold (P = 3.39 × 10−11)18,19. Another example is the A202T allele in the TERT gene encoding telomerase. This allele has been associated with aplastic ane-mia20 and was only detected by the Illumina platform. Thus, some platform-specific calls are of high importance.

Association of repetitive regions with variant calling differencesIn contrast to coding SNVs, we found that overall the platform- specific SNVs had a substantially stronger association with repeti-tive elements such as Alu, telomere and simple repeat sequences (Fig. 3c,d). For example, only 0.3% of the concordant SNVs were associated with telomere or centromere sequences, but 4% and 2% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with telomeric or centromeric repeats (Fig. 3c,e). The enrichment of platform-specific SNVs with simple repeats and low-complexity repeats was particularly evident. We found that <1% of the concordant SNVs were associated with simple repeats, but 8% and 15% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with these sequences. Among the platform-specific SNVs, CG had a stronger association with the Alu element and centromere and telomere sequences, whereas Illumina

had a stronger association with L1, simple repeat and low-complexity repeat. Overall, these results indicate that many platform-specific SNVs lie in repetitive regions, suggesting that these calls may be due to mapping difficulties and errors.

We also measured GC content and read depth of the SNVs in the gene and repeat regions. The average GC content of the concordant, CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41, respectively. The average read depths were 48, 47 and 44, respectively. Thus, the Illumina-specific SNVs showed a lower GC content and read depth compared to the concordant SNVs. Analysis by gene and repeat regions did not reveal any strong correlation with GC content. However, we found that Illumina-specific SNVs had a strikingly higher read depth in centromeric and telomeric regions, whereas CG had higher read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Differences in indel callsWe also examined small indel calls from Illumina and CG platforms. Small indels ranged in size from −107 to +36 bp by Illumina and −190 to +48 bp by CG. Illumina calls were made using GATK with the

Dindel model21, and CG calls were obtained from their standard pipeline and converted to VCF format22 using the CG conversion tool. A stringent quality score cutoff of 30 was used for each platform. This resulted in a total of 811,903 indel calls with 611,110 for Illumina and 430,258 for CG. We found that only 215,382 (26.5%) indels were detected by both Illumina and CG, whereas

Complete Genomics specific99,578

Illumina specific345,100

CG no-call230,119; 67%

CGSub & other77,196; 22%

CG ref.37,785; 11%

IL no-call74,556; 75%

IL ref. 25,022;25%

CG+IL

Illumina

Union

Blood

3,570,658 3,528,194

Merge

Saliva

Complete Genomics

Blood

Merge

Union2.7% 9.2%

3,295,023Concordant SNPs

88.1%

Sensitivity: 99.34%

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Sensitivity

Concordant

Ti/Tv

Known

Novel

Sanger

Validated

3,394,601

2.13

99,578 (3.0%)

1.68

72,735 (73.0%)

26,843 (27.0%)

94.4% (17/18)

61.9%

3,640,123

2.05

345,100 (10.5%)

1.40

260,108 (75.4%)

84,992 (24.6%)

13.3% (2/15)

64.3%

3,739,701

2.04

99.5%

3,295,023

2.14

3,160,905 (95.9%)

134,118 (4.1%)

100% (20/20)

92.7%

Overall

Intersect Intersect

3,277,339 3,286,645

Saliva

a

b

Figure 2 SNV detection and intersection. (a) SNVs detected from the PBMC and saliva samples in each platform were combined. The unions of SNVs in each platform were then intersected. Sensitivity was measured against the Illumina Omni array. Ti/Tv is the transition-to-transversion ratio. The known and novel counts were based on dbSNP. ‘Sanger’ and ‘validated’ represent validation by Sanger sequencing and Illumina sequencing (with Agilent target enrichment capture), respectively. (b) Comparing platform-specific SNVs to non-SNV calls in another platform. IL, Illumina; CG, Complete Genomics.

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencingCG specific Illumina specific Concordant

Total 99,578 — 345,100 — 3,295,023 —Targeted 3,015 3.0% 33,084 9.6% 24,247 0.7%Not validated 388 12.9% 7,088 21.4% 3,053 12.6%Invalidated 1,001 33.2% 9,280 28.0% 1,543 6.4%Validated 1,626 53.9% 16,716 50.5% 19,651 81.0%Validation rate — 61.9% — 64.3% — 92.7%

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

80 VOLUME 30 NUMBER 1 JANUARY 2012 NATURE BIOTECHNOLOGY

To further ascertain whether the platform-specific SNVs might be located in functionally important regions, we examined whether the variant calls were present in the Varimed data-base2,17, which contains variants catalogued through genome-wide association studies and other genetic linkage studies. We found that 31 Illumina- and 3 CG-specific SNVs were present in Varimed, from which we were able to estimate associations between diseases and platform-specific SNPs (Supplementary Table 1). One of these, rs2672598, was called in both PBMCs and saliva by the Illumina platform, but not called in either PBMCs or saliva by the CG platform. This SNP is at the 5 end of HTRA1 and known to increase the risk of age-related macular degeneration by 4.89-fold (P = 3.39 × 10−11)18,19. Another example is the A202T allele in the TERT gene encoding telomerase. This allele has been associated with aplastic ane-mia20 and was only detected by the Illumina platform. Thus, some platform-specific calls are of high importance.

Association of repetitive regions with variant calling differencesIn contrast to coding SNVs, we found that overall the platform- specific SNVs had a substantially stronger association with repeti-tive elements such as Alu, telomere and simple repeat sequences (Fig. 3c,d). For example, only 0.3% of the concordant SNVs were associated with telomere or centromere sequences, but 4% and 2% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with telomeric or centromeric repeats (Fig. 3c,e). The enrichment of platform-specific SNVs with simple repeats and low-complexity repeats was particularly evident. We found that <1% of the concordant SNVs were associated with simple repeats, but 8% and 15% of the CG-specific SNVs and Illumina-specific SNVs, respectively, were associated with these sequences. Among the platform-specific SNVs, CG had a stronger association with the Alu element and centromere and telomere sequences, whereas Illumina

had a stronger association with L1, simple repeat and low-complexity repeat. Overall, these results indicate that many platform-specific SNVs lie in repetitive regions, suggesting that these calls may be due to mapping difficulties and errors.

We also measured GC content and read depth of the SNVs in the gene and repeat regions. The average GC content of the concordant, CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41, respectively. The average read depths were 48, 47 and 44, respectively. Thus, the Illumina-specific SNVs showed a lower GC content and read depth compared to the concordant SNVs. Analysis by gene and repeat regions did not reveal any strong correlation with GC content. However, we found that Illumina-specific SNVs had a strikingly higher read depth in centromeric and telomeric regions, whereas CG had higher read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Differences in indel callsWe also examined small indel calls from Illumina and CG platforms. Small indels ranged in size from −107 to +36 bp by Illumina and −190 to +48 bp by CG. Illumina calls were made using GATK with the

Dindel model21, and CG calls were obtained from their standard pipeline and converted to VCF format22 using the CG conversion tool. A stringent quality score cutoff of 30 was used for each platform. This resulted in a total of 811,903 indel calls with 611,110 for Illumina and 430,258 for CG. We found that only 215,382 (26.5%) indels were detected by both Illumina and CG, whereas

Complete Genomics specific99,578

Illumina specific345,100

CG no-call230,119; 67%

CGSub & other77,196; 22%

CG ref.37,785; 11%

IL no-call74,556; 75%

IL ref. 25,022;25%

CG+IL

Illumina

Union

Blood

3,570,658 3,528,194

Merge

Saliva

Complete Genomics

Blood

Merge

Union2.7% 9.2%

3,295,023Concordant SNPs

88.1%

Sensitivity: 99.34%

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Specific

Ti/Tv

Known

Novel

Sanger

Validated

Total

Ti/Tv

Sensitivity

Concordant

Ti/Tv

Known

Novel

Sanger

Validated

3,394,601

2.13

99,578 (3.0%)

1.68

72,735 (73.0%)

26,843 (27.0%)

94.4% (17/18)

61.9%

3,640,123

2.05

345,100 (10.5%)

1.40

260,108 (75.4%)

84,992 (24.6%)

13.3% (2/15)

64.3%

3,739,701

2.04

99.5%

3,295,023

2.14

3,160,905 (95.9%)

134,118 (4.1%)

100% (20/20)

92.7%

Overall

Intersect Intersect

3,277,339 3,286,645

Saliva

a

b

Figure 2 SNV detection and intersection. (a) SNVs detected from the PBMC and saliva samples in each platform were combined. The unions of SNVs in each platform were then intersected. Sensitivity was measured against the Illumina Omni array. Ti/Tv is the transition-to-transversion ratio. The known and novel counts were based on dbSNP. ‘Sanger’ and ‘validated’ represent validation by Sanger sequencing and Illumina sequencing (with Agilent target enrichment capture), respectively. (b) Comparing platform-specific SNVs to non-SNV calls in another platform. IL, Illumina; CG, Complete Genomics.

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencingCG specific Illumina specific Concordant

Total 99,578 — 345,100 — 3,295,023 —Targeted 3,015 3.0% 33,084 9.6% 24,247 0.7%Not validated 388 12.9% 7,088 21.4% 3,053 12.6%Invalidated 1,001 33.2% 9,280 28.0% 1,543 6.4%Validated 1,626 53.9% 16,716 50.5% 19,651 81.0%Validation rate — 61.9% — 64.3% — 92.7%

!"0&/=&"9+&)23)&

Page 14: 01/$&)234& - ut

!!"#$%&'()'

SNV quality in different platforms. Comparing the phred-scale quality scores of all SNVs from each platform to that of the specific SNVs. Each column represents (from top to bottom) the upper whisker, upper hinge, median, lower hinge and lower whisker. Outliners are not shown.!

Nature Biotechnology: doi:10.1038/nbt.2065

•  ]g=.&-L&7-67-$A"6=&',^;&T";&./$B&79-;/&=-&=>"=&/d8/7=/AQ&T>/$/";&=>/&89":-$0[;8/7%O7&]g=.&T";&0#7>&9-T/$&&

•  e#"9%=B&;7-$/;&-L&89":-$0[;8/7%O7&',^;&T/$/&9-T/$&

!"0&/=&"9+&)23)&

Page 15: 01/$&)234& - ut

•  <./$"(/&?\&7-6=/6=&

•  \-67-$A"6=P&2+UX&

•  \?[;8/7%O7P&2+UW&

•  &I99#0%6"[;8/7%O7P&2+U3&

•  <./$"(/&$/"A&A/8=>;P&UVQ&Uc&"6A&UUQ&$/;8/7]./9B+&&

•  ,-&;=$-6(&7-$$/9"]-6&-L&',^&A/=/7]-6&T%=>&?\&7-6=/6=+&&

!"0&/=&"9+&)23)&

Page 16: 01/$&)234& - ut

•  @9":-$0[;8/7%O7&',^;&";;-7%"=/A&T%=>&$/8/]]./&/9/0/6=;&;#7>&

";&<9#Q&=/9-0/$/&"6A&;%089/&$/8/"=&;/5#/67/;&&

•  F69B&2+4Y&-L&=>/&7-67-$A"6=&',^;&%6&=/9-0/$/;&-$&7/6=$-0/$/;Q&

-88-;/A&=-&UY&-L&\?[;8/7%O7&"6A&)Y&-L&I99#0%6"[;8/7%O7&&

©2

01

2 N

atu

re A

me

ric

a, I

nc

. A

ll r

igh

ts r

es

erv

ed

.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 81

390,060 (48.1%) and 206,461 (25.4%) were Illumina- and CG-specific, respectively (Fig. 4a). Owing to the complexity of indels compared to SNVs, the number of concordant indels was much lower than the number of concordant SNVs. We also observed that the indels

detected by both platforms were similar in their size distribution and type (Fig. 4b), though it is noteworthy that the Illumina data showed a slight enrichment of 1-bp insertions, whereas the CG data showed a slight enrichment of 1-bp deletions.

d e25

20

16

19

22

14

21

14

1

L1 Alu Simple repeat Lowcomplexity

8

02

3

1515

10

5

0

Perc

ent ass

oci

atio

n o

f S

NV

s

b16

14

12

10

8

6

4

2

0Splicing ncRNA Upstream Downstream

0 0 0

11

14

12

1 1 11 1 1P

erc

ent ass

oci

atio

n o

f S

NV

s

c

Centromere Telomere tRNA rRNA

5

4

2

3

1

0

0

4 4

2

0 0 0 00

0

2

0Perc

ent ass

oci

atio

n o

f S

NV

s

70

Perc

ent ass

oci

atio

n o

f S

NV

s

a26,542

20,591

7,498

1,872 2,469

3832

32

5659 60

785

1,762190

0 0 1 1 1 1 1 1 1

725

60

50

40

30

20

10

0UTR5

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c Concordant CG speci!c IL speci!c

UTR3 Exonic Intronic Intergenic

3

2

1

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

x

y50

250

150

125

100

75

50 25 0 5025

0

25 0

50 25 050 25

0 75 50

25 0

75 50 25 0

75 50

025

100 75 50 25

0

100

75

50

25 0

10075

5025

0

125

100

75

50

25

012

5 10

0 75

50

25

0

125

100

75

50

25

0

0 25

50 75

100

125

150

175

0 25 5075 100 125 150 175

0 25 50 75

100 125

150 175

025

5075 100 125 150

0 25 50 75 100 125 150025

75 100

12502550

100

75

125

50

0 25

50

75

100

125

150

175

200

225

0 2

550

75

100

125

150

175

200

225

Figure 3 SNV association with different genomic elements. (a) Gene elements: UTR, exonic, intronic and intergenic regions. Inset: number of SNVs associated with UTR5, UTR3 and exonic regions. (b) Gene elements: splicing sites, noncoding RNA and upstream/downstream (<1 kb) regions of genes. (c) Repetitive elements: centromere, telomere, tRNA and rRNA. (d) Repetitive elements: L1, Alu, simple repeat and low-complexity repeat. (e) SNV frequency at different chromosomal locations. Tracks from outer to inner: SNV frequency for Illumina (IL), Complete Genomics (CG), concordant, IL-specific and CG-specific calls. Outermost: chromosome ideogram.

Complete Genomics

Blood

361,783 341,172

Merge

Union

Intersect

Total

Total 811,903

215,382Concordant

Specific 206,461 (48.0%)

430,258

Saliva Blood

523,445 555,770

Merge

Union

Total

Specific 390,060 (63.8%)

Intersect

611,110

Saliva

206,461CG-specific

(25.4%)

215,382Concordant indels

(26.5%)

Overall

CG+IL

390,060IL-specific(48.1%)

Illuminaa Complete Genomics

Illumina160,000

140,000

120,000

100,000

80,000

60,000

40,000

–72–6

8–6

4–6

0–5

6–5

2–4

8–4

4–4

0–3

6–3

2–2

8–2

4–2

0–1

6–1

2 –8 –4 00

Indel size

4 8 12 16 20 24 28 32 36 40 44 48

20,000

b

Figure 4 Indel detection and intersection. (a) Indels detected from the PBMC and saliva samples in each platform were combined. The unions of indels in each platform were then intersected. Note: 5,668 IL and 8,415 CG indels were removed after 5b-window merging. (b) Indel size distribution. Negative size represents deletion and positive size represents insertion.

©2

01

2 N

atu

re A

me

ric

a, I

nc

. A

ll r

igh

ts r

es

erv

ed

.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 81

390,060 (48.1%) and 206,461 (25.4%) were Illumina- and CG-specific, respectively (Fig. 4a). Owing to the complexity of indels compared to SNVs, the number of concordant indels was much lower than the number of concordant SNVs. We also observed that the indels

detected by both platforms were similar in their size distribution and type (Fig. 4b), though it is noteworthy that the Illumina data showed a slight enrichment of 1-bp insertions, whereas the CG data showed a slight enrichment of 1-bp deletions.

d e25

20

16

19

22

14

21

14

1

L1 Alu Simple repeat Lowcomplexity

8

02

3

1515

10

5

0

Perc

ent ass

oci

atio

n o

f S

NV

s

b16

14

12

10

8

6

4

2

0Splicing ncRNA Upstream Downstream

0 0 0

11

14

12

1 1 11 1 1P

erc

ent ass

oci

atio

n o

f S

NV

s

c

Centromere Telomere tRNA rRNA

5

4

2

3

1

0

0

4 4

2

0 0 0 00

0

2

0Perc

ent ass

oci

atio

n o

f S

NV

s

70

Perc

ent ass

oci

atio

n o

f S

NV

s

a26,542

20,591

7,498

1,872 2,469

3832

32

5659 60

785

1,762190

0 0 1 1 1 1 1 1 1

725

60

50

40

30

20

10

0UTR5

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c Concordant CG speci!c IL speci!c

UTR3 Exonic Intronic Intergenic

3

2

1

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

x

y50

250

150

125

100

75

50 25 0 5025

0

25 0

50 25 050 25

0 75 50

25 0

75 50 25 0

75 50

025

100 75 50 25

0

100

75

50

25 0

10075

5025

0

125

100

75

50

25

012

5 10

0 75

50

25

0

125

100

75

50

25

0

0 25

50 75

100

125

150

175

0 25 5075 100 125 150 175

0 25 50 75

100 125

150 175

025

5075 100 125 150

0 25 50 75 100 125 150025

75 100

12502550

100

75

125

50

0 25

50

75

100

125

150

175

200

225

0 2

550

75

100

125

150

175

200

225

Figure 3 SNV association with different genomic elements. (a) Gene elements: UTR, exonic, intronic and intergenic regions. Inset: number of SNVs associated with UTR5, UTR3 and exonic regions. (b) Gene elements: splicing sites, noncoding RNA and upstream/downstream (<1 kb) regions of genes. (c) Repetitive elements: centromere, telomere, tRNA and rRNA. (d) Repetitive elements: L1, Alu, simple repeat and low-complexity repeat. (e) SNV frequency at different chromosomal locations. Tracks from outer to inner: SNV frequency for Illumina (IL), Complete Genomics (CG), concordant, IL-specific and CG-specific calls. Outermost: chromosome ideogram.

Complete Genomics

Blood

361,783 341,172

Merge

Union

Intersect

Total

Total 811,903

215,382Concordant

Specific 206,461 (48.0%)

430,258

Saliva Blood

523,445 555,770

Merge

Union

Total

Specific 390,060 (63.8%)

Intersect

611,110

Saliva

206,461CG-specific

(25.4%)

215,382Concordant indels

(26.5%)

Overall

CG+IL

390,060IL-specific(48.1%)

Illuminaa Complete Genomics

Illumina160,000

140,000

120,000

100,000

80,000

60,000

40,000

–72–6

8–6

4–6

0–5

6–5

2–4

8–4

4–4

0–3

6–3

2–2

8–2

4–2

0–1

6–1

2 –8 –4 00

Indel size

4 8 12 16 20 24 28 32 36 40 44 48

20,000

b

Figure 4 Indel detection and intersection. (a) Indels detected from the PBMC and saliva samples in each platform were combined. The unions of indels in each platform were then intersected. Note: 5,668 IL and 8,415 CG indels were removed after 5b-window merging. (b) Indel size distribution. Negative size represents deletion and positive size represents insertion.

!"0&/=&"9+&)23)&

Page 17: 01/$&)234& - ut

I6A/9&7"99;&"$/&($/"=9B&A%;7-$A"6=&

©20

12 N

atur

e A

mer

ica,

Inc.

All

righ

ts r

eser

ved.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 81

390,060 (48.1%) and 206,461 (25.4%) were Illumina- and CG-specific, respectively (Fig. 4a). Owing to the complexity of indels compared to SNVs, the number of concordant indels was much lower than the number of concordant SNVs. We also observed that the indels

detected by both platforms were similar in their size distribution and type (Fig. 4b), though it is noteworthy that the Illumina data showed a slight enrichment of 1-bp insertions, whereas the CG data showed a slight enrichment of 1-bp deletions.

d e25

20

16

19

22

14

21

14

1

L1 Alu Simple repeat Lowcomplexity

8

02

3

1515

10

5

0

Per

cent

ass

ocia

tion

of S

NV

s

b16

14

12

10

8

6

4

2

0Splicing ncRNA Upstream Downstream

0 0 0

11

14

12

1 1 11 1 1P

erce

nt a

ssoc

iatio

n of

SN

Vs

c

Centromere Telomere tRNA rRNA

5

4

2

3

1

0

0

4 4

2

0 0 0 00

0

2

0Per

cent

ass

ocia

tion

of S

NV

s

70

Per

cent

ass

ocia

tion

of S

NV

s

a26,542

20,591

7,498

1,872 2,469

3832

32

5659 60

785

1,762190

0 0 1 1 1 1 1 1 1

725

60

50

40

30

20

10

0UTR5

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c

Concordant CG speci!c IL speci!c Concordant CG speci!c IL speci!c

UTR3 Exonic Intronic Intergenic

3

2

1

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

x

y50 250

150

125

100 75 50

25 0 50250

25 0

50 25 050 25

0 75 50

25 0

75 50 25 0

75 50

025

100 75 50 25

0

100 75

50

25 0

10075

5025

0

125

100

75

50

25

012

5 10

0 75

50

25

0

125

100

75

50

25

0

0 25

50 75 100

125

150

175

0 25 5075 100 125 150 175

0 25 50 75

100 125

150 175

025

5075 100 125 150

0 25 50 75 100 125 15002575 100

12502550

10075

125

50

0 25

50

75

100

125

150

175

200

225

0 25

50 7

5 1

00 1

25

150

175

200

225

Figure 3 SNV association with different genomic elements. (a) Gene elements: UTR, exonic, intronic and intergenic regions. Inset: number of SNVs associated with UTR5, UTR3 and exonic regions. (b) Gene elements: splicing sites, noncoding RNA and upstream/downstream (<1 kb) regions of genes. (c) Repetitive elements: centromere, telomere, tRNA and rRNA. (d) Repetitive elements: L1, Alu, simple repeat and low-complexity repeat. (e) SNV frequency at different chromosomal locations. Tracks from outer to inner: SNV frequency for Illumina (IL), Complete Genomics (CG), concordant, IL-specific and CG-specific calls. Outermost: chromosome ideogram.

Complete Genomics

Blood

361,783 341,172

Merge

Union

Intersect

Total

Total 811,903

215,382Concordant

Specific 206,461 (48.0%)

430,258

Saliva Blood

523,445 555,770

Merge

Union

Total

Specific 390,060 (63.8%)

Intersect

611,110

Saliva

206,461CG-specific

(25.4%)

215,382Concordant indels

(26.5%)

Overall

CG+IL

390,060IL-specific(48.1%)

Illuminaa Complete Genomics

Illumina160,000

140,000

120,000

100,000

80,000

60,000

40,000

–72–6

8–6

4–6

0–5

6–5

2–4

8–4

4–4

0–3

6–3

2–2

8–2

4–2

0–1

6–1

2 –8 –4 00

Indel size

4 8 12 16 20 24 28 32 36 40 44 48

20,000

b

Figure 4 Indel detection and intersection. (a) Indels detected from the PBMC and saliva samples in each platform were combined. The unions of indels in each platform were then intersected. Note: 5,668 IL and 8,415 CG indels were removed after 5b-window merging. (b) Indel size distribution. Negative size represents deletion and positive size represents insertion.

!"0&/=&"9+&)23)&

Page 18: 01/$&)234& - ut

\-679#;%-6;&

•  D"7>&(/6-0/&;/5#/67%6(&"88$-"7>&%;&(/6/$"99B&7"8"19/&-L&A/=/7]6(&0-;=&',^;&

•  \?&"88/"$;&=-&1/&0-$/&"77#$"=/Q&1#=&"9;-&;9%(>=9B&9/;;&;/6;%]./&&

•  99#0%6"&7-./$;&0-$/&1";/;&"6A&0"h/;&"&>%(>/$&6#01/$&-L&

-./$"99&7"99;Q&1#=&"9;-&>";&0-$/&L"9;/&8-;%]./;&&

!"0&/=&"9+&)23)&

Page 19: 01/$&)234& - ut

\-679#;%-6;&

•  D"7>&(/6-0/&;/5#/67%6(&"88$-"7>&%;&(/6/$"99B&7"8"19/&-L&A/=/7]6(&0-;=&',^;&

•  \?&"88/"$;&=-&1/&0-$/&"77#$"=/Q&1#=&"9;-&;9%(>=9B&9/;;&;/6;%]./&&

•  99#0%6"&7-./$;&0-$/&1";/;&"6A&0"h/;&"&>%(>/$&6#01/$&-L&

-./$"99&7"99;Q&1#=&"9;-&>";&0-$/&L"9;/&8-;%]./;&&

•  a-=>&0/=>-A;&79/"$9B&7"99&."$%"6=;&0%;;/A&1B&=>/&-=>/$&

=/7>6-9-(B&&

•  i%9=/$%6(&L-$&$/8/"=&$/(%-6;&"6A&5#"9%=B&>/98;&=-&$/A#7/&/$$-$;&

•  i-$&0"d%0%j/A&;/6;%].%=B&"6A&5#"9%=BQ&;/5#/67/&%6&8"$"99/9&

!"0&/=&"9+&)23)&

Page 20: 01/$&)234& - ut

•  UWUgH-7>/&?'&i!k&•  I99#0%6"&R%'/5&)222&&

•  <aI&'F!%J&4&D\\&

•  F./$9"8&-L&',^;&

•  3&J,<&;"089/&

Comparison of Sequencing Platforms for SingleNucleotide Variant Calls in a Human SampleAakrosh Ratan1., Webb Miller1, Joseph Guillory2, Jeremy Stinson2, Somasekar Seshagiri2,

Stephan C. Schuster1*.

1Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania, United States of America, 2Department of Molecular

Biology, Genentech Inc., South San Francisco, California, United States of America

Abstract

Next-generation sequencings platforms coupled with advanced bioinformatic tools enable re-sequencing of the humangenome at high-speed and large cost savings. We compare sequencing platforms from Roche/454(GS FLX), Illumina/HiSeq(HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in wholegenome sequences from the same human sample. We report on significant GC-related bias observed in the data sequencedon Illumina and SOLiD platforms. The differences in the variant calls were investigated with regards to coverage, andsequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using massspectrometry; a method that is independent of DNA sequencing. We establish several causes why variants remainedunreported, specific to each platform. We report the indel called using the three sequencing technologies and from theobtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries isbeneficial when high level of accuracy is required.

Citation: Ratan A, Miller W, Guillory J, Stinson J, Seshagiri S, et al. (2013) Comparison of Sequencing Platforms for Single Nucleotide Variant Calls in a HumanSample. PLoS ONE 8(2): e55089. doi:10.1371/journal.pone.0055089

Editor: Zhaoxia Yu, University of California Irvine, United States of America

Received June 14, 2012; Accepted December 23, 2012; Published February 6, 2013

Copyright: ! 2013 Ratan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study is based on sequencing data generated for the Southern African genomes Project [5]. The 454 data was generated at Penn State Universityand supported by Roche/454 via the donation of sequencing reagents. We like to thank Timothy Harkins and Kevin McCarren for making SOLiD data for the KB1genome available to this group of investigators. The sequence data for Illumina GA IIx and HiSeq 2000 was generated at Penn State University and supportedthrough internal funds. The funders had no role in study design, data collection (other than what has been stated above) and analysis, decision to publish, orpreparation of the manuscript. This project is funded, in part, under a grant by the Pennsylvania Department of Health using Tobacco CURE Funds to AR. ThePennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions.

Competing Interests: Some of the authors are employed by a commercial company, Genentech Inc. This does not alter the authors’ adherence to all the PLOSONE policies on sharing data and materials.

* E-mail: [email protected]

. These authors contributed equally to this work.

Introduction

The Human Genome Project [1,2] published the first workingdraft of the human reference sequence in 2000. That sequence wasgenerated in its entirety by capillary sequencing; all subsequentgenomes of human individuals [3–9], except one [4], have reliedon next-generation sequencing (NGS) platforms. Starting in 2005,454/Roche [10], and subsequently Illumina [11] and SOLiD/ABI[12] entered the market with technology that ultimately aims to re-sequence human genomes for less than $1000, which wouldtransform the field of personalized medicine. While several newentrants may have the future potential to change the sequencinglandscape yet again, the current sequencing market is dominatedby these three matured platforms. With the introduction of thesetechnologies, reports of biases in all platforms [13], as well asefforts to monitor [14], remove [15], or compensate for themarose. Initially, the three sequencing approaches were evaluated intargeted regions (up to 4 Mbp) [16] where they were comparedwith Sanger-generated sequences for validation. A recent studycompared the accuracy of the SNP calls and the quality of theshort-reads from the three platforms in an E. coli sample [17],while another study compared single nucleotide variants from

Illumina data with the data from Complete Genomics, anotherentrant in the sequencing landscape [18].Here we compare three platforms namely 454/Roche GS FLX,

Illumina HiSeq 2000 and ABI SOLiD 3 ECC in their ability toidentify the single-nucleotide substitutions in the same humanindividual. In contrast to previous studies that generated asaturating level of redundant coverage to eliminate low coverageas a factor in SNP calling, we sequenced the individual’s DNA toread-depths that allows for variant detection in each correspond-ing dataset with sufficient confidence. This allows us to assess theperformance and biases of each sequencing platform, as it wouldaffect the completeness of the variant detection in a genome-sequencing project. In this study we show that both IlluminaHiSeq 2000 and SOLiD 3 ECC sequencing exhibit variation incoverage with GC content, and report on the probable reasonswhy certain SNPs are missed by each of the technologies. We alsodiscuss a method to calculate the uniquely mappable regions of areference genome, which can then be used to filter SNPs andimprove the quality of the variant calls, thereby avoiding thegeneration of false positive variants. This approach is especiallyuseful for 454 GS FLX sequences analyzed using the softwareNewbler [3], which does not utilize the concept of mapping quality[19].

PLOS ONE | www.plosone.org 1 February 2013 | Volume 8 | Issue 2 | e55089

H"="6&/=&"9+&)234&

Page 21: 01/$&)234& - ut

Results

Generation and Alignment of ReadsWe sequenced the genomic DNA from a human DNA sample

called KB1 [5]; to 10.04 fold coverage using 454 GS FLXsequencer, 58.89 fold coverage using Illumina HiSeq 2000sequencer and 78.63 fold coverage using SOLiD 3 ECCtechnology (Table 1). All three platforms are expected to exhibitdifferent error characteristics and therefore should complementone another to yield the most accurate set of human singlenucleotide variants. Since the dominant type of error for SOLiDand Illumina reads is substitutions, it is possible to compare thesequences using similar alignment criteria and software. Using thesoftware BWA [20] (version 0.5.9rc), we aligned the SOLiD andIllumina reads to the human reference GRCh37, henceforthreferred to as hg19. In contrast to randomly dispersed Illuminaand SOLiD sequencing templates, the array-based pyrosequenc-ing technology used by 454 generates sequences with homopol-ymer errors (indels in runs of the same nucleotide). The assembly/mapping software Newbler aligns these data in a format calledflow-space in an attempt to correct these systematic errorsassociated with pyrosequencing. We used Newbler version 2.3 toalign the single-end fragment reads to hg19 with the defaultparameters. Only the two platforms 454 and Illumina exclude thelow-quality reads using filters prior to reporting the final set ofreads to the user; a fact that is reflected in a higher alignment ratefor these two technologies. In contrast, the SOLiD system uses thealignment to a reference as a mean to determine reads of sufficientquality (Table 1).

Coverage Distribution and VariationThe depth-of-coverage distributions of the reference genome

from data from the three sequencing platforms are shown inFigure 1. The coverage distributions are bimodal (Figure 1), withthe two modes that at first appeared to reflect the coverage on theautosomes and the sex chromosomes. However, we found thatremoving the sex chromosomes from the analysis did not eliminatethe bimodal behavior. Since, earlier studies reported a decrease incoverage in A/T rich regions with Illumina sequencing [21], weinvestigated the correlation between the GC content and coveragefor the three platforms, for potentially influencing bimodalbehavior. Figure 2 shows a significant variation in coverage withGC content; coverage by Illumina and SOLiD sequences isnotably lower in G/C rich regions. This variation with GCcontent, along with the expected haploid coverage on sexchromosomes, explains the observed bimodal behavior of these

distributions. Despite sharing the emulsion PCR approach in thesequencing template preparation with the SOLiD 3 system, onlythe Roche/454 FLX sequencing chemistry seemed to be immuneto this bias, as is demonstrated with only a minor correlationbetween GC content and coverage for 454 reads (Figure 2). Thebehavior exhibited by Illumina HiSeq sequences is in starkcontrast to the behavior of GA II reads, which exhibit a lowercoverage in A/T rich regions.

Detection of SNPs and IndelsA large portion of the genomic regions requires local

realignments due to the presence of indels in the sequencedgenome when compared to the reference genome. Indels can leadto alignment artifacts where a lot of bases around the indel do notagree with the reference and can masquerade as SNPs. We usedthe realignment tool in GATK [22] version 1.2–29 to realign thesequences from the Illumina and SOLiD dataset; followed by useof SAMtools version 0.1.16 to call variants. As described in the

Table 1. Sequencing and alignment statistics. Coverage is calculated with and without the putative PCR duplicates.

454 Illumina SOLiD

Number of reads generated 83,331,227 1,867,073,052 6,905,193,148

Number of bases generated 29,246,232,549 188,349,876,745 397,681,271,500

Read lengths 350 avg. single-end 101 paired-end 50 paired-end, 75 single-end

Number of reads aligned 82,310,265 (98.77%) 1,751,042,389 (93.79%) 4,429,505,837 (64.15%)

Number of bases aligned 28,732,501,185 (98.24%) 168,495,777,999 (89.46%) 224,998,686,646 (56.58%)

Coverage 10.04/9.78 X 58.89/55.06 X 78.63/53.20 X

Duplicate reads 2,211,903 115,528,614 1,216,108,795

Reference bases covered 2,781,827,482 2,858,458,440 2,850,277,778

The number of aligned reads includes the duplicate reads.doi:10.1371/journal.pone.0055089.t001

Figure 1. Depth of coverage distribution for the threeplatforms. The y-axis indicates the fraction of the bases in thereference sequence that has a particular coverage. This does notinclude secondary alignments and potential PCR duplicates. The dashedlighter curves depict the coverage distribution as calculated using aPoisson model for each sequencing technology.doi:10.1371/journal.pone.0055089.g001

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 2 February 2013 | Volume 8 | Issue 2 | e55089•  M6/./6&7-./$"(/&1/=T//6&89":-$0;&

•  lT/&;/5#/67/A&=>/&%6A%.%A#"9Z;&J,<&=-&$/"A[A/8=>;&=>"=&"99-T;&L-$&."$%"6=&A/=/7]-6&%6&/"7>&7-$$/;8-6A%6(&A"=";/=&T%=>&;#m7%/6=&7-6OA/67/&l&

H"="6&/=&"9+&)234&

Page 22: 01/$&)234& - ut

\-./$"(/&

•  UWUP&32d&•  I99#0%6"P&W*d&•  'F!%JP&c*d&

I!&

Results

Generation and Alignment of ReadsWe sequenced the genomic DNA from a human DNA sample

called KB1 [5]; to 10.04 fold coverage using 454 GS FLXsequencer, 58.89 fold coverage using Illumina HiSeq 2000sequencer and 78.63 fold coverage using SOLiD 3 ECCtechnology (Table 1). All three platforms are expected to exhibitdifferent error characteristics and therefore should complementone another to yield the most accurate set of human singlenucleotide variants. Since the dominant type of error for SOLiDand Illumina reads is substitutions, it is possible to compare thesequences using similar alignment criteria and software. Using thesoftware BWA [20] (version 0.5.9rc), we aligned the SOLiD andIllumina reads to the human reference GRCh37, henceforthreferred to as hg19. In contrast to randomly dispersed Illuminaand SOLiD sequencing templates, the array-based pyrosequenc-ing technology used by 454 generates sequences with homopol-ymer errors (indels in runs of the same nucleotide). The assembly/mapping software Newbler aligns these data in a format calledflow-space in an attempt to correct these systematic errorsassociated with pyrosequencing. We used Newbler version 2.3 toalign the single-end fragment reads to hg19 with the defaultparameters. Only the two platforms 454 and Illumina exclude thelow-quality reads using filters prior to reporting the final set ofreads to the user; a fact that is reflected in a higher alignment ratefor these two technologies. In contrast, the SOLiD system uses thealignment to a reference as a mean to determine reads of sufficientquality (Table 1).

Coverage Distribution and VariationThe depth-of-coverage distributions of the reference genome

from data from the three sequencing platforms are shown inFigure 1. The coverage distributions are bimodal (Figure 1), withthe two modes that at first appeared to reflect the coverage on theautosomes and the sex chromosomes. However, we found thatremoving the sex chromosomes from the analysis did not eliminatethe bimodal behavior. Since, earlier studies reported a decrease incoverage in A/T rich regions with Illumina sequencing [21], weinvestigated the correlation between the GC content and coveragefor the three platforms, for potentially influencing bimodalbehavior. Figure 2 shows a significant variation in coverage withGC content; coverage by Illumina and SOLiD sequences isnotably lower in G/C rich regions. This variation with GCcontent, along with the expected haploid coverage on sexchromosomes, explains the observed bimodal behavior of these

distributions. Despite sharing the emulsion PCR approach in thesequencing template preparation with the SOLiD 3 system, onlythe Roche/454 FLX sequencing chemistry seemed to be immuneto this bias, as is demonstrated with only a minor correlationbetween GC content and coverage for 454 reads (Figure 2). Thebehavior exhibited by Illumina HiSeq sequences is in starkcontrast to the behavior of GA II reads, which exhibit a lowercoverage in A/T rich regions.

Detection of SNPs and IndelsA large portion of the genomic regions requires local

realignments due to the presence of indels in the sequencedgenome when compared to the reference genome. Indels can leadto alignment artifacts where a lot of bases around the indel do notagree with the reference and can masquerade as SNPs. We usedthe realignment tool in GATK [22] version 1.2–29 to realign thesequences from the Illumina and SOLiD dataset; followed by useof SAMtools version 0.1.16 to call variants. As described in the

Table 1. Sequencing and alignment statistics. Coverage is calculated with and without the putative PCR duplicates.

454 Illumina SOLiD

Number of reads generated 83,331,227 1,867,073,052 6,905,193,148

Number of bases generated 29,246,232,549 188,349,876,745 397,681,271,500

Read lengths 350 avg. single-end 101 paired-end 50 paired-end, 75 single-end

Number of reads aligned 82,310,265 (98.77%) 1,751,042,389 (93.79%) 4,429,505,837 (64.15%)

Number of bases aligned 28,732,501,185 (98.24%) 168,495,777,999 (89.46%) 224,998,686,646 (56.58%)

Coverage 10.04/9.78 X 58.89/55.06 X 78.63/53.20 X

Duplicate reads 2,211,903 115,528,614 1,216,108,795

Reference bases covered 2,781,827,482 2,858,458,440 2,850,277,778

The number of aligned reads includes the duplicate reads.doi:10.1371/journal.pone.0055089.t001

Figure 1. Depth of coverage distribution for the threeplatforms. The y-axis indicates the fraction of the bases in thereference sequence that has a particular coverage. This does notinclude secondary alignments and potential PCR duplicates. The dashedlighter curves depict the coverage distribution as calculated using aPoisson model for each sequencing technology.doi:10.1371/journal.pone.0055089.g001

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 2 February 2013 | Volume 8 | Issue 2 | e55089

H/0-.%6(&=>/&;/d&7>$-0-;-0/;&L$-0&=>/&"6"9B;%;&A%A&6-=&/9%0%6"=/&=>/&1%0-A"9&1/>".%-$&&

H"="6&/=&"9+&)234&

Page 23: 01/$&)234& - ut

!"0&/=&"9+&)23)P&

<./$"(/&7-./$"(/&cXd&

\?&9/;;&#6%L-$0&%6&7-./$"(/&

©20

12 N

atur

e A

mer

ica,

Inc.

All

righ

ts r

eser

ved.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 79

Extensive differences in variant callingWe sought to compare the sensitivity and accuracy of each platform for SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique SNVs were concordant—that is, either a homozygous or heterozygous SNV was detected at the same locus by the two platforms in at least one sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the other but not both, of which 345,100 were specific to Illumina (10.5% of the Illumina combined SNVs) and 99,578 were CG-specific (3.0% of the CG combined SNVs). Among the Illumina-specific SNVs, 67% were ‘no-calls’ (that is, not a reference or variant call), 11% were reference calls and 22% were other types of calls (that is, complex and substitution calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of no-calls in Illumina is likely because GATK does not make the complex and substitution calls as does the CG pipeline.

To assess the quality of the calls, we used four criteria: the transition/transversion ratio (ti/tv), quality scores, the heterozygous/homozygous call ratio and novel, platform-specific SNVs. The ti/tv ratio of 2.1 for SNVs in humans has been described in several previous studies, including the 1000 Genomes Project11. The ti/tv ratio for all of the SNVs detected in these genomes was 2.04, but in our data the ti/tv of SNVs concordant between the two platforms was 2.14. For all SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs specific to Illumina it was only 1.40. Similarly, for SNVs detected by CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the ti/tv of concordant SNVs was very close to that expected, whereas the platform-specific ti/tv was much lower, suggesting that the platform-specific calls were of lower accuracy. Inspection of the quality scores of the platform-specific SNVs showed that they were indeed lower than those for the concordant calls (Supplementary Fig. 1). Furthermore, the heterozygous/homozygous call ratio was 1.48 for the concordant calls, whereas the platform-specific ratios were indeed higher: 2.48 for Illumina-specific calls and 1.98 for CG-specific calls.

To examine the fraction of novel platform-specific SNVs, we noted that 3,160,905 (96.0%) of the concordant SNVs were present in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the CG-specific set were present in dbSNP131. Thus, the platform-specific call sets were enriched for novel SNVs, suggesting that they likely contain more errors. In addition, the overall genotype concordance rate (that is, the proportion of concordant calls having a consistent genotype—heterozygous or homozygous—across both platforms) for the concordant SNVs was 98.9%. The high genotype concordance rate and percentage of known SNVs indicate that the concordant SNVs were of high quality and accuracy.

To further assess the accuracy of the variant calling, we sought to vali-date our SNVs by using Omni Quad 1M Genotyping arrays, traditional Sanger sequencing and Agilent SureSelect target enrichment capture followed by sequencing on an Illumina HiSeq for both samples. Of the 260,112 heterozygous calls detected with the Omni array, 99.5% were present in the entire SNV data set, 99.34% were concordant calls and only 0.16% were platform-specific SNVs. This demonstrates that both plat-forms are sensitive to known SNVs and that few known single-nucleotide polymorphisms (SNPs) are detected by only one platform.

To directly determine accuracy, we sequenced randomly selected concordant and platform-specific regions for Sanger sequencing. We found that 20 of 20 concordant SNVs could be validated, whereas 2 of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific SNVs could be validated. This suggests CG has higher accuracy than Illumina and that almost all the concordant calls are correct.

To attempt to examine accuracy on a larger scale, we used Agilent SureSelect target enrichment capture technology to capture 33,084 (9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%) concordant SNVs for sequencing on an Illumina Hi-Seq instrument (Table 2). We found that the validation rate for the concordant SNVs was 92.7%, whereas the validation rate was 61.9% and 64.3% for the CG-specific and Illumina-specific SNVs. These results indicate that the platform-specific calls have a very high false-positive rate of at least 35%. We also found that 12.6–21.4% of the targeted SNVs were not called in the validation, possibly owing to nonunique regions that are difficult to map precisely. Because the capture validation was performed using Illumina DNA sequencing technology, it is diffi-cult to directly compare the Illumina versus CG SNV rates with this approach. Nonetheless, these overall results indicate that concordant SNVs have high accuracy and platform-specific SNVs have a high false-positive rate.

Association of genes with variant calling differencesTo better understand the platform-specific calls, we investigated the association of SNVs from each platform with different genomic elements. We annotated both the platform-specific SNVs and con-cordant SNVs with gene and repeat annotations using Annovar16. In general, we did not find a significant difference between the associa-tions of the platform-specific SNVs and the concordant SNVs with gene elements, such as exons and introns (Fig. 3a,b). For example, 1% and 32–38% of the platform-specific SNVs were associated with exonic and intronic regions, respectively, regardless of the platform. This correlates well with the portions of exons (~1.3%) and introns (~37%) in the whole human genome. Nonetheless, the CG-specific SNVs had a slightly stronger association (14%) with noncoding RNA than the Illumina-specific SNVs (12%) and concordant SNVs (11%). Overall, many platform-specific SNVs lie in RNA coding regions of the human genome, and thus deducing their accuracy is of high importance.

Per

cent

age

of g

enom

e

100a

100

9080

80

7060

60

5040

40Cumulative read depth

3020

20

100

0

CompleteGenomicsIllumina

CompleteGenomicsIllumina

Num

ber

of b

ases

(M

)

b100

100Read depth

150 200

9080706050

50

40302010

00

Figure 1 Genome coverage at different read depths. (a) Percentage of genome covered by different read depths in different platforms. (b) Histogram of genome coverage at different read depths.

Table 1 Whole-genome sequencing using CG and Illumina platformsCG Illumina

Sample Bases (Gb) Coverage (×) Bases (Gb) Coverage (×) Reads Mapped After duplicate removal

Blood 233.2 78 151.4 50 1,499,021,500 1,367,988,241 91% 1,233,937,084 82%Saliva 218.6 73 307.1 102 3,040,306,840 2,614,663,882 86% 2,354,594,740 77%Total 451.8 151 458.5 153 4,539,328,340 3,982,652,123 88% 3,588,531,824 79%

©20

12 N

atu

re A

mer

ica,

Inc.

All

rig

hts

res

erve

d.

ANALYS I S

NATURE BIOTECHNOLOGY VOLUME 30 NUMBER 1 JANUARY 2012 79

Extensive differences in variant callingWe sought to compare the sensitivity and accuracy of each platform for SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique SNVs were concordant—that is, either a homozygous or heterozygous SNV was detected at the same locus by the two platforms in at least one sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the other but not both, of which 345,100 were specific to Illumina (10.5% of the Illumina combined SNVs) and 99,578 were CG-specific (3.0% of the CG combined SNVs). Among the Illumina-specific SNVs, 67% were ‘no-calls’ (that is, not a reference or variant call), 11% were reference calls and 22% were other types of calls (that is, complex and substitution calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of no-calls in Illumina is likely because GATK does not make the complex and substitution calls as does the CG pipeline.

To assess the quality of the calls, we used four criteria: the transition/transversion ratio (ti/tv), quality scores, the heterozygous/homozygous call ratio and novel, platform-specific SNVs. The ti/tv ratio of 2.1 for SNVs in humans has been described in several previous studies, including the 1000 Genomes Project11. The ti/tv ratio for all of the SNVs detected in these genomes was 2.04, but in our data the ti/tv of SNVs concordant between the two platforms was 2.14. For all SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs specific to Illumina it was only 1.40. Similarly, for SNVs detected by CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the ti/tv of concordant SNVs was very close to that expected, whereas the platform-specific ti/tv was much lower, suggesting that the platform-specific calls were of lower accuracy. Inspection of the quality scores of the platform-specific SNVs showed that they were indeed lower than those for the concordant calls (Supplementary Fig. 1). Furthermore, the heterozygous/homozygous call ratio was 1.48 for the concordant calls, whereas the platform-specific ratios were indeed higher: 2.48 for Illumina-specific calls and 1.98 for CG-specific calls.

To examine the fraction of novel platform-specific SNVs, we noted that 3,160,905 (96.0%) of the concordant SNVs were present in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the CG-specific set were present in dbSNP131. Thus, the platform-specific call sets were enriched for novel SNVs, suggesting that they likely contain more errors. In addition, the overall genotype concordance rate (that is, the proportion of concordant calls having a consistent genotype—heterozygous or homozygous—across both platforms) for the concordant SNVs was 98.9%. The high genotype concordance rate and percentage of known SNVs indicate that the concordant SNVs were of high quality and accuracy.

To further assess the accuracy of the variant calling, we sought to vali-date our SNVs by using Omni Quad 1M Genotyping arrays, traditional Sanger sequencing and Agilent SureSelect target enrichment capture followed by sequencing on an Illumina HiSeq for both samples. Of the 260,112 heterozygous calls detected with the Omni array, 99.5% were present in the entire SNV data set, 99.34% were concordant calls and only 0.16% were platform-specific SNVs. This demonstrates that both plat-forms are sensitive to known SNVs and that few known single-nucleotide polymorphisms (SNPs) are detected by only one platform.

To directly determine accuracy, we sequenced randomly selected concordant and platform-specific regions for Sanger sequencing. We found that 20 of 20 concordant SNVs could be validated, whereas 2 of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific SNVs could be validated. This suggests CG has higher accuracy than Illumina and that almost all the concordant calls are correct.

To attempt to examine accuracy on a larger scale, we used Agilent SureSelect target enrichment capture technology to capture 33,084 (9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%) concordant SNVs for sequencing on an Illumina Hi-Seq instrument (Table 2). We found that the validation rate for the concordant SNVs was 92.7%, whereas the validation rate was 61.9% and 64.3% for the CG-specific and Illumina-specific SNVs. These results indicate that the platform-specific calls have a very high false-positive rate of at least 35%. We also found that 12.6–21.4% of the targeted SNVs were not called in the validation, possibly owing to nonunique regions that are difficult to map precisely. Because the capture validation was performed using Illumina DNA sequencing technology, it is diffi-cult to directly compare the Illumina versus CG SNV rates with this approach. Nonetheless, these overall results indicate that concordant SNVs have high accuracy and platform-specific SNVs have a high false-positive rate.

Association of genes with variant calling differencesTo better understand the platform-specific calls, we investigated the association of SNVs from each platform with different genomic elements. We annotated both the platform-specific SNVs and con-cordant SNVs with gene and repeat annotations using Annovar16. In general, we did not find a significant difference between the associa-tions of the platform-specific SNVs and the concordant SNVs with gene elements, such as exons and introns (Fig. 3a,b). For example, 1% and 32–38% of the platform-specific SNVs were associated with exonic and intronic regions, respectively, regardless of the platform. This correlates well with the portions of exons (~1.3%) and introns (~37%) in the whole human genome. Nonetheless, the CG-specific SNVs had a slightly stronger association (14%) with noncoding RNA than the Illumina-specific SNVs (12%) and concordant SNVs (11%). Overall, many platform-specific SNVs lie in RNA coding regions of the human genome, and thus deducing their accuracy is of high importance.

Pe

rce

nta

ge

of

ge

no

me

100a

100

9080

80

7060

60

5040

40Cumulative read depth

3020

20

100

0

CompleteGenomicsIllumina

CompleteGenomicsIllumina

Nu

mb

er

of

ba

ses

(M)

b100

100Read depth

150 200

9080706050

50

40302010

00

Figure 1 Genome coverage at different read depths. (a) Percentage of genome covered by different read depths in different platforms. (b) Histogram of genome coverage at different read depths.

Table 1 Whole-genome sequencing using CG and Illumina platformsCG Illumina

Sample Bases (Gb) Coverage (×) Bases (Gb) Coverage (×) Reads Mapped After duplicate removal

Blood 233.2 78 151.4 50 1,499,021,500 1,367,988,241 91% 1,233,937,084 82%Saliva 218.6 73 307.1 102 3,040,306,840 2,614,663,882 86% 2,354,594,740 77%Total 451.8 151 458.5 153 4,539,328,340 3,982,652,123 88% 3,588,531,824 79%

I!&

!"0&/=&"9+&)23)&

Page 24: 01/$&)234& - ut

^"$%"]-6&-L&7-./$"(/&T%=>&?\&7-6=/6=&&

previous section, Newbler handles the 454 data in flow-space andwe used Newbler version 2.3 to call variants in it (See Methods).While the three sequencing approaches resulted in a similarnumber of single substitution variants, namely 4,331,131 variantsfor the 454 data, 4,691,363 variants using the Illumina data, and4,145,208 variants using the SOLiD sequences, the combinationresulted in a total of 5,252,985 potential variant locations.However, only a common set of 3,401,954 variant locations wasshared between all three technologies, whereas only one or two ofthe platforms supported the remaining 1,851,031 locations(Figure 3a). As for indels, we found 614,794 indels using the 454data, 554,138 small indels using the Illumina data and 303,937potential indels using the SOLiD data.

Reasons for Failure of Detection of SubstitutionsThe 1,851,031 discrepant locations between 454 FLX, Illumina

HiSeq and SOLiD 3 sequences allow for the study the falsepositive and the false negative rates for the obtained variant calls.Considering the SNP calls from each platform as independentevidence, we use the locations where two platforms agree to studythe false negatives for the third one. This allows us to quantify andunderstand the reasons why this SNP was not called in the datasetsequenced using the third platform. Similarly, at the locationswhere only one platform calls a variant, we identify and study thefalse positives for that platform, barring exceptions that areexplained later in the text.

Figure 2. Variation of coverage with GC content in the three sequencing technologies. The red line shows the mean coverage across thewhole genome. Each point on the plot reflects the mean coverage and fraction of GC content in 50 kbp non-overlapping window. The y-axis showsthe coverage whereas the x-axis shows the fraction of C, G nucleotides in the window. This does not include secondary alignments and potential PCRduplicates.doi:10.1371/journal.pone.0055089.g002

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 3 February 2013 | Volume 8 | Issue 2 | e55089

H"="6&/=&"9+&)234&

Page 25: 01/$&)234& - ut

^"$%"]-6&-L&7-./$"(/&T%=>&?\&7-6=/6=&&

previous section, Newbler handles the 454 data in flow-space andwe used Newbler version 2.3 to call variants in it (See Methods).While the three sequencing approaches resulted in a similarnumber of single substitution variants, namely 4,331,131 variantsfor the 454 data, 4,691,363 variants using the Illumina data, and4,145,208 variants using the SOLiD sequences, the combinationresulted in a total of 5,252,985 potential variant locations.However, only a common set of 3,401,954 variant locations wasshared between all three technologies, whereas only one or two ofthe platforms supported the remaining 1,851,031 locations(Figure 3a). As for indels, we found 614,794 indels using the 454data, 554,138 small indels using the Illumina data and 303,937potential indels using the SOLiD data.

Reasons for Failure of Detection of SubstitutionsThe 1,851,031 discrepant locations between 454 FLX, Illumina

HiSeq and SOLiD 3 sequences allow for the study the falsepositive and the false negative rates for the obtained variant calls.Considering the SNP calls from each platform as independentevidence, we use the locations where two platforms agree to studythe false negatives for the third one. This allows us to quantify andunderstand the reasons why this SNP was not called in the datasetsequenced using the third platform. Similarly, at the locationswhere only one platform calls a variant, we identify and study thefalse positives for that platform, barring exceptions that areexplained later in the text.

Figure 2. Variation of coverage with GC content in the three sequencing technologies. The red line shows the mean coverage across thewhole genome. Each point on the plot reflects the mean coverage and fraction of GC content in 50 kbp non-overlapping window. The y-axis showsthe coverage whereas the x-axis shows the fraction of C, G nucleotides in the window. This does not include secondary alignments and potential PCRduplicates.doi:10.1371/journal.pone.0055089.g002

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 3 February 2013 | Volume 8 | Issue 2 | e55089

H"="6&/=&"9+&)234&

Page 26: 01/$&)234& - ut

,#01/$;&-L&A/=/7=/A&."$%"6=;&

•  UWUP&UQ443Q343&

•  I99#0%6"P&UQX*3Q4X4&

•  'F!%JP&UQ3UWQ)2V&

•  G-="9&7-01%6/AP&WQ)W)Q*VW&&

•  '>"$/A&1/=T//6&=>$//P&4QU23Q*WU&_XU+VY`&

H"="6&/=&"9+&)234&

Page 27: 01/$&)234& - ut

,#01/$;&-L&A/=/7=/A&."$%"6=;&n&W+)W&K&=-="9&

The reasons why a SNP is not detected by one sequencingtechnology, whereas it is reported by another, can be broadlydivided into three categories:

N Issues related to coverage: These can be furthersubdivided into complete lack of coverage, low coverage(which is not enough to call a SNP based on predefinedcriteria), and higher-than-expected coverage (based on amodel used to separate SNPs from structural variants andassembly errors) at the candidate location.

N Issues with the alternate allele: Most software tools(including SAMtools and Newbler) require observing thealternate allele at least twice or more, before they consider thelocation as a potential variant. These can be further subdividedinto instances where the alternate allele is not seen at all andothers, when the alternate allele is not seen a sufficient numberof times.

N Issues with the variant calling: These refer to thesituations where the alternate allele is seen a requisite numberof times, but the SNP is not called due to other reasons. Thesereasons may include proximity to many other SNPs, proximityto a high quality indel, existence in a non-uniquely alignableregion, and a huge deviation from the expected diploidbehavior of the sample for the data aligned using BWA. Forthe reads aligned using Newbler, the reasons include thelocation being in a non-uniquely alignable region and otheralignment errors that arise due to the unique error-profile ofthe 454 reads.

We investigated the alignments at the 439,122 locations thatwere called as putative variants by using 454 and Illuminasequences, but not using SOLiD sequences (Figure 4a i). Weassigned each location to a particular category based on the reasonwhy it was not called a SNP. We found that the variant allele wasobserved in the SOLiD reads in 64% of these cases, but the SNPwas filtered away for various reasons. 27% of the locations werefiltered away due to a low SNP quality (defined as the Phred-scaledlikelihood that the called genotype is identical to the reference),18% of them were filtered away due to a low RMS (root meansquare) mapping quality (reflecting the limitation of shorter reads)and another 19% were filtered away as the variant allele was notseen enough number of times. Coverage related issues (nocoverage, too little coverage or more than expected coverage)

were responsible for another 19% of the locations. The alternateallele was not seen at all, despite adequate coverage at the site, forthe remaining 17% locations.For the 71,567 locations that were called using the SOLiD

sequences (but not by others), we looked at the alignments for boththe 454 dataset and the Illumina datasets. At about 15% of theselocations (Figure 4a ii), the alternate allele was seen just once in the454 dataset and at about another 16% of them, the coverage of454 reads was not enough to call a SNP. For another 21% of thelocations the SNP was not called by Newbler, even though theallele was seen multiple times in the pairwise alignments betweenthe reference and the 454 reads, with most of them beingassociated with homopolymer errors. On the other hand at 25% ofthese locations the SNP was seen in the Illumina dataset (Figure 4aiii), but it was filtered away due to a lower SNP quality (15%), orbecause lower mapping quality (9%). Another 14% of theselocations did not have sufficient coverage with Illumina reads to beconsidered in SNP calling. Considering the locations where both454 and Illumina had little, no, or higher than expected coverage,and where the alternate allele was seen at least once in either 454or Illumina dataset as true SNPs, we expect 14,707 of the 71,567locations to be false-positives for the SOLiD calls.When we looked at the 47,381 locations that were called a SNP

using 454 and SOLiD reads, we found that primary reason (at60% of the locations) these were not called a SNP with Illuminareads had to do with the coverage (Figure 4b i). 57% of thelocations were in regions where the coverage was more thanexpected (signaling a putative structural variant), whereas therewas little of no coverage for the remaining 3%. We used a Poissondistribution with the same mean value to calculate the coveragethreshold to filter variants, but this data suggests that a gammadistribution with more weight on more tails is probably a bettermodel for Illumina data. The second largest contributor was lowSNP quality (22% of the locations), which is the result of anobserved deviation from the expectation that both allele should beseen approximately the same number of times on a heterozygouslocation.We found 225,981 locations that were called as putative variants

using Illumina reads only. Looking at the alignments for theSOLiD reads at those locations (Figure 4b ii), we found that for22% of them we saw the alternate allele a sufficient number oftimes, but it was filtered away either due to low RMS mappingquality or a low SNP quality. Another 16% of the locations were

Figure 3. Venn diagram showing the overlap in the SNP calls made using data from the three sequencing technologies. We displaythe sizes of each of the seven categories of overlaps among the variant calls in the three technologies. (a) depicts the overlaps when all substitutioncalls are used, (b) depicts the overlaps when all calls from Illumina and SOLiD are used but only the high-confidence subset of the 454 dataset is used,and (c) depicts the overlaps when only the variants in the uniquely alignable regions of the reference sequence are used.doi:10.1371/journal.pone.0055089.g003

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 4 February 2013 | Volume 8 | Issue 2 | e55089

H"="6&/=&"9+&)234&

Page 28: 01/$&)234& - ut

•  l;/9L[0";h%6(S&=-&%A/6]LB&=>/&$/(%-6;&%6&=>/&$/L/$/67/&(/6-0/&T>/$/&$/"A;&;>-#9A&"9%(6&#6%5#/9B&n&!<'Go&&

•  $/L/$/67/&(/6-0/&1$-h/6&%6=-&L$"(0/6=;&-L&9/6(=>&/5#"9&=-&=>/&9/6(=>&-L&=>/&$/"A;&&

•  i$"(0/6=;&0"88/A&1"7h&=-&$/L/$/67/&

•  \-6;%A/$%6(&-69B&#6%5#/9B&0"88"19/&$/(%-6;&

•  ^/$B&#;/L#9&L-$&UWU&A"="&

•  !/;;&;-&L-$&I99#0%6"&"6A&'F!%J&

H"="6&/=&"9+&)234&

Page 29: 01/$&)234& - ut

^"9%A"]-6&#;%6(&'/5#/6-0&K";;&'8/7=$-0/=$B&&

than one platform was higher when compared to the rate from theindividual platforms, e.g. the validation rate for variants supportedby 454 and Illumina was 78% whereas the validation rate forvariants supported only by 454 was 50% and the validation ratefor variants supported only by Illumina was 69%. The validationrate for SNPs called using 454 and SOLiD was 80% and it was89% for variants called by Illumina and SOLiD. The longer readsgenerated by 454 resulted in some SNPs with 50 bp flanks (asrequired by the Sequenom assay) in repeat regions that have ahigher dimer/hairpin potential when compared to the flanks forSNPs in the other two platforms. As a result, we saw significantlymore primer design and assay failures for the 454 derived SNPswhen compared to SNPs in the two short read technologies.Furthermore, the number of assays and primers that failed wasalso notably higher for locations that had been called by only oneof the sequencing technologies.

Discussion

With the ultimate goal to sequence and analyze a humangenome as fast and cost-effective as possible, it is highly desirableto simplify sampling and library preparation steps, but also tocarry out DNA-sequencing on a single technology platform. Infact, with the availability of a high quality reference genome, re-sequencing with ‘‘low-cost per base’’ technologies has been at thecenter of recent efforts of most academic and commercialsequencing endeavors. However, with hundreds of humangenomes currently being deciphered, the question of completenessrelative to today’s technological possibilities has become secondaryto other pursuits. Faced with increasing evidence that humangenomics does not provide allelic association to critical humanphenotypes at predicted rates, the question is being raised whethermissed genetic variants might be causal due to unmapped regions

of the genome. Further, understanding interactions of so farunrecognized genetic variations with small molecule drugs is ofincreasing interest to pharmaceutical and diagnostic industry. Ouranalysis on the false negative rate of SNP identification suggeststhat a significant number of variants are not reported using a singlesequencing platform, limiting insights which could enable a morecomplete understanding of the human genome. On the otherhand, false positives from a single sequencing technology lead to acontinuous degradation of the quality of SNP databases via theinclusion of non-existing genetic variants. Using a non-sequencingapproach we report that the validation of SNP calls is significantlyimproved for variants that are reported using more than a singlesequencing platform.We realize that every lab and sequencing platform has a ‘‘best

practice’’ protocol from sample-prep to the bioinformaticspipeline, and that all the steps heavily influence any conclusionthat one attempts to derive in such an experiment. In this study weshow evidence that each of the three applied sequencingtechnologies contributes, at the respective coverage of eachdataset, an additional unique 71,000 to 443,000 variants (1.4–8.9%) of the total of 5 million found in the human individual KB1.Remarkably, at least 1.4% of these technology-dependent variantswould have gone unnoticed, even if the genome were sequencedon two of the three platforms. Furthermore as evident usingSequenom mass spectrometry, the validation rates for variantsusing two platforms are improved by more than 10% over thevalidation rates for the individual platforms. It therefore seemshighly beneficial to sequence at least reference genomes from eachgeographic region by the multi-platform approach presented here.A larger number of multi-platform human reference genomeswould not only minimize systematic technological biases, but alsoreduce ethnical biases from today’s human genome referencesequence.In this comparative study we looked at nine factors that affect

the ability of a sequencing platform to accurately call variants inthe human genome (see Figure 4 a–c). That include coveragerelated issues, allele related issues where the alternate allele wasnot seen multiple times and SNP related issues where the SNPswere filtered away in an attempt to reduce the false-positives.Among all the factors impacting the probability of a variant to becalled, one of the most important one is the unbiased distributionof reads across a genome. As shown in Figure 2, the Roche/454data is much more uniformly aligned, independent from the GCcontent of the genome than the short read platforms, resulting incomparable number of variant calls despite a much lower overallcoverage. This trend however is counteracted by the 454 Rochespecific error model that is prone to inaccurately assess the lengthof a given homopolymer, leading to potential false-positive SNPscalls, as shown by calls made only by 454 sequencing (Figure 3a).In addition to the platform specific errors, systematic errors areintroduced through the varying algorithms and parameters used inthe process of variant calling. This becomes apparent when theRoche/454 dataset is reanalyzed to include only the ‘‘high-confidence’’ variants, which reduces the total number of detectedNewbler SNPs by 21%. The Roche/454 specific SNPs arereduced this way by more than 51% (226,879 calls, down from442,674) (Figure 3b), however at the cost of increasing potentialfalse negative prediction, as is apparent in a largely increasednumber of calls that are seen only by the Illumina and SOLiDplatform (1,447,828 calls, up from 624,306). In order to avoid theconflict of Newbler’s two modes of that either over-predicts false-negative or under-predicts true positive variants, we calculated theuniquely mappable region of the human genome for each datasetspecific on the platform’s read length (Roche/454 length 350 bp,

Figure 5. SNP Validation using Mass spectroscopy. Validation of300 putative SNP locations from each of the six sets of SNP calls inFigure 3a, where not all three technologies agree on the computedgenotype. The categories on x-axis are ‘‘454’’ (SNPs called by 454 only),‘‘Illumina’’ (SNPs called by Illumina only), ‘‘SOLiD’’ (SNPs called by SOLiDonly), ‘‘454 & Illumina’’ (SNPs called by 454 and Illumina), ‘‘454 & SOLiD’’(SNPs called by 454 and SOLiD), ‘‘Illumina & SOLiD’’ (SNPs called byIllumina and SOLiD). The color categories include ‘‘Primer Failure’’(Primer extension failure), ‘‘Assay Failure’’ (Assay Failure), ‘‘Validated’’and ‘‘Not Validated’’.doi:10.1371/journal.pone.0055089.g005

Comparison of Sequencing Platforms

PLOS ONE | www.plosone.org 7 February 2013 | Volume 8 | Issue 2 | e55089

H"="6&/=&"9+&)234&

Page 30: 01/$&)234& - ut

@-;=/$&"=&<'R?&)234&%6&a-;=-6&

\-089/=/&?/6-0%7;&.;+&I99#0%6"&

\-67-$A"67/&-L&1";/&7"99;&%6&8-=/6]"9&',@&8-;%]-6;&

!"#$%&'""(3Q)Q&M9.%&?/$;=&G"9";3Q&K"$%-&K%p3Q4Q&H//A%h&Kq(%4Q&'%0-6&H";0#;;/6UQ&H%7>"$A&^%99/0;3Q)&"6A&K"%=&K/=;8"9#)&

)59N217?&A64O=/J&P14@&17&75<QR?575/4S=7&@5TN57217?0&&

Page 31: 01/$&)234& - ut

-QN98&95@1?70&

•  ?/6-0/;&-L&c&%6A%.%A#"9;&;/5#/67/A&-6&1-=>&89":-$0;+&

•  \-67-$A"67/&0/";#$/0/6=;&8$/;/6=/A&L-$&U&;"089/;&T%=>&(--A&"6A&;%0%9"$&;/5#/67/&;="];]7;+&

•  \?P&"./$"(/&7-./$"(/&U2dQ&8$-7/;;/A&T%=>&\?&8%8/9%6/+&

•  I99#0%6"P&"./$"(/&7-./$"(/&)XdQ&0"88%6(&T%=>&ab<Q&0#9];"089/&7"99%6(&T%=>&'<K=--9;+&

Page 32: 01/$&)234& - ut

-QN98&95@1?70&

•  @-;%]-6;&;/9/7=/A&T>/$/&',@&-77#$$/A&%6&"=&9/";=&-6/&%6A%.%A#"9&%6&"&A"=";/=&-L&)U*&>#0"6;&L$-0&."$%-#;&8-8#9"]-6;&T-$9AT%A/&_;/5#/67/A&1B&\?`+&'%=/;&T%=>&-=>/$&=B8/;&-L&0#="]-6;&/d79#A/A+&

•  i$-0&=>%;&9%;=&-L&7"+&W2&0%99%-6&',@&;%=/;Q&T/&;/9/7=/A&=>-;/&L-$&T>%7>&%6L-$0"]-6&_/%=>/$&$/L/$/67/Q&."$%"6=&-$&6-[7"99`&T";&"."%9"19/&L$-0&1-=>&89":-$0;&"6A&"99&;"089/;+&

•  +3N@U&=N/&2=JA4/1@=7&1@&P4@59&=7&Q35&2466&2=72=/94725&17&A/595B759&@1Q5@U&479&7=Q&=7&Q35&=>5/64A&=V&>4/147QR=768&94Q40&IL&"&;%=/&%;&A%;7-$A"6=&1/=T//6&89":-$0;Q&%=&"88/"$;&";&8$%."=/&L-$&1-=>+&

Page 33: 01/$&)234& - ut

'71S46&B6Q5/17?0&&

•  \?P&^eRI?R&_@<''`&O9=/$+&

•  I99#0%6"P&H/"A&A/8=>&0%6&32Q&0"d&322f&0"9/;&kQ&E&7>$-0-;-0/;&0%6&WQ&0"d&W2+&W4/147Q@&17&/5A54Q@&/5J=>590&

•  <01%(#-#;&1";/;&_,`&%6&7-08"$/A&O9/;&0"B&1/&/%=>/$&6-[7"99;&-$&O9=/$/A+&

Page 34: 01/$&)234& - ut

'71S46&B6Q5/17?0&&

•  \?P&^eRI?R&_@<''`&O9=/$+&

•  I99#0%6"P&H/"A&A/8=>&0%6&32Q&0"d&322f&0"9/;&kQ&E&7>$-0-;-0/;&0%6&WQ&0"d&W2+&W4/147Q@&17&/5A54Q@&/5J=>590&

•  <01%(#-#;&1";/;&_,`&%6&7-08"$/A&O9/;&0"B&1/&/%=>/$&6-[7"99;&-$&O9=/$/A+&

%1@J4AA17?&A=@1S=7@&B6Q5/0&

•  b/&0";h/A&=>/&A"="&T%=>&T%6A-T;&%6&=>/&(/6-0/&T>/$/&0"88%6(&-L&I99#0%6"&$/"A;&-r/6&L"%9;+&

•  <889%7"1%9%=B&-L&=>/;/&19"7h9%;=;&=-&\?&A"="&T";&6-=&h6-T6+&

Page 35: 01/$&)234& - ut

•  G-="9&UV+X&K&;%=/;Q&4*&K&#6"01%(#-#;&_6-=&,`&%6&1-=>&89":-$0;+&

•  )*+)&K&;%=/;&7-67-$A"6=&%6&"99&U&;"089/;Q&)V+V&K&#6"01%(#-#;+&

•  *>5/4?5@&=>5/&M&@4JA65@Q&(%./6&L-$&#6"01%(#-#;&A%;7-$A"67/;&_8/$7/6=&$"6(/;&s2+3`&

,!D'C&\-67-$A"6=P&;"0/&1";/Q&;"0/&jB(-7%=BQ&6-=&,&%6&

/%=>/$&89":-$0;+&I6%]"9&O9=/$;P&_4*+)&K&=-="9`&_3+W&K&,;`&

4c+c&K&#6"01%(+&H:UFG&

K%;0"88%6(&O9=/$P&_)V+c&K&=-="9`&

_3&K&,;`&)c+c&K&#6"01%(+&

H:UKG&

,!&8$%."=/&

I6%]"9&O9=/$;P&_*+U&K&=-="9`&_3+c&K&,;`&

3+4&K&#6"01%(+&I0IG&

K%;0"88%6(&O9=/$P&

_X+V&K&=-="9`&_2+VV&K&,;`&

2+*3&K&#6"01%(+&

I0EG&

'C&8$%."=/&

I6%]"9&O9=/$;P&_*+U&K&=-="9`&_X+W&K&,;`&3+4&K&#6"01%(+&I0IG&

K%;0"88%6(&O9=/$P&_X+V&K&=-="9`&_W&K&,;`&2+*3&K&#6"01%(+&I0EG&

Page 36: 01/$&)234& - ut

,!D'C&\-67-$A"6=P&;"0/&1";/Q&;"0/&jB(-7%=BQ&6-=&,&%6&

/%=>/$&89":-$0;+&I6%]"9&O9=/$;P&_4*+)&K&=-="9`&_3+W&K&,;`&

4c+c&K&#6"01%(+&H:UFG&

K%;0"88%6(&O9=/$P&_)V+c&K&=-="9`&

_3&K&,;`&)c+c&K&#6"01%(+&

H:UKG&

,!&8$%."=/&

I6%]"9&O9=/$;P&_*+U&K&=-="9`&_3+c&K&,;`&

3+4&K&#6"01%(+&

I0IG&

K%;0"88%6(&O9=/$P&

_X+V&K&=-="9`&_2+VV&K&,;`&

2+*3&K&#6"01%(+&

I0EG&

'C&8$%."=/&

I6%]"9&O9=/$;P&_*+U&K&=-="9`&_X+W&K&,;`&3+4&K&#6"01%(+&I0IG&

K%;0"88%6(&O9=/$P&_X+V&K&=-="9`&_W&K&,;`&2+*3&K&#6"01%(+&I0EG&

Page 37: 01/$&)234& - ut

•  X==/RJ4AA17?&Y179=Y@&=768&J171J4668&/59N25&Q35&P14@0&

•  G>/&1%";&%6&=>/;/&8-;%]-6;&%;&6-=&($/"=/$&=>"6&/9;/T>/$/Q&;#((/;]6(&=>"=&%L&/$$-$;&"$/&%6A//A&8$/;/6=&%6&=>/;/&8-;%]-6;&%6&-#$&I99#0%6"&A"="&=>/6&=>/&7#$$/6=&./$;%-6&-L&-#$&T%6A-T;&0"B&"9;-&"889B&=-&\?+&

Page 38: 01/$&)234& - ut

,=72=/947Q&A=@1S=7@&B6Q5/&479&4991S=746&I&@4JA65@0&

•  )*+)&K&;%=/;&T/$/&7-67-$A"6=&%6&"99&U&;"089/;+&

•  b/&#;/A&=>/;/&";&"&T>%=/9%;=&=-&O9=/$&=>/&;/5#/67/;&-L&-=>/$&4&;"089/;&=>"=&>"A&9-T/$&"./$"(/&5#"9%=B&"6A&9"$(/&6#01/$;&-L&"01%(#-#;&;%=/;+&

Page 39: 01/$&)234& - ut

,=72=/947Q&A=@1S=7@&B6Q5/&479&4991S=746&I&@4JA65@0&

•  )*+)&K&;%=/;&T/$/&7-67-$A"6=&%6&"99&U&;"089/;+&

•  b/&#;/A&=>/;/&";&"&T>%=/9%;=&=-&O9=/$&=>/&;/5#/67/;&-L&-=>/$&4&;"089/;&=>"=&>"A&9-T/$&"./$"(/&5#"9%=B&"6A&9"$(/&6#01/$;&-L&"01%(#-#;&;%=/;+&

•  $7&4>5/4?5U&V/=J&EE0I&%&N74JP1?N=N@&@1Q5@&17&P=Q3&A64O=/J@&LK0L&/5J41759&ZLHG[&479&Q35&P14@&9/=AA59&Q=&L0MG&V/=J&I0MG0&

•  .5@A1Q5&6=Y5/&TN461Q8U&Q35&4991S=746&I&@4JA65@&@3=Y59&@Q4P65&479&>5/8&@1J164/&P14@&A5/257Q4?5&4@&Q35&=Q35/&V=N/\&4+UY&"r/$&%6%]"9&O9=/$%6(Q&4+4Y&T%=>&0%;0"88%6(&O9=/$&"6A&4+3Y&T%=>&1%";/A&8-;%]-6;&O9=/$+&