evaluating the quality of the 1000 genomes project data · 10 abstract data from the 1000 genomes...
TRANSCRIPT
*Correspondence: [email protected]
Evaluatingthequalityofthe1000Genomes1
Projectdata 2
SaurabhBelsare1,MichalSakin-Levy2,YuliaMostovoy2,SteffenDurinck3,SubhraChaudhry3,MingXiao4,Andrew3
S.Peterson3,Pui-YanKwok1,2,5,SomasekarSeshagiri3andJeffreyD.Wall1,6,*4 1InstituteforHumanGenetics,UniversityofCalifornia,SanFrancisco,SanFrancisco,CA,94143,USA,2DepartmentofDermatology,5
UniversityofCalifornia,SanFrancisco,SanFrancisco,CA,94143,USA,3DepartmentofMolecularBiology,GenentechInc.,1DNAWay,6 SouthSanFrancisco,CA,94080,USA,4SchoolofBiomedicalScience,Engineering,andHealthSystems,DrexelUniversity,Philadelphia,7
PA,19104,USA,5CardiovascularResearchInstitute,SanFrancisco,SanFrancisco,CA,94143,USA,6DepartmentofEpidemiologyand8 Biostatistics,UniversityofCalifornia,SanFrancisco,SanFrancisco,CA,94143,USA9
ABSTRACTDatafromthe1000Genomesprojectisquiteoftenusedasareferenceforhumangenomic10
analysis.However,itsaccuracyneedstobeassessedtounderstandthequalityofpredictionsmadeusingthis11
reference.Wepresenthereanassessmentofthegenotype,phasing,andimputationaccuracydatainthe100012
Genomesproject.Wecomparethephasedhaplotypecallsfromthe1000Genomesprojecttoexperimentally13
phasedhaplotypesfor28ofthesameindividualssequencedusingthe10XGenomicsplatform.Weobserve14
thatphasingandimputationforrarevariantsareunreliable,whichlikelyreflectsthelimitedsamplesizeof15
the1000Genomesprojectdata.Further,itappearsthatusingapopulationspecificreferencepaneldoesnot16
improvetheaccuracyofimputationoverusingtheentire1000Genomesdatasetasareferencepanel.We17
alsonotethattheerrorratesandtrendsdependonthechoiceofdefinitionoferror,andhenceanyerror18
reportingneedstotakethesedefinitionsintoaccount. 19
INTRODUCTION 20
The1000GenomesProject(1KGP)wasdesignedtoprovideacomprehensivedescriptionofhumangeneticvariation21
throughsequencingmultipleindividuals1-3.Specifically,the1KGPprovidesalistofvariantsandhaplotypesthatcanbe22
usedforevolutionary,functionalandbiomedicalstudiesofhumangenetics.Overthethreephasesofthe1KGP,atotalof23
2504individualsacross26populationsweresequenced.Thesepopulationswereclassifiedinto5majorcontinental24
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
2
groups:Africa(AFR),America(AMR),Europe(EUR),EastAsia(EAS),andSouthAsia(SAS).The1KGPdatawasgenerated25
usingacombinationofmultiplesequencingapproaches,includinglowcoveragewholegenomesequencingwithmean26
depthof7.4X,deepexomesequencingwithameandepthof65.7X,anddensemicroarraygenotyping.Inaddition,asubset27
ofindividuals(427)includingmother-father-childtriosandparent-childduosweredeepsequencedusingtheComplete28
Genomicsplatformatahighcoveragemeandepthof47X.Theprojectinvolvedcharacterizationofbiallelicand29
multiallelicSNPs,indels,andstructuralvariants.30
Giventhelowdepthof(sequencing)coverageformost1KGPsamples,itisunclearhowaccuratetheimputedhaplotypes31
are,especiallyforrarevariants.Wequantifythisaccuracydirectlybycomparingimputedgenotypesandhaplotypes32
basedonlow-coveragewhole-genomesequencedatafromthe1KGPwithhighlyaccurate,experimentallydetermined33
haplotypesfrom28ofthesamesamples.Additionalmotivationforourstudyisgivenbelow. 34
PhasingItisimportanttounderstandphaseinformationinanalyzinghumangenomicdata.Phasinginvolvesresolving35
haplotypesforsitesacrossindividualwholegenomesequences.Theterm’diplomics’4hasbeencoinedtodescribe36
"scientificinvestigationsthatleveragephaseinformationinordertounderstandhowmolecularandclinicalphenotypesare37
influencedbyuniquediplotypes".Thediplotypeshowseffectsinfunctionanddiseaserelatedphenotypes.Multiple38
phenomenalikeallele-specificexpression,compoundheterozygosity,inferringhumandemographichistory,and39
resolvingstructuralvariantsrequiresanunderstandingofthephaseofavailablegenomicdata.Phasedhaplotypesare40
alsorequiredasanintermediatestepforgenotypeimputation. 41
Phasingmethodscanbecategorizedintomethodswhichuseinformationfrommultipleindividualsandthosewhichrely42
oninformationfromasingleindividual5.Theformerareprimarilycomputationalmethods,whilethelatteraremostly43
experimentalapproaches.Somecomputationalapproachesuseinformationfromexistingpopulationgenomicdatabases44
andcanbeusedforphasingmultipleindividuals.These,however,maybeunabletocorrectlyphaserareandprivate45
variants,whicharenotrepresentedinthereferencedatabaseused.Ontheotherhand,somemethodsuseinformation46
fromparentsorcloselyrelatedindividuals.ThesehavetheadvantageofbeingabletouseIdentical-By-Descent(IBD)47
information,andallowlongrangephasing,butrequiresequencingofmoreindividuals,whichaddstothecost.Afew48
methodswhichusetheseapproachesare:PHASE6,fastPHASE7,BEAGLE8-9,SHAPEIT10-11,EAGLE12-13andIMPUTEv214. 49
Experimentalphasingmethods,ontheotherhand,ofteninvolveseparationofentirechromosomesfollowedby50
sequencingofshortsegments,whichcanthenbecomputationallyreconstructedtogenerateentirehaplotypes.These51
methodsdonotneedinformationfromindividualsotherthantheonebeingsequenced.Thesemethodsinvolve52
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
3
genotypingbeingperformedseparatelyfromphasing.Thesemethodsfallintotwobroadcategories,namelydenseand53
sparsemethods14.Densemethodsresolvehaplotypesinsmallblocksingreatdetail,whereallvariantsinaspecificregion54
arephased.However,theydonotinformthephaserelationshipbetweenthehaplotypeblocks.Theseinvolvediluting55
highmolecularweightDNAfragmentssuchthatfragmentsfromatmostonehaplotypearepresentineachunit.Sparse56
methodscanresolvephaserelationshipsacrosslargedistances,butmaynotinformonthephaseofeachvariantina57
chromosome.Inthesemethods,alownumberofwholechromosomesiscompartmentalizedsuchthatonlyoneofeach58
pairofhaplotypesispresentineachcompartment.Thesecompartmentalizationsarefollowedbysequencingtogenerate59
thehaplotypes. 60
Inthiswork,weusephasedhaplotypesgeneratedusingthe10XGenomicsmethodwhichuseslinked-readsequencing15.61
1nanogramofhighmolecularweightgenomicDNAisdistributedacross100,000droplets.ThisDNAisbarcodedand62
amplifiedusingpolymerase.ThistaggedDNAisreleasedfromthedropletsandundergoeslibrarypreparation.These63
librariesareprocessedviaIlluminashort-readsequencing.Acomputationalalgorithmisthenusedtoconstructphased64
haplotypesbasedonthebarcodes. 65
ImputationImputationinvolvesthepredictionofgenotypesnotdirectlyassayedinasampleofindividuals.66
Experimentallysequencinggenomestoahighcoverageisanexpensiveprocess.Lowcoveragesequencingorarrayscan67
beusedaslow-costmethodsforsequencing.However,thesemethodsmayleadtouncertaintyinestimatedgenotypes68
(lowcoveragesequencing)ormissinggenotypevaluesforuntypedsites(arrays).Imputationcanbeusedtoobtain69
genotypedataformissingpositionsusingreferencedataandknowndataatasubsetofpositionsinindividualswhich70
needtobeimputed.ImputationisusedtoboostthepowerofGWASstudies16,finemappingaparticularregionofa71
chromosome17,orperformingmeta-analysis18,whichinvolvescombiningreferencedatafrommultiplereferencepanels. 72
Imputationusesareferencepanelofknownhaplotypeswithallelesknownatahighdensityofhaplotypedpositions.A73
study/inferencepanelgenotypedatasparsesetofpositionsisusedforsequenceswhichneedtobeimputed.Performing74
imputationinvolvestwobasicsteps: 75
• Phasinggenotypesatgenotypedpositionsinthestudy/inferencepanel 76
• Haplotypesfromtheinferencepanelwhichmatchthoseinthereferencepanelatthepositionsinthestudypanel77
areassumedtomatchinallotherpositions 78
Variousimputationalgorithmsperformthesestepssequentiallyanditerativelyorsimultaneously. 79
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
4
Factorsaffectingthequalityofthephasingandimputationare(1)sizeofreferencepanel(2)densityofSNPsinreference80
panel(3)accuracyofcalledgenotypesinthereferencepanel(4)degreeofrelatednessbetweensequencesinreference81
panelandstudysequences(5)ethnicityofthestudyindividualsincomparisonwiththeavailablereferencedataand(6)82
allelefrequencyofthesitebeingphasedorimputed5. 83
Multiplemethodshavebeendevelopedforgenotypeimputation19.fastPHASE7,MACH20-21,BEAGLE8,22-23,andIMPUTEv21484
aresomewidelyusedmethodsforimputation. 85
AnanalysisoftheimputationaccuracyfortheHapMapprojecthasbeenperformedaboutadecadeago24,butnosimilar86
detailedanalysisexistsforassessingthephasingandimputationofthe1000Genomesproject,particularlycomparingthe87
databaseagainstexperimentallyphasedsequences.Wepresenthereadetailedassessmentofthequalityofphasingand88
imputationforthe1000Genomesdatabase,particularlyasafunctionofminorallelefrequencyandinter-SNPdistances89
forbiallelicSNPs.90
91
MATERIALANDMETHODS 92
InputData 93
ProcessedVCFsweredownloadedfromthe1000Genomeswebsite.Thisdataisavailableforeachchromosome94
separately.Toobtainagreementwiththeexperimentaldata,1000GenomesVCFscorrespondingtotheGRCh38assembly95
weredownloaded.Experimentaldatawassequencedusingthe10XGenomicsplatformfor28individuals:5GM,18HG,96
and5NA.TheGMandNAindividualswereoriginallypartoftheHapMapprojectwhiletheHGarefromthe100097
Genomesproject.ThirteenoftheseindividualswereprocessedatUCSFandsequencedatNovogene,whiletheremaining98
individualswereprocessedandsequencedatGenentech.Thepopulationsfromwhicheachoftheindividualscome(as99
listedintheCoriellCatalog)are: 100
• SouthAsia(SAS): 101
o GujaratiIndiansinHouston,Texas,USA(HapMap)[GIH]-GM21125*,NA20900,NA20902 102
o PunjabiinLahore,Pakistan[PJL]-HG03491,HG03619 103
o SriLankanTamilintheUK[STU]-HG03679,HG03752,HG03838* 104
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
5
o IndianTeluguintheUK[ITU]-HG03968 105
o BengaliinBangladesh[BEB]-HG04153,HG04155 106
• EastAsia(EAS): 107
o HanChineseinBeijing,China(HapMap)[CHB]-GM18552*,NA18570,NA18571 108
o ChineseDaiinXishuangbanna,China[CDX]-HG00851*,HG01802,HG01804 109
o KinhinHoChiMinhCity,Vietnam[KHV]-HG02064,HG02067 110
o JapaneseinTokyo,Japan(HapMap)[JPT]-NA19068* 111
• Africa(AFR): 112
o LuhyainWebuye,Kenya(HapMap)[LWK]-GM19440* 113
o GambianinWesternDivision,TheGambia[GWD]-HG02623* 114
o EsanfromNigeria[ESN]-HG03115* 115
• Europe(EUR): 116
o ToscaniinItalia(TuscansinItaly)(HapMap)[TSI]-GM20587* 117
o BritishfromEnglandandScotland,UK[GBR]-HG00250* 118
o FinnishinFinland[FIN]-HG00353* 119
• America(AMR): 120
o MexicanAncestryinLosAngeles,California,USA(HapMap)[MXL]-GM19789* 121
o PeruvianinLima,Peru[PEL]-HG01971* 122
AsterisksnexttosampleIDsrefertosamplesprocessedatUCSF. 123
Preprocessing1000GenomesData 124
The1000GenomesdatawasseparatedintoindividualandchromosomespecificVCFsusingvcftools25.Further,the125
variantswerefilteredforbiallelicSNPs,phased,filteredforPASS,andindelswereremoved.Theexperimentallyphased126
dataalsohadaverysmallfractionofunphasedSNPs,whichwereremovedbyfilteringwithvcftools.Theanalysiswas127
performedonlyforautosomes. 128
PhasingAnalysis 129
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
6
Thealternate(ALT)allelefrequenciesofalltheSNPsofinterestwereobtainedfromthe1000Genomesdataand130
convertedtominorallelefrequenciestobeabletoanalyzeswitcherrorasafunctionofminorallelefrequencies.The131
filteredSNPsfromtheexperimentaldataweresplitintophasesets,basedonphasesetinformationavailableinthe132
experimentalVCFfiles.Switcherrorwascalculatedbetweentheexperimentaland1000Genomesdataforeachphaseset133
ineachchromosomeofeachindividualfromtheexperimentaldataset.Switcherrorisdefinedaspercentageofpossible134
switchesinhaplotypeorientationusedtorecoverthecorrectphaseinanindividual26orproportionofheterozygouspositions135
whosephaseiswronglyinferredrelativetothepreviousheterozygousposition27.vcftoolsreturnstheswitcherroraswellas136
allpositionsofswitchesoccurringalongthechromosome. 137
SwitchErrorasaFunctionofMinorAlleleFrequencyALTallelefrequencieswereaccessedforeachoftheswitch138
positionsfromthedataandwereconvertedtominorallelefrequencies.Distributionofswitchpositionsasafunctionof139
minorallelefrequencywasplottedforeachchromosomeineachindividual. 140
SwitchErrorasaFunctionofInterSNPDistancePositionsofeachSNPwereaccessedfromthedata.Thenumberof141
intermediateswitcheswerecountedforallpairofSNPs,notonlyconsecutiveSNPs.Ifthenumberofswitchesbetween142
twoSNPswereodd,aswitcherrorwascounted.Thiswasusedtocalculatethedistributionofswitcherrorsasafunction143
ofinter-SNPdistance. 144
145
ImputationAnalysis 146
Theentireimputationanalysisisperformedforeachchromosomeforeachindividual. 147
GenerateRecombinationMapIMPUTEv213makesavailablerecombinationmapsforeachchromosomeusingthe1000148
GenomesdatafortheGRCh37assembly.ArecombinationmapwasobtainedforeachchromosomeforGRCh38bylifting149
overtheGRCh37mapsusingtheliftoversoftware.~8kpositions(0.2%)wereremovedfromtheliftedover150
recombinationmapbecauseliftoverresultedinthembeingintheincorrectorder. 151
GenerateReferencePanelAreferencehaplotypepanelwasgeneratedforallindividualsfromthe1000Genomesdataby152
subsettingittothespecificpopulationofinterest.1000Genomesdatafortheindividualswhichwereexperimentally153
sequencedwasnotincludedinthereferencepanel.vcftoolswasusedtofilterouttheindividualsofinterestfromthe1000154
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
7
Genomesdata.bcftoolswasusedtoconverttheVCFdatatohaps-sample-legendformat.Analternateapproachwasalso155
used,wheretheentire1000Genomesdatawasusedtogenerateareferencehaplotypepanel. 156
GenerateStudyPanelAstudypanelwasgeneratedfortheexperimentallysequencedindividualsselected.Thestudy157
panelisassumedtobegenotypedatpositionscorrespondingtotheIlluminaInfiniumOmni2.5-8array.Arraypositions158
wereliftedoverfromGRCh37toGRCh38usingliftover.1000Genomeshaplotypes(since1000Genomesdatais159
prephased,thestudypanelisalsointheformofhaplotypesratherthangenotypes)forthosepositionsforthose160
individualswereselectedtocreatethestudypanelusingvcftools.FilteredVCFfileswereconvertedtothehaps-sample161
formatusingbcftools. 162
RunImputationMissingpositionsareimputedusingIMPUTEv2.Imputationwasperformedin5Mbwindows.The163
genotypeoutputbyimputationwasconvertedtoVCFformatusingbcftools.VCFsproducedoverallwindowswere164
combinedusingvcf-concat.IMPUTEv2generallyphasesthetypedgenotypedsitesinstudypanel.Thisisfollowedby165
imputationwhichisperformedbyassumingthathaplotypesinthestudypanelthatmatchthehaplotypesinthereference166
panelatthetypedsitesalsomatchintheuntypedsites.IMPUTEv2thenperformsaniterativeprocessperforming167
multipleMonte-Carlostepsalternatingphasingandimputation.Forthisanalysis,however,ashaplotypesfromthe1000168
Genomesprojectweredirectlyusedtogeneratethestudypanel,thephasingstepwasnotperformed. 169
FilterPositionsForonepartoftheanalysis,i.e.estimatingerrorsinthepositionsrepresentedintheexperimentally170
phasedVCFs(henceforthcalledexperimentalSNPs),thepositionsfromthoseVCFswerefilteredfromtheimputeddata171
usingvcftools.ExperimentalgenotypesfromtheexperimentalVCFswereobtainedforeachindividualofinterestusing172
vcftools.SNPswithduplicateentriesineithertheimputedorexperimentaldatawereremoved.Continent-specificallele173
frequencieswereobtainedfortheexperimentalSNPsfromthe1000Genomesdatausingvcftools,tobeabletoanalyze174
switcherrorasafunctionofMinorAlleleFrequencies.Fortheotherpartoftheanalysis,i.e.estimatingerrorsforall175
positionsinthe1000Genomesdata,theallelefractionsweresimilarlyobtainedforalloftheSNPs. 176
ImputationErrorImputationerrorwascomputedasfractionofgenotypesbeingincorrectlyidentified.Imputationerror177
wascomputedforboth,theSNPsintheexperimentaldataandalltheSNPsin1000Genomesdata.Erroriscomputedasa178
functionofminorallelefrequency.Thecontinent-specificminorallelefrequencieswereusedforanalyzingtheimputation179
error.180
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
8
Forallanalysiswhereerrorrateiscomputedasafunctionofthecontinent-specificminorallelefrequency(genotyping181
errorandimputationerror;Figs.1,2,7,8),theminorallelefrequenciesarebinnedasMAF=0.0%,0.0%<MAF<0.2%,182
0.2%<=MAF<0.5%,0.5%<=MAF<1%,1%<=MAF<5%,MAF>=5%.Fortheanalysiswhereall1000Genomesminor183
allelefrequenciesareused(phasingerrorandimputationerrorcomparinguseofmultiplereferencepanels;Figs.3,4,9),184
theminorallelefrequenciesarebinnedintoonlyfivebins,i.e.thereisnoMAF=0.0%bin.Restofthebinsarethesameas185
forthecontinent-specificMAFbins. 186
ExperimentalMethods 187
Samplesprocessing:HMWGenomicDNAwasextractedandconvertedinto10xsequencinglibrariesaccordingtothe188
10XGenomics(Pleasanton,CA,USA)ChromiumGenomeUserGuideandaspublishedpreviously28.Briefly,GEMSwere189
madewith1.25ngHMWtemplategDNA,Master-mixGenomeGelBeadsandpartitioningoilonthemicrofluidicGenome190
Chip.IsothermalincubationoftheGEMs(for3hat30°C;for10minat65°C;storedat4°C)producedbarcodedfragments191
rangingfromafewtoseveralhundredbasepairs.AfterdissolutionoftheGenomeGelBeadintheGEMIlluminaRead1192
sequencingprimer,16bp10xbarcodeand6bprandomprimerarereleased.TheGEMswerethenbrokenandthepooled193
fractionswererecovered.SilaneandSolidPhaseReversibleImmobilization(SPRI)beadswereusedtopurifyandsize194
selectthefragmentsforlibrarypreparation.Libraryprepwasperformedaccordingtothemanufacturer'sinstructions195
describedintheChromiumGenomeUserGuideRevC.Librariesweremadeusing10xGenomicsadapters.Thefinal196
librariescontaintheP5andP7primersusedinIlluminabridgeamplification.Thebarcodedlibrarieswerethenquantified197
byqPCR(KAPABiosystemsLibraryQuantificationKitforIlluminaplatforms).SequencingwasdoneusingIlluminaHiSeq198
4000with2×150paired-endreads.Rawreadswereprocessed,alignedtothereferencegenome,andhadSNPscalledand199
phasedusing10XGenomics’LongRangersoftware(version2.1.1or2.1.6)withthe“wgs”pipelinewithdefaultsettings.200
RESULTS 201
The1000Genomesprojectchromosome-specificVCFsfortheGRCh38assemblycontainbetween6.4M(chr1)to1.1M202
(chr22)variantsoverallthe2504individuals.AfterfilteringforbiallelicSNPs,phased,filteredforPASS,removingindels,203
weareleftwith6.15M(chr1)to1.05M(chr22)variants.Theexperimentallyphaseddatafromthe10XGenomicsplatform204
hasdifferentnumbersofcalledvariantsforeachsequencedindividual.Forchromosome1,thenumberofcalledvariants205
variesfrom414Kto494Kacrossthe28individuals,while,forchromosome22,thenumberofcalledSNPsvariesfrom206
104Kto120K.Afterperformingasimilarfilteringfortheexperimentaldata,thenumberofbiallelicPASSphasedSNPs207
rangesbetween298Kand357Kforchromosome1and64Kand75Kforchromosome22. 208
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
9
209 (a)(b)210
Figure1DistributionofSNPsasafunctionofcontinent-specificminorallelefrequencies(a)onlyexperimentalSNPs(b)all1000211 GenomesSNPs 212
TheSNPsfromtheexperimentallyphasedVCFs(Fig.1a),averagedovercontinentgroupsshowthatthevastmajorityof213
SNPsinthisselectionhavehighcontinent-specificMAFvalues(>5%).Comparingacrosscontinentsforthecontinent214
invariantSNPs,theAfricanandAmericanindividualshaveanorderofmagnitudelesscontinentinvariantSNPsthanthe215
European,EastAsianandSouthAsianindividuals.However,ifwelookatalltheSNPsinthe1000GenomesData(filtered216
forbiallelicPASSphasedSNPs)asafunctionofcontinent-specificMAF,thedistributionweobservehasaverydifferent217
trend.Thereisasignificantover-representationoftheverylowcontinent-specificMAFSNPs(<0.1%),∼5∗107,as218
comparedtoallthesubsequenthigherMAFSNPs,whichallrange<1∗107. 219
Thesediscrepanciesbetweenthenumbersinthe1000Genomesdataandintheexperimentallyphaseddata,aswellas220
thedifferingtrendsasafunctionofMAFoccurbecausethe1000GenomesdataincludesaSNPifevenoneindividualin221
the2504individualshasavariant(heterozygousorhomozygous-alternate)atthatpositionwhiletheexperimentaldata222
includesaSNPonlyifthatparticularindividualhasavariant(heterozygousorhomozygous-alternate)atthatposition.223
ThisresultsinamuchlargernumberofoverallSNPsbeingpresentinthe1000Genomesdataascomparedtothe224
experimentalandalsothemajorityofthe1000GenomesSNPshavingextremelylowMAF,asthosewouldoccuronlyin225
oneorafewindividuals.226
GenotypingError227
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
10
228 (a)(b)229 Figure2Genotypingerror(a)intheexperimentalVCFpositionsasafunctionofcontinent-specificminorallelefrequencyaveragedover230 allchromosomesoverallindividualsineachcontinent(b)falsepositivevsfalsenegativerates(definedintext)forall1000Genomes231 SNPs 232
Genotypingerroriscomputedcomparingthe1000Genomesgenotypeswiththeexperimentalgenotypes.The233
experimentalgenotypesforallSNPsnotpresentintheexperimentalVCFforeachindividualareassumedtobe234
homozygousreference.Mismatchedgenotypesarecountedaserrors.Figure2alooksattheerrors(fractionofgenotypes235
whichareincorrect)fortheexperimentalVCFpositionsasafunctionofthecontinent-specificminorallelefrequencies.236
Thereishighererroratthepopulationinvariantsites(MAF=0.0%)intheAfricanandAmericanpopulationsthanthe237
European,EastAsianandSouthAsianpopulations.Thiscorrelateswithalowertotalnumberofpopulationinvariant238
SNPsinthosecontinents(Fig.1a).Fornon-invariantSNPs,weobserve,asexpected,adecreasingerrorratewith239
increasingminorallelefrequency,toa<1.5%errorgenotypingerrorratefortheSNPswithminorallelefrequencies>1%.240
Comparingfalsepositive(sitesnon-homozygousreferencein1000Genomesdataandhomozygousreferenceinthe241
experimentaldata)vsfalsenegative(siteshomozygousreferencein1000genomesdataandnon-homozygousreference242
intheexperimentaldata)errorratesforall1000Genomessites(Fig.1b),weseethatthegenotypingfortheEuropean243
andAmericanindividualsisveryaccurate,withbothlowfalsepositiveandfalsenegativerates.TheEastAsianandSouth244
Asianpopulationsbothhavemostlylowfalsepositiverates,butshowawiderange(factorof2)offalsenegativerates,245
whileshowingonlya~15%variationinthefalsepositiveratesformostindividuals.Incontrast,theAfricanindividuals246
mostlyhaverelativelylowfalsenegativerates,buthaveamongthehighestfalsepositiverates.Thisindicatesthatthe247
sequencinginthe1000Genomesprojecthasovercallednon-homozygousreferencevariantsinAfricanindividuals248
comparedtotherest,andovercalledSNPsashomozygousreferenceinsomeoftheEastandSouthAsianindividuals. 249
Phasing 250
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
11
251
Figure3SwitcherrorasafunctionofMinorAlleleFrequenciesfordifferentindividualchromosomes.Chromosome21showshigher252 switcherrorforlargeMAFvalues 253
Phasingerrorsareallanalyzedforoverall1000Genomesminorallelefrequencies,notcontinentspecificMAFs.254
Comparingtheswitcherroracrossindividualchromosomes(Fig.3),weobservethattheswitcherrorrangesbetween25255
−30%fortherareMAF(<0.1%)SNPs,fallingto<5%forSNPswithMAFs1−5%.ThemajorityofSNPs,whichfallinthe256
MAF>5%category,haveanerror<2.5%.However,acomparativelyhigherswitcherroratlargerMAFvalues(>5%)is257
observedforchromosome21.Thisplot(Fig.3)showsonlyasubsetofchromosomesasingleindividual(GM18552),but258
thistrendisobservedforallotherchromosomesandindividualsstudied.259
260 (a)(b)(c)261
Figure4Switcherror(a)Totalswitcherror(numberofswitchesinexperimentalSNPs/totalnumberofexperimentalSNPs)foreach262 individual(b)SwitcherrorasafunctionofMinorAlleleFrequenciesaveragedoverallindividualsineachcontinent.(c)Switcherrorasa263 functionofMinorAlleleFrequenciesforallindividualscoloredbycontinent.264
Figure4ashowsthetotalswitcherrorforeachoftheindividuals.Thetotalswitcherrorsforalltheindividualsstudiedgo265
upto∼2.5%.TheswitcherrorsfortheEastAsianindividualsaregroupedtogether,whilethosefortheSouthAsian266
individualsshowgreatervariability.ThisisinlinewiththegeneralobservationthatSouthAsianpopulationshavean267
overallgreaterheterogeneitythandoEastAsianpopulations[J.Wall,Unpublisheddata] 268
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
12
Analyzingtheswitcherrorasafunctionofminorallelefrequencyaveragedoverallchromosomesofallindividualsofa269
population(Fig.4b),weobservelowswitcherror,<5%,forlowminorallelefrequencies(MAF)(1−5%).ForrareSNPs270
withMAF(0.2−1%),theswitcherroris∼5−10%.ForextremelyrareminoralleleSNPs,i.e.MAF<0.2%,theerroris271
muchhigher,i.e.15−35%.ForallhigherMAFvalues(>5%),theerroris<2.5%.Theaverageerrorrateforthe272
individualsfromtheAfricanpopulationsisalmostthesameovertherangeofMAFvalues>0.1%. 273
AsobservedinFigure4c,thedifferencesintheerrorratesbetweenindividualsdecreasewithincreasingminorallele274
frequency.IndividualsfromSouthAsiashowalargervariationinerrorasafunctionofMAFascomparedtoindividuals275
fromEastAsia.TheindividualsfromtheAfricanpopulationshavethelowestswitcherrorovertherangeofMAFvalues.276
IndividualNA20900,anindividualfromtheGujaratiIndiansinHouston(GIH)populationhasthelowestswitcherrorasa277
functionofminorallelefrequencyforthelowMAFSNPs. 278
279
280 (a) (b)281
Figure5Switcherrorasafunctionofinter-SNPdistance(a)Switcherrorasafunctionofinter-SNPdistancesaveragedoverindividuals282 ineachcontinent.(b)Switcherrorasafunctionofinter-SNPdistancesforallindividualscoloredbycontinent. 283
WealsoanalyzedphasingerrorasafunctionofthedistancesbetweenSNPs(Fig.5).Thephasingerrorincreasesasa284
functionoftheinter-SNPdistance,i.e.SNPswhicharefurtherapartaremorelikelytobeoutofphasewitheachother.The285
withinpopulationtrendsarethesameasforswitcherrorvsMAF,wheretheindividualsfromSouthAsiashowalarger286
spreadascomparedtotheindividualsfromEastAsia.IndividualNA20900showsthelowesterrorrate,sameasforthe287
comparisonoferrorvsMAF(Fig.4c). 288
ComparingtheswitcherrorasafunctionofMAFvs.theswitcherrorasafunctionofinter-SNPdistance,weseethatthe289
individualsfromtheAfricanpopulationsshowdistinctlyoppositetrends.ForlowMAFSNPs,theerroristhelowest290
averagingovertheAfricanindividuals,whileacrosstherangeofinter-SNPdistances,theaverageovertheAfrican291
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
13
individualswasthehighesterror.Thereasonthisoccurscanbeunderstoodfromthefactthatthereareahighernumber292
oflowMAFSNPsintheAfricanindividualsintheexperimentaldata(Fig.1a),aswellasanoverallhighernumberofSNPs293
inthoseindividuals,leadingtoahigherSNPdensityfortheseindividuals.Inaddition,thereislesslinkagedisequilibrium294
(LD)intheindividualsfromtheAfricanpopulations,whichwouldmakeithardertophasethemaccurately29-30.Hence,295
pairsofSNPsaremorelikelytobeoutofphasewitheachother,leadingtohigherswitcherrorasafunctionofinter-SNP296
distance. 297
Imputation 298
299 (a)(b)300
Figure6Totalimputationerror(a)TotalimputationerrorinexperimentalSNPs(numberofincorrectgenotypesinallexperimental301 SNPs/totalnumberofexperimentalSNPs)foreachindividual(b)Totalimputationerrorinall1KGSNPs(numberofincorrectgenotypes302 inall1KGSNPs/totalnumberof1KGSNPs)foreachindividual 303
ImputationerroriscomputedasthefractionofSNPswithincorrectlyimputedgenotypes.However,dependingonthe304
subsetofSNPsunderconsideration,theerrorcanbecomputedintwodifferentways,(1)fractionofexperimentalSNPs305
incorrectlyimputedand(2)fractionofall1KGSNPsincorrectlyimputed.Inthecaseoftheseconddefinitionoferror,the306
experimentalcallsforallthepositionsnotintheexperimentalVCFsaresettohomozygous-reference. 307
Figure6ashowsthetotalimputationerrorintheexperimentalSNPswhileFigure6bshowsthetotalimputationerrorin308
the1KGSNPsforeachoftheindividuals.ThetotalimputationerrorsintheexperimentalSNPsforalltheindividuals309
studiedgoupto∼4%.ForthissubsetofSNPs,thetwoAmericanindividualshavetheamongthehighestimputation310
errors.TheimputationerrorsfortheEastAsianindividualsaregroupedtogether,whilethosefortheSouthAsian311
individualsshowgreatervariability.Thisagreeswithourobservationsfortheswitcherror(Fig.4a).Inthe1KGSNPs,on312
theotherhand,sincewearelookingatamuchlargersetofSNPs,mostofwhicharehomozygous-referenceinanygiven313
individual,weseeamuchsmallererror<∼1%.314
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
14
315 (a)(b)316
Figure7ImputationaccuracyexperimentalVCFpositions(a)ImputationerrorintheexperimentalSNPsasafunctionofMinorAllele317 Frequenciesaveragedoverindividualsineachcontinent.(b)ImputationerrorintheexperimentalSNPsasafunctionofMinorAllele318 Frequenciesforallindividualscoloredbycontinent. 319
ImputationerrorinexperimentalSNPsFigure7ashowsawiderangeoferrorratesasfunctionofthecontinent-320
specificminorallelefrequency.Thecontinentinvariantpositions(MAF=0.0%)areimputedalmostasaccuratelyasthe321
highMAF(>5%in3populations,and>1%intwopopulations)SNPs.Inthesepositions,wemakethesameobservationas322
wedidfortheoriginalgenotypinginthe1000genomesreferencedata(Fig.2a),i.e.theerrorsintheEuropean,EastAsian323
andSouthAsianindividualsforthesecontinentinvariantpositionsarelowerthanthosefortheAmericanandAfrican324
individuals.FortheveryrareSNPs,i.e.MAF<0.2%,theerrorisashighas∼60%.Theseextremelyhigherrorratesare325
onlyobservedintheAmericanindividualsandafewoftheSouthAsianindividuals.Fortherestoftheindividuals,the326
errorratesare<50%.Inthemid-rangeofMAFvalues,i.e.0.2%to1%,theerrorsrangebetween10−20%.TheSNPswith327
higherMAFvaluesarefairlyaccurate,witherrors<2%forcommonSNPs(MAF>5%).Thiscanalsobeseenlookingatall328
theindividualsseparately(Fig.7b).TheSouthAsian(GujaratiinHouston,Texas)individualNA20900stillshowsthe329
lowesterrorrateasafunctionofMAFforimputation,justasitdoesfortheswitcherror(Fig.4c).330
331
(a)(b)332
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
15
Figure8Imputationaccuracyall1KGSNPs(a)Imputationerrorinallthe1000GenomespositionsasafunctionofMinorAllele333 Frequenciesaveragedoverindividualsineachcontinent.(b)Imputationerrorinallthe1000GenomespositionsasafunctionofMinor334 AlleleFrequenciesforallindividualscoloredbycontinent. 335
Imputationerrorinall1KGSNPsComputingtheerrorusingallthe1KGSNPs,weseeadifferenttrendfortheerrorsas336
afunctionofminorallelefrequency(Figs.8a,8b).Theinvariantsiteshaveverylowerrors~10-4.Forthevariantsites,the337
errorsincreaseasafunctionofminorallelefrequency,asopposedtodecreasingastheydointheexperimentalonlySNPs.338
ThereasonthishappensisthatcontrastingthenumberofexperimentalSNPs(Fig.1a)withthenumbersofall1KG339
SNPs(Fig.1b),whilethenumberoflowMAFSNPsis1-2ordersofmagnitudelessthanthenumberofSNPswithMAF>340
5%intheexperimentaldata,thenumberofverylowMAFSNPsis2-10timesgreaterthanthenumberofSNPswithMAF341
>5%inthewhole1000Genomesdata.ThevastmajorityoftheverylowMAFSNPsinthewhole1000Genomesdataare342
homozygous-reference,sincethoseSNPsshowvariationinonlyoneorveryfew1000Genomesindividuals.Hence,343
imputationpredictionsgetmostofthosepositionscorrectinmostoftheindividuals.Asaresult,thefractionofthosevery344
rareSNPswhicharepredictedincorrectlyismuchlowerwhenconsideringallthe1000GenomesSNPsascomparedto345
onlyconsideringtheexperimentalSNPs,wheremostoftheSNPsarehighMAFSNPs. 346
ConsistentwiththeobservationsfortheexperimentalonlySNPs,atveryrareSNPs(MAF<0.2%),theAmerican347
individualsstillhavethehighesterrorrate.TheindividualsfromtheSouthAsianpopulationsstillshowagreaterspread348
thanthosefromtheEastAsianpopulations.IndividualNA20900stillshowsthelowesterrorrateaswithprevious349
observations.350
351 (a)(b)352
Figure9ImputationerrorasafunctionofMinorAlleleFrequenciesforEuropeanindividualscomparingtheEuropeanreferencepanel353 v/stheentire1KGreferencepanel(a)experimentalSNPs(b)All1000GenomesSNPs 354
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
16
ComparisonofreferencepanelsHere,wecomparetheimputationerrorsresultingfromusingdifferentreference355
panelsforimputation.Acontinent-specificreferencepanelfortheindividualofinterest,areferencepanelwhichincludes356
allofthe1000Genomesindividuals,andacontinent-specificreferencepanelforadifferentcontinentfromtheonefrom357
whichtheindividualsare,arechosen.Theminorallelefrequenciesusedhereareforalltheoverall1000Genomesminor358
allelefrequencies,insteadofacontinent-specificminorallelefrequency,sincewewanttounderstandtheimpactofthe359
choiceofreferencepanel,andcontinent-specificMAFswouldnotalignwiththewholereferenceorthereferencefrom360
anothercontinent.Inthiscase,welookattheimputationerrorinthe3Europeanindividualswhenimputationiscarried361
outwiththeEuropeanreference,theSouthAsianreference,andthewhole1000Genomesreference.362
TheobservedresultforexperimentalonlySNPs(Fig.8a)whencomparingreferencepanelsfortheEuropeanindividuals363
isverysimilarwhenlookingatall1000GenomesSNPs(Fig.8b).Theimputationaccuracywhenusingtheentire1000364
GenomesdataasareferencepanelgivesaslightlybetteraccuracythanusingjustaEuropeanspecificreferencepanel.365
Theerrorwhileusinganincorrectreferencepanel,however,isuptoafactorof2greaterthantheerrorwhenusingthe366
appropriatereference,orwhenusingthewhole1000Genomesreferencepanel.ThetrendoferrorasafunctionofMAFis,367
again,theoppositeofwhatwasobservedwhenlookingatonlytheexperimentalSNPs.368
DISCUSSION 369
The1000GenomesProjectdatahavebeenwidelyusedasareferenceforestimatingcontinent-specificallelefrequencies,370
andasareferencepanelforphasingandimputationstudies.Sincetheproject’sdesigninvolvedlow-coverage(~7X)371
sequencingformostofthesamples,itwasunknownapriorihowaccuratethe1KGP’sgenotypeandhaplotypecallswere,372
especiallyforrarevariants.Thisaccuracyobviouslydirectlyimpactstheusefulnessofthe1KGPdata.Withtheadventof373
inexpensive,commercialplatformsforexperimentallyphasingwholegenomes,itispossibletodirectlyquantifythe374
genotypeandhaplotypeerrorratesofthe1KGPdata.375
376
Ourcomparisonof28experimentallyphasedgenomeswiththe1KGPdatafoundthatthelatterishighlyaccuratefor377
commonandlow-frequencyvariants(i.e.,MAF≥0.01).Asexpected,accuracydeclinedwithdecreasingMAF,withrare378
variants(MAF<0.01)notreliablyimputedontohaplotypes.Surprisinglythough,thegenotypecallswerereasonably379
accurateevenforrarevariants.Thisobservationmaynotgeneralizetootherlow-coveragesequencingstudiesduetothe380
complicatedandlabor-intensiveprotocolusedforvariantcallinginthe1KGP.Weconcludethatthe1KGPdataisbest381
usedasareferencepanelforimputingvariantswithMAF≥0.01intopopulationscloselyrelatedtothe1KGPgroups,and382
isprobablyoflimitedutilityforimputationinrarevariantassociationstudies.Largersubsequentimputationpanels,such383
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
17
astheonegeneratedbytheHaplotypeReferenceConsortium(HRC)31,arelikelymuchmoreusefulforimputingrare384
variants,atleastinwell-studiedEuropeanpopulations.However,eventhislargereferencepanelmaybeoflimited385
usefulnessforimputationintootherhumangroups.Whileourresultssuggestthatusingaregion-specificreferencepanel386
(forthecorrectregion)forimputationisonlyslightlyworsethanusingaworldwidepanel,thechoiceofanincorrect387
regionalpanelmakestheimputationconsiderablyworse.So,largeEuropean-basedhaplotypereferencepanelswillbeof388
limitedutilityforimputingvariantsintoEastAsian,SouthAsian,orAfrican-Americangenomes,whileimputationstudies389
involvingunderstudiedgroupssuchasMiddleEasterners,MelanesiansorKhoisanarelikelytohaveerrorrates390
substantiallyhigherthanwhatwasobservedinourstudy.Thisisaconsequenceofthefactthatmostrarevariantsare391
region-specific;imputationonlyworkswhenthevariantbeingimputedshowsupoftenenoughinthereferencepanel.In392
summary,whilethe1KGPandHRCprovidevaluablegenomicresourcesthatcanaugmentthepowerofGWASingroups393
withEuropeanancestry,additionallarge-scalegenomesequencingofdiversehumanpopulationswillbenecessaryto394
obtaincomparablebenefitsofimputationingeneticassociationstudiesofnon-Europeangroups.395
396
Finally,wenotethattheabsoluteerrorratevariedbyanorderofmagnitude,dependingonthespecificdefinitionsof397
errorthatwereused.Thishighlightstheimportanceofdefinitionalclarityinstudiesthatevaluatetheaccuracyof398
genomicresources. 399
ConflictsofInterestDeclaration 400
GenentechauthorsholdsharesinRoche.Theotherauthorsdeclarenoconflictsofinterest.401
Acknowledgments402
JDWwassupportedinpartbyNIHgrantR01GM115433.SBwassupportedbyGenentechresearchgrantCA0095684.403
404
LITERATURECITED 405
1. The1000GenomesProjectConsortium(2010)Amapofhumangenomevariationfrompopulation-scale406
sequencing.Nature467,1061–1073. 407
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
18
2. The1000GenomesProjectConsortium(2012)Anintegratedmapofhumangeneticvariationfrom1092human408
genomes.Nature491,56–65. 409
3. The1000GenomesProjectConsortium(2015)Aglobalreferenceforhumangeneticvariation.Nature526,68–410
74. 411
4. Tewhey,R.,Bansal,V.,Torkamani,A.,Topol,E.J.,andSchork,N.J.(2011)Theimportanceofphaseinformation412
forhumangenomics.NatureReviewsGenetics12,215–223 413
5. Browning,S.andBrowning,B.(2011).Haplotypephasing:existingmethodsandnewdevelopments.Nature414
ReviewsGenetics12,703–714 415
6. Stephens,M.andScheet,P.(2005)Accountingfordecayoflinkagedisequilibriuminhaplotypeinferenceand416
missing-dataimputation.Am.J.Hum.Genet.76,449–462 417
7. Scheet,P.andStephens,M.(2006)Afastandflexiblestatisticalmodelforlarge-scalepopulationgenotypedata:418
applicationstoinferringmissinggenotypesandhaplotypicphase.Am.J.Hum.Genet.78,629–644 419
8. Browning,S.andBrowning,B.(2007).Rapidandaccuratehaplotypephasingandmissing-datainferencefor420
whole-genomeassociationstudiesbyuseoflocalizedhaplotypeclustering.Am.J.Hum.Genet.81,1084–1097 421
9. Browning,B.andBrowningS.(2009).Aunifiedapproachtogenotypeimputationandhaplotype-phase422
inferenceforlargedatasetsoftriosandunrelatedindividuals.Am.J.Hum.Genet.84,210–223 423
10. Delaneau,O.,Coulonges,C.,andZagury,J.-F.(2008).Shape-IT:newrapidandaccuratealgorithmforhaplotype424
inference.BMCBioinformatics9 425
11. Delaneau,O.,Marchini,J.,andZagury,J.-F.(2012)Alinearcomplexityphasingmethodforthousandsofgenomes.426
NatureMethods9,179–181 427
12. Loh,P.-R.,Danecek,P.,Palamara,P.F.,Fuchsberger,C.,ReshefY.A.,Finucane,H.K.Schoenherr,S.,Forer,L.,428
McCarthy,S.,Abecasis,G.R.,etal.,(2016)Reference-basedphasingusingthehaplotypereferenceconsortium429
panel.NatureGenetics48,1443–1448 430
13. Loh,P.-R.,Palamara,P.F.,andPrice,A.L.(2016)Fastandaccuratelong-rangephasinginaUKbiobankcohort.431
NatureGenetics48,811–816 432
14. Howie,B.N.,Donnelly,P.,andMarchini,J.(2009)Aflexibleandaccurategenotypeimputationmethodforthe433
nextgenerationofgenome-wideassociationstudies.PLOSGenetics5 434
15. Synder,M.W.,Adey,A.,Kitzman,J.O.,andShendure,J.(2015)Haplotype-resolvedgenomesequencing:435
experimentalmethodsandapplications.NatureReviewsGenetics16,344–358 436
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
19
16. Zheng,G.X.Y.,Lau,B.T.,Schnall-Levin,M.,Jarosz,M.,Bell,J.M.,Hindson,C.M.,Kyriazopoulou-Panagiotopoulou,437
S.,Masquelier,D.A.,Merrill,L.,Terry,J.M.,etal.,(2016)Haplotypinggermlineandcancergenomeswithhigh-438
throughputlinked-readsequencing.NatureBiotechnology34,303–311 439
17. Spencer,C.C.A.,Su,Z.,Donnelly,P.,andMarchini,J.(2009)Designinggenome-wideassociationstudies:sample440
size,power,imputation,andthechoiceofgenotypingchip.PLoSGenetics5 441
18. Marchini,J.,Howie,B.,Myers,S.,McVean,G.,andDonnelly,P.(2007)Anewmultipointmethodsforgenome-442
wideassociationstudiesbyimputationofgenotypes.NatureGenetics37,906–913 443
19. Zeggini,E.,Scott,L.J.,Saxena,R.,Voight,B.F.,Marchini,J.L.,Hu,T.,deBakker,P.I.,Abecasis,G.R.,Almgren,P.,444
Andersen,G.,etal.(2008)Meta-analysisofgenomewideassociationdataandlarge-scalereplicationidentifies445
additionalsusceptibilitylocifortype2diabetes.NatureGenetics40,638–645 446
20. Marchini,J.andHowie,B.(2010)Genotypeimputationforgenome-wideassociationstudies.NatureReviews447
Genetics11,499–511 448
21. Li,Y.,Willer,C.J.,Sanna,S.,andAbecasis,G.R.(2009)Genotypeimputation.Annu.Rev.GenomicsHum.Genet.449
10,387–406450
22. Li,Y.,Willer,C.J.,Ding,J.,Scheet,P.,andAbecasis,G.R.(2010)Mach:usingsequenceandgenotypedatato451
estimatehaplotypesandunobservedgenotypes.Genet.Epidemiol.34,816–834 452
23. Browning,S.(2006).Multilocusassociationmappingusingvariable-lengthmarkovchains.Am.J.Hum.Genet.78,453
903–913 454
24. Huang,L.,Li,Y.,Singleton,A.B.,Hardy,J.A.,Abecasis,G.,RosenbergN.A.,andScheet,P.(2009)Genotype-455
imputationaccuracyacrossworldwidehumanpopulations.Am.J.Hum.Genet.84,235–250 456
25. Danecek,P.,Auton,A.,Abecasis,G.,Albers,C.A.,Banks,E.,DePristo,M.A.,Handsaker,R.E.,Lunter,G.,Marth,G.457
T.,Sherry,S.T.,etal.(2011).Thevariantcallformatandvcftools.Bioinformatics27,2156–2158. 458
26. Marchini,J.,Cutler,D.,Patterson,N.,Stephens,M.,Eskin,E.,Halperin,E.andLin,S.,Qin,Zhaohui,S.Q.,Munro,H.459
M.Abecasis,G.R.,etal.,(2006)Acomparisonofphasingalgorithmsfortriosandunrelatedindividuals.Am.J.460
Hum.Genet.78,437–450 461
27. Stephens,M.andDonnelly,P.(2003)Acomparisonofbayesianmethodsforhaplotypereconstructionfrom462
populationgenotypedata.Am.J.Hum.Genet.73,1162–1169 463
28. Weisenfeld,N.I.,Kumar,V.,Shah,P.,Church,D.M.,Jaffe,D.B.(2017)Directdeterminationofdiploidgenome464
sequences.GenomeResearch27(5),757-767 465
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint
20
29. Frisse,L.,Hudson,R.R.,Bartoszewicz,A.,Wall,J.D.,Donfack,J.,DiRienzo,A.(2001)Geneconversionand466
differentpopulationhistoriesmayexplainthecontrastbetweenpolymorphismandlinkagedisequilibrium467
levels.Am.J.Hum.Genet.69,831–843468
30. Gabriel,S.B.,Schaffner,S.F.,Nguyen,H.,Moore,J.M.,Roy,J.,Blumenstiel,B.,Higgins,J.,DeFelice,M.,Lochner,A.,469
Faggart,M.,etal.(2002)Thestructureofhaplotypeblocksinthehumangenome.Science296,2225–2229470
31. McCarthy,S.,Das,S.,Kretzschmar,W.,Delaneau,O.,Wood,A.R.,Teumer,A.,Kang,H.M.,Fuchsberger,C.,471
Danecek,P.,Sharp,K.,etal.(2016)Areferencepanelof64,976haplotypesforgenotypeimputation.Nature472
Genetics48(10),1279-1283473
.CC-BY-ND 4.0 International licensecertified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (which was notthis version posted August 3, 2018. . https://doi.org/10.1101/383950doi: bioRxiv preprint