
Page 1: UKB : UKB Homepage - imputation documentation v1biobank.ctsu.ox.ac.uk/crystal/crystal/docs/impute_ukb_v1.pdf · 2019. 5. 31. · the interim release of the UK Biobank (UKB) genotype


UK Biobank Phasing and Imputation Documentation

Version 1.2

13 November 2015

Documentation author: Jonathan Marchini, Department of Statistics, University of Oxford

on behalf of UK Biobank

Contributors to UK Biobank Phasing and Imputation: Jonathan Marchini (Statistics Dept, Oxford), Jared O’Connell (WTCHG, Oxford), Olivier Delaneau (University of Geneva), Kevin Sharp (Statistics Dept, Oxford), Warren Kretzschmar (WTCHG, Oxford), Gavin Band (WTCHG, Oxford), Shane McCarthy (WTSI, Hinxton), Desislava Petkova (WTCHG, Oxford), Claire Bycroft (WTCHG, Oxford), Colin Freeman (WTCHG, Oxford), Peter Donnelly (WTCHG, Oxford).


Table of Contents

Introduction

Phasing
    Filtering before phasing
    Phasing method description
    Validation of the phasing method
    Whole genome phasing

Genotype imputation
    Assessment of the UK Biobank Array for imputation
    Reference panel used for imputation
    Imputation method description
    Whole genome imputation
    Information scores, minor allele frequencies and filtering
    Imputed genotype files
        Sample files
    Differences between raw genotypes and imputed files

An exemplar genome-wide association study
    Sample filtering
    Taking account of the different arrays used
    Association testing
    Results

File processing

References


Introduction

This document describes the analysis carried out to perform genotype imputation for the interim release of the UK Biobank (UKB) genotype data. It also provides advice about using the imputed data to carry out genome-wide association studies (GWAS) or for extracting genotypes for use as covariates in other types of association study.

Genotype imputation [1,2] is the process of predicting genotypes that are not directly assayed in a sample of individuals. A reference panel of haplotypes at a dense set of SNPs, indels and structural variants is used to impute genotypes into a study sample of individuals that have been genotyped at a subset of the SNPs. These 'in silico' genotypes can then be used to boost the number of SNPs that can be tested for association. This increases the power of the study and the ability to resolve or fine-map the causal variants, and facilitates meta-analysis. The result of the imputation process is a dataset with 73,355,667 SNPs, short indels and large structural variants in 152,249 individuals. See Box 1 of [1] for a quick visual overview of how genotype imputation works.

The process of imputation is divided into two steps: (i) pre-phasing, and (ii) imputation. In the first step, the samples to be imputed are 'pre-phased', i.e. a statistical method is applied to genotype data to infer the underlying haplotypes of each individual. In the second step, a different statistical method is used to combine the inferred haplotypes with a reference panel of haplotypes and impute the unobserved genotypes in each sample. The following two sections of this document describe how the pre-phasing and imputation were carried out on the ~150,000 samples.

Phasing and imputation can be a computationally intensive process. To avoid many different research groups having to carry this out independently, phasing and imputation have been carried out centrally. Questions about using the imputed genotypes should be sent to the UKB Genetics mailing list set up for this purpose. You can subscribe to the mailing list here: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=UKB-GENETICS


Phasing

Filtering before phasing

To create the input data for phasing we applied SNP QC filters as described in the UK Biobank QC documentation [3]. The samples were genotyped on two slightly different chips. Approximately 50,000 were genotyped as part of the UK BiLEVE study using a chip designed for that study (denoted UKBL), with the remaining samples (~100,000) genotyped on the UKB chip. Therefore, we applied different missingness filters on SNPs dependent upon chip. SNPs were removed based on the number of batches in which they are completely missing:

i. SNPs on both the UKB chip and the UKBL chip: remove them if they are missing in more than 3 batches (out of 33 batches)

ii. SNPs on the UKB chip and not the UKBL chip: remove them if they are missing in more than 2 batches (out of 22 batches)

iii. SNPs on the UKBL chip and not the UKB chip: remove them if they are missing in more than 1 batch (out of 11 batches)
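The three batch-missingness rules above can be sketched as a small lookup. This is only an illustration of the thresholds as stated; the function name and the chip-membership labels ("both", "ukb_only", "ukbl_only") are hypothetical and not part of the UK Biobank pipeline.

```python
# Thresholds copied from rules i-iii: (maximum allowed missing batches, total batches).
# The dictionary keys are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "both": (3, 33),       # SNP present on both the UKB and UKBL chips
    "ukb_only": (2, 22),   # SNP present on the UKB chip only
    "ukbl_only": (1, 11),  # SNP present on the UKBL chip only
}


def remove_snp(chip_membership: str, n_missing_batches: int) -> bool:
    """Return True if the SNP is completely missing in too many batches."""
    max_missing, n_batches = THRESHOLDS[chip_membership]
    if not 0 <= n_missing_batches <= n_batches:
        raise ValueError("missing-batch count must lie between 0 and the number of batches")
    # "More than" the threshold means a strict inequality.
    return n_missing_batches > max_missing
```

For example, a SNP on both chips that is completely missing in exactly 3 batches is kept, while one missing in 4 batches is removed.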

1,037 sample outliers [3] were removed. Multi-allelic SNPs and SNPs with a minor allele frequency (MAF) < 1% were then removed from the dataset. These filters resulted in a dataset with 641,018 autosomal SNPs in 152,256 samples. Chromosome X phasing and imputation will be carried out at a later date.

Phasing method description

Phasing on the autosomes was carried out using a version of the SHAPEIT2 [4] program modified to allow for very large sample sizes. This new method (which we refer to as SHAPEIT3) modifies SHAPEIT2's surrogate family approach to remove a quadratic complexity component of the algorithm [5]. In small sample sizes of a few thousand samples, this part of the algorithm, which involves calculating Hamming distances between current haplotype estimates, contributes only a relatively small part of the computational cost. As sample sizes increase beyond 10,000 samples this component becomes significant. The new algorithm uses a divisive clustering algorithm to identify clusters of haplotypes, and then calculates Hamming distances only between pairs of haplotypes within each cluster. Only haplotypes within each cluster are used as candidates for the surrogate family copying states in the HMM model. The resulting algorithm has complexity O(N log N) where N is the number of haplotypes in the dataset being phased. In practice, we have observed that the method exhibits scaling close to linear. This is a crucial feature of the method, especially for very large sample sizes, and a property not shared by other approaches [6,7]. The development of this approach is ongoing and there is substantial scope to make further improvements in speed and accuracy. A newer version is likely to offer an order of magnitude reduction in run time.
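To see why restricting Hamming distance calculations to clusters matters, the following sketch counts haplotype pairs with and without clustering. It does not implement the SHAPEIT3 clustering itself; the cluster sizes are hypothetical numbers chosen for illustration.

```python
def hamming(h1, h2):
    """Number of sites at which two equal-length haplotypes differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return sum(a != b for a, b in zip(h1, h2))


def n_pairs_all(n_haplotypes):
    """Pairs compared when every haplotype is compared with every other (quadratic)."""
    return n_haplotypes * (n_haplotypes - 1) // 2


def n_pairs_within_clusters(cluster_sizes):
    """Pairs compared when comparisons are restricted to haplotypes in the same cluster."""
    return sum(c * (c - 1) // 2 for c in cluster_sizes)


# Hypothetical example: 10,000 haplotypes split into 10 clusters of 1,000.
all_pairs = n_pairs_all(10_000)
clustered_pairs = n_pairs_within_clusters([1_000] * 10)
```

With these illustrative numbers the clustered scheme compares roughly a tenth as many pairs, and the saving grows as the number of haplotypes increases while cluster sizes stay bounded.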


Validation of the phasing method

The accuracy of this new method was assessed by taking advantage of 72 mother-father-child trios that were identified in the UKB dataset [3]. This family information can be used to infer the phase of a large number of SNPs in the trio parents. These family-inferred haplotypes were used as a truth set, as is common in the phasing literature [4]. The parents of each trio were removed from the dataset and then haplotypes were estimated across chromosome 20 in a single run of SHAPEIT3. This dataset consisted of 16,762 autosomal SNPs. The inferred haplotypes were then compared to the truth set using the switch error metric [4]. We obtained an exceptionally low switch error rate of 0.4% across the trio children reporting British ancestry. By adjusting parameters of the method we have observed switch error rates lower than 0.3%. With switch error rates this low, long chunks of sequence of many megabases will be inferred correctly. Downstream imputation from such haplotypes will be highly accurate.

To assess the performance gain of phasing all 152,112 samples together, versus phasing in smaller subsets of samples, two other test datasets of size 1,072 and 10,072 samples were created, also containing the trio children. The results are shown in full detail in Table 1 and highlight the benefits of joint phasing of all the samples. These results clearly demonstrate the close to linear scaling of the SHAPEIT3 algorithm.

Sample size   Method     Switch error (%)   Run time (hrs)   Run time scaling   Sample size scaling   Threads
1,072         SHAPEIT3   2.6                0.25             1                  1                     10
10,072        SHAPEIT3   1.3                2.5              10                 9.4                   10
152,112       SHAPEIT3   0.4                38.5             154                142                   10

Table 1: Phasing performance on UKB samples.
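As an illustration of the switch error metric, the sketch below counts phase switches between consecutive heterozygous sites of one individual. It follows the common definition of the metric and is not the evaluation code used for Table 1; it assumes haplotypes are coded as lists of 0/1 alleles and that the inferred genotypes agree with the truth set at heterozygous sites.

```python
def switch_error_rate(truth, inferred):
    """Fraction of consecutive heterozygous site pairs whose relative phase is wrong.

    truth and inferred are (hap_a, hap_b) pairs of equal-length 0/1 lists.
    """
    t_a, t_b = truth
    i_a, _ = inferred
    # Phase is only defined at sites where the two true haplotypes differ.
    het_sites = [k for k in range(len(t_a)) if t_a[k] != t_b[k]]
    # Orientation at each het site: does the first inferred haplotype
    # carry the allele of the first true haplotype?
    orientation = [i_a[k] == t_a[k] for k in het_sites]
    if len(orientation) < 2:
        return 0.0
    switches = sum(1 for x, y in zip(orientation, orientation[1:]) if x != y)
    return switches / (len(orientation) - 1)


# Hypothetical toy data: four heterozygous sites with one switch in the middle.
truth = ([0, 0, 0, 0], [1, 1, 1, 1])
inferred = ([0, 0, 1, 1], [1, 1, 0, 0])
rate = switch_error_rate(truth, inferred)
```

Note that swapping the two inferred haplotypes wholesale does not count as an error, since the labelling of the two haplotypes is arbitrary.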

Whole genome phasing

Phasing was carried out in chunks of 5,000 SNPs, with an overlap of 250 SNPs between chunks. SHAPEIT3 was run on each chunk using 4 cores per job and S=200 copying states. As part of the phasing process any remaining missing genotypes were imputed. Chunks were ligated using a modified version of the hapfuse program.
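A minimal sketch of how a chromosome might be split into overlapping SNP chunks of this kind is given below. The chunk size and overlap are the values quoted above; how the real pipeline treated the final, shorter chunk is not stated, so truncating it at the last SNP is an assumption of this sketch.

```python
def snp_chunks(n_snps, chunk_size=5000, overlap=250):
    """Return (start, end) SNP index pairs, end exclusive.

    Adjacent chunks share `overlap` SNPs; the last chunk is truncated (assumption).
    """
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, n_snps)
        chunks.append((start, end))
        if end == n_snps:
            break
        start += step
    return chunks


# Using the chromosome 20 SNP count quoted in the validation section.
chr20_chunks = snp_chunks(16762)
```

Each pair of consecutive chunks shares 250 SNPs, which is what allows the phased chunks to be ligated afterwards.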


Genotype imputation

Assessment of the UK Biobank Array for imputation

The UK Biobank Axiom array from Affymetrix was specifically designed to optimize imputation performance in GWAS studies [8]. An experiment was carried out to assess the imputation performance of the array, stratified by allele frequency, and to compare performance to some other commercially available arrays.

Performance was assessed using high-coverage, whole-genome sequence data made publicly available by Complete Genomics (CG).

Data from 10 samples from the European ancestry (CEU) population was used. All variant sites with a call rate below 90% were filtered out in order to only consider very reliable sites in the analysis. Only data from chromosome 20 was used. To mimic a typical imputation analysis, a pseudo-GWAS dataset was constructed by extracting the CG SNP genotypes at all the sites included on a given array. All sites not on the array were then imputed using the UK10K reference panel [9]. Imputation was carried out using IMPUTE2 [10], which chooses a custom reference panel for each study individual in each 1Mb segment of the genome. The k_hap parameter of IMPUTE2 was set to 1,000. All other parameters were set to default values.

This experiment was repeated for 4 different genome-wide SNP arrays: (a) Affymetrix UK Biobank Axiom array, (b) Illumina Omni 2.5M array, (c) Illumina Omni 1M Quad, (d) Illumina Omni Express. Variants were stratified into allele frequency bins and the squared correlation (R^2) was calculated between the allele dosages at variants in each bin and the masked CG genotypes. Since different arrays contain different numbers of variants it is important to make sure that imputation performance is measured at the same set of variants when comparing chips. To achieve this, both imputed and array variants were included in the R^2 analysis, so that the comparison measures the overall performance of each array. As a consequence, an array with more variants will gain an advantage, as it is reasonable to expect that directly genotyping a variant will yield more accurate genotypes than imputation.

Figure 1 shows the results of this analysis. The x-axis is non-reference allele frequency (%) on a log scale, which focuses in on rarer variants. The y-axis is imputation performance (R^2). The salient points are

a. The UK Biobank chip (purple) outperforms the Illumina Omni 1M Quad (blue) and Illumina Omni Express (green), both of which have comparable numbers of variants.

b. The UK Biobank chip performs almost as well as the Illumina Omni 2.5M chip (red), which has ~3 times the number of SNPs. It is worth noting that the UKB chip and Illumina Omni 2.5M chip are very close in the 1-5% range. This is a likely consequence of the chip design process focusing in part on this frequency range [8].
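The per-bin accuracy measure described above can be sketched as follows. This is an illustrative reconstruction, not the code used to produce Figure 1: pooling all dosage/genotype pairs within a frequency bin before computing one squared Pearson correlation is an assumption about what "aggregate" R^2 means, and the bin edges and data are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy ** 2 / (sxx * syy)


def aggregate_r2_by_bin(variants, bin_edges):
    """variants: iterable of (frequency_percent, imputed_dosages, true_genotypes).

    Returns {(lower, upper): R^2} for each non-empty bin with lower <= frequency < upper.
    """
    n_bins = len(bin_edges) - 1
    pooled = [([], []) for _ in range(n_bins)]
    for freq, dosages, genotypes in variants:
        for i in range(n_bins):
            if bin_edges[i] <= freq < bin_edges[i + 1]:
                pooled[i][0].extend(dosages)
                pooled[i][1].extend(genotypes)
                break
    return {
        (bin_edges[i], bin_edges[i + 1]): squared_correlation(*pooled[i])
        for i in range(n_bins)
        if pooled[i][0]
    }


# Hypothetical toy data: one perfectly imputed variant and one slightly noisy one.
toy = [
    (0.5, [0.0, 1.0, 2.0], [0, 1, 2]),
    (3.0, [0.1, 0.9, 1.8], [0, 1, 2]),
]
r2 = aggregate_r2_by_bin(toy, [0.1, 1.0, 5.0])
```

Directly genotyped variants enter this calculation with dosages equal to the true genotypes, which is how an array with more variants gains the advantage mentioned above.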


The overall conclusion of this analysis is that the Affymetrix UKB array is a very good array from which to carry out genotype imputation. The caveat is that this analysis is focused on samples with European ancestry.

Figure 1: Comparison of imputation performance of the UK Biobank Array and several other commercially available genotyping arrays.

Reference panel used for imputation

There are a number of factors that influence the accuracy of genotype imputation [1], but generally accuracy will increase as the number of haplotypes in the reference panel grows and if the ancestry of the sample haplotypes is a good match to the ancestry of the reference panel haplotypes. The UKB dataset consists of samples with a diverse range of ancestries, but with the majority of samples having British (or European) ancestry. For this reason it was desirable to use a reference panel with a large number of haplotypes with British and European ancestry, and also a diverse set of haplotypes from other world-wide populations. To achieve this the UK10K haplotype reference panel was merged together with the 1000 Genomes Phase 3 reference panel using the -merge_ref_panels option in the IMPUTE2 software. Using this merged panel has been shown to produce a high-quality reference panel for imputation [9]. An advantage of this reference panel is that it includes SNPs, short indels and larger structural variants. The reference panel consists of 87,696,888 bi-allelic variants in 12,570 haplotypes.

[Figure 1 plot. Title: Genotyping accuracy after imputation from UK10k (7562 haplotypes). Samples: 10 EUR CG2. Comparison at 219303 sites on chr20 (includes genotyped SNPs). Allele frequency calculated from reference panel. x-axis: non reference allele frequency (%), log scale from 0.02 to 100. y-axis: Aggregate R2, from 0.0 to 1.0. Legend (genotyping array): Illumina Omni 2.5M, Illumina Omni 1M Quad, Illumina Omni Express, Affy UK Biobank.]


Imputation method description

Imputation was carried out using the same algorithm as is implemented in the IMPUTE2 program. The current IMPUTE2 program is a very flexible tool for phasing and imputation that implements a general set of options. A new C++ program was written from scratch to focus exclusively on the haploid imputation needed when samples have been pre-phased. This new version is more efficient than IMPUTE2 in both memory use and computation. The method takes advantage of high correlations between inferred copying states in the HMM to reduce computation. We refer to this program as IMPUTE3.

Whole genome imputation

Imputation was carried out in chunks of 2Mb with a 250kb buffer region. A set of 2,000 haplotype copying states was used to impute each sample. Imputed variants in each non-overlapping part of each chunk were concatenated into per-chromosome files.
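The chunking scheme can be sketched as a list of core intervals, whose variants are kept, together with the buffered intervals actually analysed. Applying the 250kb buffer on both sides of each chunk and using 1-based inclusive coordinates are assumptions of this sketch; the chromosome length in the example is hypothetical.

```python
def imputation_windows(chrom_length_bp, core_bp=2_000_000, buffer_bp=250_000):
    """Return a list of dicts with 1-based inclusive 'core' and 'buffered' intervals."""
    if chrom_length_bp <= 0:
        raise ValueError("chromosome length must be positive")
    windows = []
    for core_start in range(1, chrom_length_bp + 1, core_bp):
        core_end = min(core_start + core_bp - 1, chrom_length_bp)
        # The buffer extends the analysed region but is clipped at the chromosome ends.
        buffered_start = max(1, core_start - buffer_bp)
        buffered_end = min(chrom_length_bp, core_end + buffer_bp)
        windows.append(
            {"core": (core_start, core_end), "buffered": (buffered_start, buffered_end)}
        )
    return windows


# Hypothetical 5 Mb chromosome.
windows = imputation_windows(5_000_000)
```

Only variants falling in the "core" interval of each window would be written out, so that concatenating the windows gives each position exactly once.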

Information scores, minor allele frequencies and filtering

QCTOOL was used to calculate the minor allele frequency (MAF) and imputation information score of each imputed variant. The imputation information is a metric between 0 and 1. A value of 1 indicates that there is no uncertainty in the imputed genotypes whereas a value of 0 means that there is complete uncertainty about the genotypes. A value of α in a sample of N individuals indicates that the amount of data at the imputed SNP is approximately equivalent to a set of perfectly observed genotype data in a sample size of αN.

Many GWAS carried out to date have applied thresholds on MAF and information score as filters. There is no single correct threshold to use. However, as MAF decreases it is generally the case that imputation quality decreases. Previous studies have tended to use an information threshold between 0.3 and 0.5. Since these studies have typically consisted of hundreds or low thousands of samples, an information of 0.3 corresponds to an effective sample size with limited power to detect associations. However, the UK Biobank dataset is considerably larger in size than most previous GWAS. An information measure of 0.3 in ~150,000 samples roughly corresponds to an effective sample size of ~45,000, which would be expected to yield very good power to detect association.
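The effective sample size argument above is simple arithmetic, sketched here so it can be repeated for other thresholds and cohort sizes. The function name is hypothetical; the calculation is just the product αN described in the text.

```python
def effective_sample_size(info, n_samples):
    """Approximate number of perfectly observed samples equivalent to an imputed variant."""
    if not 0.0 <= info <= 1.0:
        raise ValueError("information score must lie between 0 and 1")
    if n_samples < 0:
        raise ValueError("sample size must be non-negative")
    return info * n_samples


# The example from the text: information 0.3 in roughly 150,000 samples.
n_eff_ukb = effective_sample_size(0.3, 150_000)
# The same threshold in a hypothetical study of 2,000 samples.
n_eff_small = effective_sample_size(0.3, 2_000)
```

The same information threshold therefore leaves very different amounts of usable data depending on the size of the study.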

Some variants are imputed as monomorphic, or close to monomorphic, i.e. with no or almost no variation in the genotypes. Such sites were removed with QCTOOL using a filter on MAF of 0.001%. In addition, 7 samples were removed from the dataset due to these individuals having requested their data be removed from the study. The resulting dataset consists of 73,355,667 variants in 152,249 individuals.

The distribution of information scores at these 73,355,667 variants is shown in Figure 2(a). Plots stratified by MAF are also shown: (b) MAF >= 5%, (c) 1% <= MAF < 5%, (d) 0.1% <= MAF < 1%, (e) 0.01% <= MAF < 0.1%, (f) 0.001% <= MAF < 0.01%.


Figure 2: Distribution of information scores at variants in the imputed dataset. The x-axis shows the information score on the scale 0 to 1.

Imputed genotype files

Let $G_{ij}$ denote the genotype of the ith sample at the jth variant. The process of genotype imputation produces a probability distribution for each genotype, i.e.

$p_{ij0} = P(G_{ij} = \mathrm{AA})$, $p_{ij1} = P(G_{ij} = \mathrm{AB})$, $p_{ij2} = P(G_{ij} = \mathrm{BB})$

where A and B are the two alleles at the variant. This probability triple (which sums to 1) is provided in the imputed genotype files for each imputed variant in all samples. SNP variants included in the phased dataset also occur in the imputed files in this format.

The imputed data is provided in a compressed binary BGEN file format. The BGEN file format is a binary version of the GEN file format.

The BGEN file format was chosen to provide good compression of the imputed data and ease of use for genetic association testing against traits and phenotypes. For example, commonly used programs such as SNPTEST and PLINK already read BGEN files, and QCTOOL can be used to filter, summarize, manipulate and convert the files to other formats.

[Figure 2 panels (x-axis: information, 0.0 to 1.0; y-axis: frequency): (a) All variants; (b) MAF >= 5%, #SNPs = 7011470; (c) 1% <= MAF < 5%, #SNPs = 2889302; (d) 0.1% <= MAF < 1%, #SNPs = 10051623; (e) 0.01% <= MAF < 0.1%, #SNPs = 26262886; (f) 0.001% <= MAF < 0.01%, #SNPs = 26140277.]

The format stores one variant at a time (i.e. per row). As MAF decreases more compression is possible due to increased similarity between imputed genotypes across samples. The total size of the UKB interim release dataset is 1.3Tb, with each chromosome file ranging in size from 20Gb to 109Gb. As the file format is binary the files are not viewable in normal text editors. Later in this document there is advice and guidance on working with these files.

The files are named as

chrNimpv1.bgen

where N is the number of the autosome (N = 1, ..., 22).
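When scripting over the per-chromosome files it can help to generate the names from the pattern above. This is a small sketch; the function name is hypothetical and no directory layout is assumed.

```python
def imputed_bgen_filenames():
    """File names of the 22 autosomal BGEN files, following the chrNimpv1.bgen pattern."""
    return [f"chr{n}impv1.bgen" for n in range(1, 23)]


filenames = imputed_bgen_filenames()
```

The list starts at chromosome 1 and ends at chromosome 22; chromosome X is not included in this release.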

RSIDs were added into the BGEN files for as many variants as possible using RSID lists available from the UK10K website and the 1000 Genomes website.

RSIDs are useful, unique identifiers of SNPs and other variants and can be looked up in the dbSNP database. When researchers report associations of variants with diseases and traits they normally report the results using the RSID.

Variant positions are reported in Genome Reference Consortium Human genome build 37 co-ordinates (GRCh37).

Sample files

In addition to the 22 autosomal BGEN files, there is a file called impv1.sample

This file (referred to as the 'sample file') is the part of the BGEN file format that stores information about each sample in the dataset. The format of this file is described on the GEN file format webpage.

The sample file has two header lines, followed by 1 line for each individual in the BGEN file. The order of the individuals in the sample file matches the order of the individuals in the BGEN file. The order is important. Programs that read bgen/sample pairs assume that the order matches between the files.

The sample file can be used to store information about each individual, i.e. phenotypes and covariates. If phenotypes and covariates are added into the sample file then SNPTEST can be used to carry out association testing at each variant. Care should be taken in making sure that such information is correctly added to sample files. The format allows discrete and continuous phenotypes and covariates, as well as missing values (see the file format webpage link above).
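A minimal sketch of reading a sample file and checking its order is shown below. The source only states that there are two header lines followed by one line per individual; the whitespace-separated layout, the column names (ID_1, ID_2, missing), the type codes on the second line and the NA missing-value code are assumptions based on the GEN sample format, and the IDs and phenotype values are invented for illustration.

```python
def parse_sample_file(text):
    """Split sample-file text into column names, column types and per-individual rows."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("a sample file needs two header lines")
    names, types, data = rows[0], rows[1], rows[2:]
    if len(types) != len(names):
        raise ValueError("the two header lines must have the same number of columns")
    for row in data:
        if len(row) != len(names):
            raise ValueError("every individual needs one value per column")
    return names, types, [dict(zip(names, row)) for row in data]


def same_order(sample_rows, genotype_ids, id_column="ID_1"):
    """Check that sample-file rows are in the same order as a list of genotype-file IDs."""
    return [row[id_column] for row in sample_rows] == list(genotype_ids)


# Hypothetical sample file with one continuous phenotype column.
EXAMPLE = """ID_1 ID_2 missing height
0 0 0 P
s1 s1 0 172.5
s2 s2 0 NA
"""
names, types, rows = parse_sample_file(EXAMPLE)
```

Running a check like same_order after adding phenotypes guards against rows having been sorted or merged into a different order than the BGEN file.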

Differences between raw genotypes and imputed files

SNPs below 1% MAF were filtered out before the phasing step; however, many of these SNPs will have been imputed. Therefore these SNPs will appear in the raw genotype files and in the imputed files, but may have different genotypes. As such, researchers should not be surprised if the results of analysis at these SNPs differ dependent upon which files are used.


An exemplar genome-wide association study

A GWAS for the phenotype of height was carried out to assess the use of the UK Biobank genetic data as a resource for genetic association studies. There are already a substantial number of replicated associations for height [11]. The purpose of this analysis was not to report new associations, but rather to check that a reasonably standard GWAS pipeline produced valid results.

Sample filtering

Principal component analysis and the self-declared ethnicity were used to derive a "White British" subset of samples. In addition, samples were excluded if they

(a) had at least one related sample
(b) had a genetically inferred gender that did not match the self-reported gender
(c) were among the ~500 extreme outliers [3].

These filters resulted in a dataset with 112,338 samples.

Taking account of the different arrays used

Some SNPs are only included on one of the UKB or UKBL arrays. At such SNPs, missing genotypes will have been imputed as part of the phasing process, so that these SNPs will consist of a mixture of genotyped and imputed genotypes. This can lead to bias in association testing if there is some correlation between the phenotype and which array a sample was assayed on. The samples involved in the UKBL study were selected based on phenotypes associated with lung function [12], thus it may be possible for such associations to occur. There are at least 2 solutions to ameliorate any possible confounding due to array:

a. carry out association tests conditioning on a binary indicator of array.

b. carry out separate tests of association in UKB samples and UKBL samples and combine the results using meta-analysis.
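Option (b) can be sketched with an inverse-variance weighted fixed-effect meta-analysis of the per-array effect estimates. The document does not say which meta-analysis method would be used, so this particular method is an assumption chosen because it is a common default; the effect sizes and standard errors below are invented for illustration.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs by inverse-variance weighting.

    Returns the pooled effect estimate and its standard error.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Hypothetical per-array results at one SNP: (beta, standard error).
ukb_result = (0.10, 0.02)
ukbl_result = (0.14, 0.02)
pooled_beta, pooled_se = fixed_effect_meta([ukb_result, ukbl_result])
```

Because each array is analysed separately, any difference in phenotype distribution between the arrays cannot induce a spurious association at SNPs that are genotyped on one array and imputed on the other.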

Association testing

GWAS was performed at all variants using SNPTEST. An additive genetic model was fitted at each SNP, using gender, age, array and 10 principal components as covariates. That is, the example uses option (a) above. The program option -method expected was used in the SNPTEST software, which converts the genotype probability triple to an expected genotype, $d_{ij}$ (often called the dosage), calculated as

$$d_{ij} = \sum_{k=0}^{2} k \, p_{ijk}$$
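The dosage calculation is a weighted sum of the probability triple, with weights 0, 1 and 2 for genotypes AA, AB and BB. The sketch below is a direct transcription of that sum; the function name and the tolerance used to check that the triple sums to 1 are choices made for this illustration.

```python
def dosage(p_aa, p_ab, p_bb, tol=1e-6):
    """Expected genotype (number of B alleles) from a genotype probability triple."""
    if min(p_aa, p_ab, p_bb) < 0:
        raise ValueError("probabilities must be non-negative")
    if abs(p_aa + p_ab + p_bb - 1.0) > tol:
        raise ValueError("the probability triple must sum to 1")
    # d = 0 * P(AA) + 1 * P(AB) + 2 * P(BB)
    return 0 * p_aa + 1 * p_ab + 2 * p_bb


# Hypothetical probability triple for one sample at one variant.
d = dosage(0.1, 0.3, 0.6)
```

A confidently imputed AA genotype gives a dosage of 0 and a confident BB gives 2, while uncertain calls fall in between.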


Results

The GWAS for height produced a substantial number of associated regions. These regions had a high correspondence to those genetic regions that have previously been replicated for height and described in the NHGRI GWAS Catalog [11]. The analysis suggested a significant number of novel loci could be identified. Figure 3 shows a plot of the -log10 p-values for the height and BMI scans on chromosome 4.

Figure 3: Chromosome 4 GWAS for height. The x-axis shows physical position. The y-axis is the -log10 p-value for each tested variant. Variants on the array are shown as black dots, imputed variants are shown as grey dots. Reported associations from the NHGRI GWAS Catalog are shown as red crosses. The blue and red horizontal lines are drawn at a -log10 p-value of 5 and 7.5 respectively.

File processing

We recommend that researchers use the QCTOOL program to handle the BGEN files. This program has options for extraction or removal of subsets of the data (SNPs and/or samples), and for file format conversion. See the QCTOOL examples page for information on command lines used to perform specific tasks. The program SNPTEST can process BGEN files. It will automatically detect the BGEN file format if data files are named with the .bgen extension. PLINK v1.9 can process BGEN files; at the time of writing BGEN files are specified using the --bgen option. For further information on tools supporting the BGEN format, see the BGEN file format website.


References

1. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010).
2. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet. 44, 955–959 (2012).
3. The UK Biobank. UK Biobank Genotyping QC documentation. (2015).
4. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).
5. O'Connell, J., Sharp, K., Delaneau, O. & Marchini, J. Haplotype estimation for biobank scale datasets. (2015) (submitted).
6. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat. Genet. 40, 1068–1075 (2008).
7. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousands of genotyped samples. Am. J. Hum. Genet. 91, 238–251 (2012).
8. The UK Biobank Array Design Group. UK Biobank Axiom Array Content Summary. (2014).
9. Huang, J. et al. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel. Nature Communications 6, 8111 (2015).
10. Howie, B., Marchini, J. & Stephens, M. Genotype imputation with thousands of genomes. G3 (Bethesda) 1, 457–470 (2011).
11. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucl. Acids Res. 42, D1001–6 (2014).
12. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir Med 3, 769–781 (2015).