
Page 1:

Ricopili: Imputation Module

WCPG Education Day

Stephan Ripke / Raymond Walters

Toronto, October 2015

Page 2:

Ricopili Overview

Page 3:

Ricopili Overview

Page 4:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 5:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 6:

Why Impute?

• Meta-analysis
  – Smooth out array differences

• Fine mapping
  – Many more markers to look at

• Fill in missing data

• Add non-SNP variation
  – e.g. small indels

Page 7:

Imputation Overview

Tasks in imputation module

1. Align genotype data to reference

2. Pre-phase haplotypes from genotypes

3. Impute using reference panel

4. Get imputation results

– Dosages

– Best guess genotypes

– Info scores

Page 8:

Imputation Details

Workflow of the full module on GitHub: https://github.com/Nealelab/ricopili/wiki

Page 9:

Aligning to reference

• Need to ensure genotypes are aligned to the reference panel before phasing
  – Same genome build
    • LiftOver if needed
  – Same mapping of genetic locations
  – Resolve any strand flips and allele swaps
    • Careful handling of strand-ambiguous SNPs
    • Consult population allele frequencies

• Ricopili automates this process

Page 10:

How does imputation work?

• We can model our observed genotypes as a mosaic of the reference haplotypes

Marchini & Howie, 2010, Nature Reviews Genetics

Page 11:

How does imputation work?

• Different algorithms exist for modeling the haplotypes underlying the genotypes
  – MaCH: Hidden Markov Model
  – Impute2: HMM (phasing) + MCMC (uncertainty)
  – BEAGLE: graphical model

• Ricopili uses Impute2 (+ShapeIt) by default
  – Formerly all three algorithms were integrated; they can be integrated again if desired

Page 12:

Common Reference Panels

• 1000 Genomes
  – Phase 1
    • chrX as a separate reference panel
  – Phase 3
    • Additional populations vs. Phase 1
      – South Asia
      – Additional African populations

• HLA with amino acids (Paul de Bakker)

• Easy to integrate another reference (at the administrator level)

Page 13:

Imputation, 11 steps

1) Guess genome build

2) Align positions

3) Align alleles

4) Cut into genomic chunks

5) Prephasing

6) Imputation

7) Data Reformatting

8) Postimputation QC / Best Guess

9) Genome wide best guess

10) Clean

11) Evaluate hard disk usage (~40 MB / ID, i.e. ~40 GB / 1000 IDs)

• 1000 individuals: 4 hours
• 15,000 individuals: 48 hours

Steps 5, 6 and 7 take > 90% of the computing resources of this module

Page 14:

Divide datasets for Parallel Imputation

• 929 genomic chunks of 5 Mb each
  – Overlapping window of 1 Mb on each side toward the next chunk
  – So the central 3 Mb of each chunk are kept for downstream analyses (see the sketch after this list)

• 929 parallel jobs for each dataset (Nd)
  – In total, 929 x Nd parallel jobs get sent for each step
  – More if N > 1500

• Total time depends on how free the cluster is
  – Prephasing and imputation each take up to several hours per job
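This geometry is easy to reproduce. Below is a minimal sketch in Python (illustrative only; the helper and its names are not Ricopili's actual code) that generates chunk labels of the kind used later (e.g. chr22_048_051) together with the wider window each job actually imputes:

    def make_chunks(chrom, chrom_len_mb, core_mb=3, flank_mb=1):
        # Each chunk keeps a 3 Mb core; a 1 Mb flank on each side overlaps
        # the neighbouring chunks, giving the 5 Mb window that is imputed.
        chunks = []
        for core_start in range(0, chrom_len_mb, core_mb):
            core_end = min(core_start + core_mb, chrom_len_mb)
            label = f"{chrom}_{core_start:03d}_{core_end:03d}"
            window = (max(0, core_start - flank_mb), core_end + flank_mb)
            chunks.append((label, window))
        return chunks

    # make_chunks("chr22", 51)[-1] -> ("chr22_048_051", (47, 52)):
    # the job imputes 47-52 Mb, downstream analyses keep 48-51 Mb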

Page 15:

Ricopili Prephasing Jobs

• Phasing for each of the 929 genomic chunks
  – 929 parallel jobs sent for each dataset (Nd)

• Total time depends on how free the cluster is
  – Depends strongly on dataset size (individuals are not split for phasing)
  – Each job takes up to several hours
  – One of the two most time-consuming steps (fewer, but longer jobs than imputation)

• Jobs that fail due to long runtime get re-sent with higher multithreading values*

* Not fully implemented on all infrastructures

Page 16:

Ricopili Imputation Jobs

• Individuals in each dataset get split into parts with max. 1500 individuals

• Minimum of 929 x Nd parallel jobs get sent

– If datasets with > 1500 individuals are present, the number of parallel jobs rises significantly (see the sketch after this list)

• Total time depends on how free the cluster is

– Each job takes up to several hours

– One of the two most time-consuming steps (more, but shorter jobs than prephasing)
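The job arithmetic can be made concrete with a small sketch (a hypothetical helper, not part of Ricopili): 929 chunks times one part per block of at most 1500 individuals, summed over all Nd datasets.

    import math

    def n_imputation_jobs(dataset_sizes, n_chunks=929, split_size=1500):
        # 929 chunk jobs for every <=1500-individual part of every dataset;
        # with all datasets <= 1500 this reduces to the stated 929 x Nd minimum
        return n_chunks * sum(math.ceil(n / split_size) for n in dataset_sizes)

    # Three datasets of 900, 1500 and 4000 individuals:
    print(n_imputation_jobs([900, 1500, 4000]))  # 929 * (1 + 1 + 3) = 4645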

Page 17:

Ricopili Post-Imputation Processing

• Datasets with > 1500 individuals are re-merged

• The 3 genotype probabilities are reduced to 2 – saves 1/3 of the hard disk space

• Split into qc1 and qc1f; the probabilities of qc1f are deleted

• Outdated: best guess per chunk (combined to whole genome at a later point)

• 929 x Nd parallel jobs get sent
  – A lot of I/O, so restricted to 100 parallel jobs*
  – Total time within hours

* Not fully implemented on all infrastructures

Page 18:

What do we get from imputation?

• Dosages
  – Per SNP, per individual: the probability of each genotype
    • E.g. 1.5% aa, 98% Aa, 0.5% AA

• Best guess genotypes
  – Genotype with the highest probability, subject to a minimum threshold (default 0.8)
    • E.g. for the above dosage: Aa
    • If the highest probability is below the threshold, the genotype is set to missing
  – Different missing-rate and frequency filters for different purposes:
    a) No additional filter (~10M SNPs)
    b) Loose filter for SNP analyses (5-8M SNPs)
    c) Strict filter for PCA analysis (2-4M SNPs)

• Info scores
  – Ratio of variances (observed / expected)
  – Metric of imputation quality for each SNP
  – Scaled roughly 0-1
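Both quantities are easy to sketch. The code below is illustrative only: the best-guess rule follows the default 0.8 threshold stated above, while the info score implements the generic observed/expected variance ratio, which differs in detail from the exact formula Impute2 uses.

    from statistics import pvariance

    def best_guess(p_aa, p_het, p_AA, threshold=0.8):
        # Call the most probable genotype; below the threshold it is missing.
        best_p, genotype = max((p_aa, "aa"), (p_het, "Aa"), (p_AA, "AA"))
        return genotype if best_p >= threshold else None

    def info_score(dosages):
        # dosages: expected A-allele counts (p_het + 2 * p_AA) per individual.
        # Observed variance of the dosages over the variance expected at the
        # estimated allele frequency, 2p(1-p); scaled roughly 0-1.
        p = sum(dosages) / (2 * len(dosages))
        expected = 2 * p * (1 - p)
        return pvariance(dosages) / expected if expected > 0 else 0.0

    print(best_guess(0.015, 0.98, 0.005))  # the slide's example -> Aa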

Page 19:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 20:

Output structure - Overview

Page 21:

Ricopili Output Structure

Page 22:

Ricopili Output Structure

Page 23:

Ricopili Output Structure

• QC1
  • Dosages
  • Very light QC
    • Info > 0.1
    • MAF > 0.005

• QC1f
  • Dosages failing the light QC
  • Dosages are not kept, just SNP lists (meta-information found in subdir "info")

Page 24:

Ricopili Output Structure

• BG
  • Best guess genotypes
  • Light QC
    • Missing rate < 2%

• BGS
  • Best guess with stricter QC
    • Missing rate < 1%
    • MAF > 5%

• BGN
  • Best guess, no QC
  • For comparison to the qc1 dosages

Page 25:

Ricopili Output Structure

• Info
  • Info scores, original output files from the imputation algorithm (Impute2)

Page 26:

Ricopili Output Structure

Whole genome best-guess genotypes:

Three units for each dataset (see above)

• bg

• bgn

• bgs

Page 27:

Ricopili Output Structure

pcaer_sub:

• Contains the collection of whole genome best-guess genotypes with strict QC

• Contains a README with instructions on how to start the PCA pipeline to get:
  • Covariates over all datasets
  • A deduplicated ID collection over all datasets

Page 28:

Ricopili Output Structure

dasu_*:

• Used for intermediate dosages

• Important meta-files are kept and zipped; otherwise empty

errandout:

• Keeps output of jobs from mother scripts (not working scripts)

Blueprint_bak:

• Backup of the job-starting commands; keeps the root directory clean without losing information

Page 29:

Ricopili Output Structure

pi_sub:

• Used for intermediate steps:
  • Aligning
  • Chunking
  • Prephasing
  • Imputation

• Important meta-files are kept and zipped

• Look for *job* files to get a list of the scripts that got sent to the queue

• errandout:
  • Keeps output of jobs from working scripts

Page 30:

Output structure - Details

Page 31:

Detailed look at pi_sub (*job* files)

• buigue: guesses the genome build and lifts SNPs to hg19 if necessary; does not change the number of SNPs (see the sketch after this list)
  – *noma_comp: details for distinguishing between builds
  – *noma: details for non-matching SNPs
  – *buigue: best matching build
  – *liftover_script: the liftover script
  – *liftover: the final liftover command (if it ran)
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.bim/bed/fam
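The build-guessing idea can be sketched as follows (hypothetical code; the function, its inputs, and the per-build position tables are illustrative, not how buigue is actually implemented): count position matches against each candidate build and pick the best one; non-matching SNPs end up in the *noma lists.

    def guess_build(bim_positions, build_tables):
        # bim_positions: {snp: (chrom, pos)} from the dataset's .bim file
        # build_tables: {build: {snp: (chrom, pos)}}, e.g. for hg18 and hg19
        def matches(table):
            return sum(table.get(snp) == cp for snp, cp in bim_positions.items())
        return max(build_tables, key=lambda build: matches(build_tables[build]))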

Page 32:

Detailed look at pi_sub (*job* files)

• Chepos ("checkpos6"): (see the sketch after this list)
  – Extracts rs-names from SNP names ("PsychChip_15048346_B|newrs111033171"): *.hg19.bim.ow.det
  – Extracts positions from SNP names ("PsychChip_15048346_B|chr12_57594552"): *.hg19.bim.ow.det
  – Translates a position into an SNP name if the SNP name is not found in the reference: *.hg19.bim.addpos.det
  – Translates an SNP name into a position: *.hg19.bim.xchr / xkb
  – Removes SNPs not found in the reference: *.hg19.bim.npos
  – All detailed files in a tar-ball: *.ch.tar.gz
  – Summary report: *.hg19.ch.report
  – Collection of commands: *.hg19.bim.chepos.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.bim/bed/fam
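The name-rescue step is essentially pattern extraction. A hedged sketch (the regexes are illustrative, not the exact rules checkpos6 applies):

    import re

    def rescue_snp_name(name):
        m = re.search(r"(rs\d+)", name)
        if m:
            return "rs", m.group(1)                      # embedded rs-name
        m = re.search(r"chr(\d+|X|Y|MT)_(\d+)", name)
        if m:
            return "pos", (m.group(1), int(m.group(2)))  # embedded position
        return None, None

    print(rescue_snp_name("PsychChip_15048346_B|newrs111033171"))
    # -> ('rs', 'rs111033171')
    print(rescue_snp_name("PsychChip_15048346_B|chr12_57594552"))
    # -> ('pos', ('12', 57594552))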

Page 33:

Detailed look at pi_sub (*job* files)

• Checkflip ("checkflip4"): (see the sketch after this list)
  – Flips unambiguous SNPs (non-AT, non-CG) (*.fli)
  – Ambiguous SNPs (AT, CG):
    • Very common ones (default MAF > 0.4) are removed (*.uif)
    • Others are flipped by aligning the minor allele (*.fli) ***
  – Removes SNPs with non-matching alleles (*.xal), e.g. rs13172324 CT vs. reference CG
  – Removes SNPs with a big frequency difference to the reference (default 15%) (*.bf) ***
  – All detailed files in a tar-ball: *.ch.fl.tar.gz
  – Summary report: *.hg19.ch.fl.report
  – Collection of commands: *.hg19.bim.chefli.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.fl.bim/bed/fam

*** the reference population needs to fit the dataset
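A condensed sketch of these rules (illustrative Python with the slide's default thresholds; checkflip4's actual logic lives in the pipeline itself; a1 and ref_a1 denote the minor alleles):

    COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def align_snp(a1, a2, maf, ref_a1, ref_a2, ref_maf,
                  ambig_cut=0.4, freq_cut=0.15):
        alleles, ref = {a1, a2}, {ref_a1, ref_a2}
        if alleles in ({"A", "T"}, {"C", "G"}):        # strand-ambiguous
            if maf > ambig_cut:
                return "remove"                        # too common -> *.uif
            return "keep" if a1 == ref_a1 else "flip"  # align minor allele -> *.fli
        if alleles != ref and {COMP[a1], COMP[a2]} != ref:
            return "remove"                            # alleles don't match -> *.xal
        if abs(maf - ref_maf) > freq_cut:
            return "remove"                            # big freq difference -> *.bf
        return "keep" if alleles == ref else "flip"    # unambiguous flip -> *.fli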

Page 34:

Detailed look at pi_sub (*job* files)

• Chuck ("my.chuck2"):
  – Extracts one chunk of the genome (based on the 929 chunks in the reference) and saves it in a subdir ("subfile_*")
  – Labels the chunk with its genomic location:
    • chr22_048_051 has SNPs on chromosome 22 from 47 Mb to 52 Mb; after imputation, 1 Mb on each side will be chopped off
  – If the chunk contains no SNPs, saves that information in a subdir ("empty_*")

Page 35:

Detailed look at pi_sub (*job* files)

• prephasing ("my.preph"):
  – Prephases one chunk with ShapeIt2 (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html)
  – Output directory: haps_*
  – All individuals together
  – Temporarily uses different individual IDs, since ShapeIt does not accept long FIDs
  – After prephasing, splits into units of max. 1500 individuals (default)
  – Records info about multithreading (so it can be increased on the next try)
  – ShapeIt command: *.shapeit.cmd
  – Split command: *.split.cmd

Page 36:

Detailed look at pi_sub (*job* files)

• imputation ("my.imp2.3"):
  – Imputation of the prephased (and possibly split) chunk against the worldwide imputation reference
  – Impute2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
  – Output directory: pi_*
  – Original Impute2 info scores: *_info
  – All other Impute2 meta output files are kept as well
  – Impute2 command: *.imp2.cmd

Page 37:

Detailed look at pi_sub (*job* files)

• Dosage format ("haps2dos3"): (see the sketch after this list)
  – Converts the 3 genotype probabilities into 2
  – Reintegrates the original identifiers
  – Creates a file (ngt) that keeps information about imputed vs. genotyped SNPs
  – Output directory: ../dasu_*
  – Plink commands: *.dos.cmd
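Why dropping one probability is lossless, in a two-line sketch (illustrative, not the haps2dos3 code): the three genotype probabilities sum to 1, so any two determine the third.

    def to_two(p_aa, p_het, p_AA):
        return p_aa, p_het                        # the third value is redundant

    def to_three(p_aa, p_het):
        return p_aa, p_het, 1.0 - p_aa - p_het    # recover it when needed

    assert to_three(*to_two(0.25, 0.5, 0.25)) == (0.25, 0.5, 0.25)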

Page 38:

Detailed look at pi_sub (*job* files)

• Postimputation QC and best guess ("daner_bg")

• Output directory: ../dasuqc1_*
  – Subdir QC1
    • Dosages that passed very light QC (default: info > 0.1, MAF > 0.005)
  – Subdir QC1f
    • SNP lists of failed QC
  – BGN
    • Best guess (highest probability, default threshold 0.8), no further QC
    • For comparison to dosages
  – BG
    • Best guess genotypes, light QC (missing rate < 2%)
    • For SNP analyses
  – BGS
    • Best guess with strict QC (missing rate < 1%, MAF > 5%)
    • For PCA
  – info
    • Info scores, original output files from the imputation algorithm (Impute2)

• Plink commands: *.bg.cmd

Page 39:

Detailed look at cobg_dir_genome_wide (*job* files)

• Whole genome best guess ("comb_bg_dir")

• Combines all 929 chunks within each dataset for BGN, BG, BGS

• Brings all datasets into one subdir

• Plink commands: *.wbg.cmd

Page 40:

Detailed look at pcaer_sub

• Whole genome best guess with strict filtering

• README.pcaer with the command to start pcaer on postimputation best-guess genotypes with all datasets combined

Page 41:

Detailed look at clean.job_list

• Cleans temporary subdirs, removing unnecessary files, packing up meta-files.

• du_out_*:

– Lists all subdirs and their hard disk usage.

Page 42:

Detailed look at reference_info

• Lists the version and location of the imputation reference, and also lists all genomic chunks

• Important starting file for the --refiex option

Page 43:

imputation options (--help)

• --phase and --refdir: choose the imputation reference

• --popname: define the reference population for frequency checks
  – eur, asn, amr, afr, asw
  – --sfh: frequency threshold for excluding common ambiguous SNPs
  – --fth: frequency difference threshold to the reference

• --triset: imputation of trios

• --spliha: change the max number of individuals going into the imputation engine

Page 44:

imputation options, cont.

• Postimputation QC
  – --info_th: info score
  – --freq_th: MAF
  – --bg_th: minimum probability to call a best guess

• --noclean: keep all intermediate files (e.g. for debugging)

• --force1: if the pipeline stopped and the problem seems solved now, restart with this option

• --sjamem_incr: increase memory for working jobs (e.g. when the memory request doesn't seem to be enough for big single datasets, > 10K)

• --refiex: exclude chunks from imputation
  – used with a copy of reference_info for single chunks (for debugging)

Page 45:

Wrap-up

• Questions?