
Page 1:

Ricopili: Imputation Module

WCPG Education Day

Stephan Ripke / Raymond Walters

Toronto, October 2015

Page 2:

Ricopili Overview

Page 3:

Ricopili Overview

Page 4:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 5:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 6:

Why Impute?

• Meta-analysis
  – Smooth out array differences

• Fine mapping
  – Many more markers to look at

• Fill in missing data

• Add non-SNP variation
  – e.g. small indels

Page 7:

Imputation Overview

Tasks in imputation module

1. Align genotype data to reference

2. Pre-phase haplotypes from genotypes

3. Impute using reference panel

4. Get imputation results

– Dosages

– Best guess genotypes

– Info scores

Page 8:

Imputation Details

Workflow of the full module on GitHub: https://github.com/Nealelab/ricopili/wiki

Page 9:

Aligning to reference

• Need to ensure genotypes are aligned to the reference panel before phasing
  – Same genome build
    • LiftOver if needed
  – Same mapping of genetic locations
  – Resolve any strand flips and allele swaps
    • Careful handling of strand-ambiguous SNPs
    • Consult population allele frequencies

• Ricopili automates this process

Page 10:

How does imputation work?

• We can model our observed genotypes as a mosaic of the reference haplotypes

Marchini & Howie, 2010, Nature Reviews Genetics

Page 11:

How does imputation work?

• Different algorithms exist for modeling the haplotypes underlying the genotypes
  – MaCH: Hidden Markov Model
  – Impute2: HMM (phasing) + MCMC (uncertainty)
  – BEAGLE: graphical model

• Ricopili uses Impute2 (+ShapeIt) by default
  – Formerly all three algorithms were integrated; they can be integrated again if desired

Page 12:

Common Reference Panels

• 1000 Genomes
  – Phase 1
    • chrX as a separate reference panel
  – Phase 3
    • Additional populations vs. Phase 1
      – South Asia
      – Additional African populations

• HLA with amino acids (Paul de Bakker)

• Easy to integrate another reference (at the administrator level)

Page 13:

Imputation, 11 steps

1) Guess genome build

2) Align positions

3) Align alleles

4) Cut into genomic chunks

5) Prephasing

6) Imputation

7) Data Reformatting

8) Postimputation QC / Best Guess

9) Genome wide best guess

10) Clean

11) Evaluate hard disk usage (~40 MB / ID, i.e. ~40 GB / 1000 IDs)

• 1000 individuals: 4 hours
• 15,000 individuals: 48 hours

Steps 5, 6 and 7 take > 90% of the computing resources of this module

Page 14:

Divide datasets for Parallel Imputation

• 929 genomic chunks of 5 Mb each
  – Overlapping window of 1 Mb on each side toward the next chunk
  – So the central 3 Mb of each chunk are kept for downstream analyses (see the sketch after this list)

• 929 parallel jobs for each dataset (Nd)
  – In total, 929 x Nd parallel jobs get sent for each step
  – More if N > 1500

• Total time depends on how free the cluster is
  – Prephasing and imputation each take up to several hours per job
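This geometry is easy to reproduce. Below is a minimal sketch in Python (illustrative only; the helper and its names are not Ricopili's actual code) that generates chunk labels of the kind used later (e.g. chr22_048_051) together with the wider window each job actually imputes:

    def make_chunks(chrom, chrom_len_mb, core_mb=3, flank_mb=1):
        # Each chunk keeps a 3 Mb core; a 1 Mb flank on each side overlaps
        # the neighbouring chunks, giving the 5 Mb window that is imputed.
        chunks = []
        for core_start in range(0, chrom_len_mb, core_mb):
            core_end = min(core_start + core_mb, chrom_len_mb)
            label = f"{chrom}_{core_start:03d}_{core_end:03d}"
            window = (max(0, core_start - flank_mb), core_end + flank_mb)
            chunks.append((label, window))
        return chunks

    # make_chunks("chr22", 51)[-1] -> ("chr22_048_051", (47, 52)):
    # the job imputes 47-52 Mb, downstream analyses keep 48-51 Mb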

Page 15:

Ricopili Prephasing Jobs

• Phasing for each of the 929 genomic chunks
  – 929 parallel jobs sent for each dataset (Nd)

• Total time depends on how free the cluster is
  – Depends strongly on dataset size (individuals are not split for phasing)
  – Each job takes up to several hours
  – One of the two most time-consuming steps (fewer, but longer jobs than imputation)

• Jobs that fail due to long runtime get re-sent with higher multithreading values*

* Not fully implemented on all infrastructures

Page 16:

Ricopili Imputation Jobs

• Individuals in each dataset get split into parts with max. 1500 individuals

• Minimum of 929 x Nd parallel jobs get sent

– If datasets with > 1500 individuals are present, the number of parallel jobs rises significantly (see the sketch after this list)

• Total time depends on how free the cluster is

– Each job takes up to several hours

– One of the two most time-consuming steps (more, but shorter jobs than prephasing)
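The job arithmetic can be made concrete with a small sketch (a hypothetical helper, not part of Ricopili): 929 chunks times one part per block of at most 1500 individuals, summed over all Nd datasets.

    import math

    def n_imputation_jobs(dataset_sizes, n_chunks=929, split_size=1500):
        # 929 chunk jobs for every <=1500-individual part of every dataset;
        # with all datasets <= 1500 this reduces to the stated 929 x Nd minimum
        return n_chunks * sum(math.ceil(n / split_size) for n in dataset_sizes)

    # Three datasets of 900, 1500 and 4000 individuals:
    print(n_imputation_jobs([900, 1500, 4000]))  # 929 * (1 + 1 + 3) = 4645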

Page 17:

Ricopili Post-Imputation Processing

• Datasets with > 1500 individuals are re-merged

• The 3 genotype probabilities are reduced to 2 – saves 1/3 of the hard disk space

• Split into qc1 and qc1f; the probabilities of qc1f are deleted

• Outdated: best guess per chunk (combined to whole genome at a later point)

• 929 x Nd parallel jobs get sent
  – A lot of I/O, so restricted to 100 parallel jobs*
  – Total time within hours

* Not fully implemented on all infrastructures

Page 18:

What do we get from imputation?

• Dosages
  – Per SNP, per individual: the probability of each genotype
    • E.g. 1.5% aa, 98% Aa, 0.5% AA

• Best guess genotypes
  – Genotype with the highest probability, subject to a minimum threshold (default 0.8)
    • E.g. for the above dosage: Aa
    • If the highest probability is below the threshold, the genotype is set to missing
  – Different missing-rate and frequency filters for different purposes:
    a) No additional filter (~10M SNPs)
    b) Loose filter for SNP analyses (5-8M SNPs)
    c) Strict filter for PCA analysis (2-4M SNPs)

• Info scores
  – Ratio of variances (observed / expected)
  – Metric of imputation quality for each SNP
  – Scaled roughly 0-1
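Both quantities are easy to sketch. The code below is illustrative only: the best-guess rule follows the default 0.8 threshold stated above, while the info score implements the generic observed/expected variance ratio, which differs in detail from the exact formula Impute2 uses.

    from statistics import pvariance

    def best_guess(p_aa, p_het, p_AA, threshold=0.8):
        # Call the most probable genotype; below the threshold it is missing.
        best_p, genotype = max((p_aa, "aa"), (p_het, "Aa"), (p_AA, "AA"))
        return genotype if best_p >= threshold else None

    def info_score(dosages):
        # dosages: expected A-allele counts (p_het + 2 * p_AA) per individual.
        # Observed variance of the dosages over the variance expected at the
        # estimated allele frequency, 2p(1-p); scaled roughly 0-1.
        p = sum(dosages) / (2 * len(dosages))
        expected = 2 * p * (1 - p)
        return pvariance(dosages) / expected if expected > 0 else 0.0

    print(best_guess(0.015, 0.98, 0.005))  # the slide's example -> Aa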

Page 19:

Outline for this session

• Ricopili’s approach to imputation

– Reference alignment

– Efficient imputation

– Post-processing

• Usage and output structure

Page 20:

Output structure - Overview

Page 21:

Ricopili Output Structure

Page 22:

Ricopili Output Structure

Page 23:

Ricopili Output Structure

• QC1
  • Dosages
  • Very light QC
    • Info > 0.1
    • MAF > 0.005

• QC1f
  • Dosages failing the light QC
  • Dosages are not kept, just SNP lists (meta-information found in subdir "info")

Page 24:

Ricopili Output Structure

• BG
  • Best guess genotypes
  • Light QC
    • Missing rate < 2%

• BGS
  • Best guess with stricter QC
    • Missing rate < 1%
    • MAF > 5%

• BGN
  • Best guess, no QC
  • For comparison to the qc1 dosages

Page 25:

Ricopili Output Structure

• Info
  • Info scores, original output files from the imputation algorithm (Impute2)

Page 26:

Ricopili Output Structure

Whole genome best-guess genotypes:

Three units for each dataset (see above)

• bg

• bgn

• bgs

Page 27:

Ricopili Output Structure

pcaer_sub:

• Contains the collection of whole genome best-guess genotypes with strict QC

• Contains a README with instructions on how to start the PCA pipeline to get:
  • Covariates over all datasets
  • A deduplicated ID collection over all datasets

Page 28:

Ricopili Output Structure

dasu_*:

• Used for intermediate dosages

• Important meta-files are kept and zipped; otherwise empty

errandout:

• Keeps output of jobs from mother scripts (not working scripts)

Blueprint_bak:

• Backup of the job-starting commands; keeps the root directory clean without losing information

Page 29:

Ricopili Output Structure

pi_sub:

• Used for intermediate steps:
  • Aligning
  • Chunking
  • Prephasing
  • Imputation

• Important meta-files are kept and zipped

• Look for *job* files to get a list of the scripts that got sent to the queue

• errandout:
  • Keeps output of jobs from working scripts

Page 30:

Output structure - Details

Page 31:

Detailed look at pi_sub (*job* files)

• buigue: guesses the genome build and lifts SNPs to hg19 if necessary; does not change the number of SNPs (see the sketch after this list)
  – *noma_comp: details for distinguishing between builds
  – *noma: details for non-matching SNPs
  – *buigue: best matching build
  – *liftover_script: the liftover script
  – *liftover: the final liftover command (if it ran)
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.bim/bed/fam
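The build-guessing idea can be sketched as follows (hypothetical code; the function, its inputs, and the per-build position tables are illustrative, not how buigue is actually implemented): count position matches against each candidate build and pick the best one; non-matching SNPs end up in the *noma lists.

    def guess_build(bim_positions, build_tables):
        # bim_positions: {snp: (chrom, pos)} from the dataset's .bim file
        # build_tables: {build: {snp: (chrom, pos)}}, e.g. for hg18 and hg19
        def matches(table):
            return sum(table.get(snp) == cp for snp, cp in bim_positions.items())
        return max(build_tables, key=lambda build: matches(build_tables[build]))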

Page 32:

Detailed look at pi_sub (*job* files)

• Chepos ("checkpos6"): (see the sketch after this list)
  – Extracts rs-names from SNP names ("PsychChip_15048346_B|newrs111033171"): *.hg19.bim.ow.det
  – Extracts positions from SNP names ("PsychChip_15048346_B|chr12_57594552"): *.hg19.bim.ow.det
  – Translates a position into an SNP name if the SNP name is not found in the reference: *.hg19.bim.addpos.det
  – Translates an SNP name into a position: *.hg19.bim.xchr / xkb
  – Removes SNPs not found in the reference: *.hg19.bim.npos
  – All detailed files in a tar-ball: *.ch.tar.gz
  – Summary report: *.hg19.ch.report
  – Collection of commands: *.hg19.bim.chepos.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.bim/bed/fam
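The name-rescue step is essentially pattern extraction. A hedged sketch (the regexes are illustrative, not the exact rules checkpos6 applies):

    import re

    def rescue_snp_name(name):
        m = re.search(r"(rs\d+)", name)
        if m:
            return "rs", m.group(1)                      # embedded rs-name
        m = re.search(r"chr(\d+|X|Y|MT)_(\d+)", name)
        if m:
            return "pos", (m.group(1), int(m.group(2)))  # embedded position
        return None, None

    print(rescue_snp_name("PsychChip_15048346_B|newrs111033171"))
    # -> ('rs', 'rs111033171')
    print(rescue_snp_name("PsychChip_15048346_B|chr12_57594552"))
    # -> ('pos', ('12', 57594552))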

Page 33:

Detailed look at pi_sub (*job* files)

• Checkflip ("checkflip4"): (see the sketch after this list)
  – Flips unambiguous SNPs (non-AT, non-CG) (*.fli)
  – Ambiguous SNPs (AT, CG):
    • Very common ones (default MAF > 0.4) are removed (*.uif)
    • Others are flipped by aligning the minor allele (*.fli) ***
  – Removes SNPs with non-matching alleles (*.xal), e.g. rs13172324 CT vs. reference CG
  – Removes SNPs with a big frequency difference to the reference (default 15%) (*.bf) ***
  – All detailed files in a tar-ball: *.ch.fl.tar.gz
  – Summary report: *.hg19.ch.fl.report
  – Collection of commands: *.hg19.bim.chefli.cmd
  – Resulting dataset: mix_gpc1_eur_sr-qc.hg19.ch.fl.bim/bed/fam

*** the reference population needs to fit the dataset
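A condensed sketch of these rules (illustrative Python with the slide's default thresholds; checkflip4's actual logic lives in the pipeline itself; a1 and ref_a1 denote the minor alleles):

    COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def align_snp(a1, a2, maf, ref_a1, ref_a2, ref_maf,
                  ambig_cut=0.4, freq_cut=0.15):
        alleles, ref = {a1, a2}, {ref_a1, ref_a2}
        if alleles in ({"A", "T"}, {"C", "G"}):        # strand-ambiguous
            if maf > ambig_cut:
                return "remove"                        # too common -> *.uif
            return "keep" if a1 == ref_a1 else "flip"  # align minor allele -> *.fli
        if alleles != ref and {COMP[a1], COMP[a2]} != ref:
            return "remove"                            # alleles don't match -> *.xal
        if abs(maf - ref_maf) > freq_cut:
            return "remove"                            # big freq difference -> *.bf
        return "keep" if alleles == ref else "flip"    # unambiguous flip -> *.fli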

Page 34:

Detailed look at pi_sub (*job* files)

• Chuck ("my.chuck2"):
  – Extracts one chunk of the genome (based on the 929 chunks in the reference) and saves it in a subdir ("subfile_*")
  – Labels the chunk with its genomic location:
    • chr22_048_051 has SNPs on chromosome 22 from 47 Mb to 52 Mb; after imputation, 1 Mb on each side will be chopped off
  – If the chunk contains no SNPs, saves that information in a subdir ("empty_*")

Page 35:

Detailed look at pi_sub (*job* files)

• prephasing ("my.preph"):
  – Prephases one chunk with ShapeIt2 (https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html)
  – Output directory: haps_*
  – All individuals together
  – Temporarily uses different individual IDs, since ShapeIt does not accept long FIDs
  – After prephasing, splits into units of max. 1500 individuals (default)
  – Records info about multithreading (so it can be increased on the next try)
  – ShapeIt command: *.shapeit.cmd
  – Split command: *.split.cmd

Page 36:

Detailed look at pi_sub (*job* files)

• imputation ("my.imp2.3"):
  – Imputation of the prephased (and possibly split) chunk against the worldwide imputation reference
  – Impute2 (https://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
  – Output directory: pi_*
  – Original Impute2 info scores: *_info
  – All other Impute2 meta output files are kept as well
  – Impute2 command: *.imp2.cmd

Page 37:

Detailed look at pi_sub (*job* files)

• Dosage format ("haps2dos3"): (see the sketch after this list)
  – Converts the 3 genotype probabilities into 2
  – Reintegrates the original identifiers
  – Creates a file (ngt) that keeps information about imputed vs. genotyped SNPs
  – Output directory: ../dasu_*
  – Plink commands: *.dos.cmd
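Why dropping one probability is lossless, in a two-line sketch (illustrative, not the haps2dos3 code): the three genotype probabilities sum to 1, so any two determine the third.

    def to_two(p_aa, p_het, p_AA):
        return p_aa, p_het                        # the third value is redundant

    def to_three(p_aa, p_het):
        return p_aa, p_het, 1.0 - p_aa - p_het    # recover it when needed

    assert to_three(*to_two(0.25, 0.5, 0.25)) == (0.25, 0.5, 0.25)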

Page 38:

Detailed look at pi_sub (*job* files)

• Postimputation QC and best guess ("daner_bg")

• Output directory: ../dasuqc1_*
  – Subdir QC1
    • Dosages that passed very light QC (default: info > 0.1, MAF > 0.005)
  – Subdir QC1f
    • SNP lists of failed QC
  – BGN
    • Best guess (highest probability, default threshold 0.8), no further QC
    • For comparison to dosages
  – BG
    • Best guess genotypes, light QC (missing rate < 2%)
    • For SNP analyses
  – BGS
    • Best guess with strict QC (missing rate < 1%, MAF > 5%)
    • For PCA
  – info
    • Info scores, original output files from the imputation algorithm (Impute2)

• Plink commands: *.bg.cmd

Page 39:

Detailed look at cobg_dir_genome_wide (*job* files)

• Whole genome best guess ("comb_bg_dir")

• Combines all 929 chunks within each dataset for BGN, BG, BGS

• Brings all datasets into one subdir

• Plink commands: *.wbg.cmd

Page 40:

Detailed look at pcaer_sub

• Whole genome best guess with strict filtering

• README.pcaer with the command to start pcaer on postimputation best-guess genotypes with all datasets combined

Page 41:

Detailed look at clean.job_list

• Cleans temporary subdirs, removing unnecessary files, packing up meta-files.

• du_out_*:

– Lists all subdirs and their hard disk usage.

Page 42:

Detailed look at reference_info

• Lists the version and location of the imputation reference, and also lists all genomic chunks

• Important starting file for the --refiex option

Page 43:

imputation options (--help)

• --phase and --refdir: choose the imputation reference

• --popname: define the reference population for frequency checks
  – eur, asn, amr, afr, asw
  – --sfh: frequency threshold for excluding common ambiguous SNPs
  – --fth: frequency difference threshold to the reference

• --triset: imputation of trios

• --spliha: change the max number of individuals going into the imputation engine

Page 44:

imputation options, cont.

• Postimputation QC
  – --info_th: info score
  – --freq_th: MAF
  – --bg_th: minimum probability to call a best guess

• --noclean: keep all intermediate files (e.g. for debugging)

• --force1: if the pipeline stopped and the problem seems solved now, restart with this option

• --sjamem_incr: increase memory for working jobs (e.g. when the memory request doesn't seem to be enough for big single datasets, > 10K)

• --refiex: exclude chunks from imputation
  – used with a copy of reference_info for single chunks (for debugging)

Page 45:

Wrap-up

• Questions?