xwas (version 2.0): a toolset for chromosome x-wide data...

XWAS (version 2.0): a toolset for chromosome X-wide data

analysis and association studies

Many members Alon Keinan’s lab have contributed between 2012 and 2017to the design, development, and testing of different versions of this

software package, as well as to the development of the analytical methods involved.

They are listed here in in alphabetical order:Leonardo Arbiza, Arjun Biddanda, Diana Chang, Feng Gao, Yingjie Guo,Lauren J. Lo, Li Ma, Yuhuan Qiu, Kaixiong Ye, Liang Zhang, Zilu Zhou

May 12, 2017

Contents

1 Background 2

2 Downloading and Extracting XWAS 2

3 Genotype Calling 33.1 Prerequisites and Set Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.2 Parameter File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.3 Individuals File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53.4 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53.5 Comparing Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53.6 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4 Quality-Control Pipeline 64.1 Pre-QC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.2 General Quality Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.3 X-Specific Quality Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84.4 Specifying the QC Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

5 Imputation 95.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95.2 Reference Files Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105.3 Parameter File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115.4 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

6 Post Imputation Quality Control 12

7 Association Testing with XWAS 127.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

8 Gene-based Association Testing on X 158.1 Automated Gene Based Association Tests . . . . . . . . . . . . . . . . . . . . . . 158.2 Single Gene Based Association Test . . . . . . . . . . . . . . . . . . . . . . . . . 16

9 Gene-based Gene-gene Interaction Testing 18

10 Troubleshooting 19

11 References 20

1

1 Background

This manual describes the implementation and use of the XWAS software package (chromosomeX-Wide Analysis toolSet, v2.0). The package is designed to perform single-marker and gene-based association analyses of chromosome X. It also includes quality control (QC) proceduresfor X-linked data, an imputation pipeline, and a genotype calling pipeline. For convenience,the abbreviation “X” is used to refer to chromosome X throughout the manual. XWAS version2.0 is based on version 1.1. In addition to various bug fixes, XWAS version 2.0 includes thefollowing new features:

1. A new pipeline for genotype calling from raw intensity data using BIRDSUITE.

2. A new script to compare the genotype calling of datasets.

3. Quality control pipeline for X chromosome now supports --quiet, --verbose, and --debug

flags.

4. Improved troubleshooting, including a script to check the genome build and strand align-ment of a dataset.

For more details about v1.0, please refer to:

1. Chang D, Gao F, Slavney A, Ma L, Waldman YY, Sams AJ, Billing-Ross P, Madar A,Spritz R, Keinan A. 2014. No eXceptions: Accounting for the X chromosome in GWAS revealsX-linked genes implicated in autoimmune diseases. PLoS ONE 9(12): e113684.

2. Gao F, Chang D, Biddanda A, Ma L, Guo Y, Zhou Z, Keinan A. 2015. XWAS: a toolsetfor chromosome X-wide data analysis and association studies. Journal of Heredity.

2 Downloading and Extracting XWAS

The software package can be freely downloaded from the Keinan lab website(http://keinanlab.cb.bscb.cornell.edu/ or http://www.alonkeinan.com):

http://keinanlab.cb.bscb.cornell.edu/data/xwas/XWAS_v2.0.tar.gz

After downloading the file, use the following command to extract locally:

tar -zxvf XWAS v2.0.tar.gz

All necessary binary files and scripts are included in the folder ./bin/. The user can also compilethe XWAS software on their system by downloading the source code from http://keinanlab.cb.

bscb.cornell.edu/data/xwas/xwas_src.tar.gz and building the executable using the follow-ing commands:

tar -zxvf xwas src.tar.gz

cd xwas src

make

In our current version, the XWAS software package is optimized for LINUX, but can be compiledfor Windows and MAC OS. The provided XWAS software is compiled on Linux machine. You maywant to consult MAKEFILE to make necessary changes on other machines.

2

http://keinanlab.cb.bscb.cornell.edu/sites/default/files/papers/chang_etal_2014_accountingexentricities_plosone.pdf


http://jhered.oxfordjournals.org/content/early/2015/08/12/jhered.esv059.full.pdf


http://keinanlab.cb.bscb.cornell.edu/

http://www.alonkeinan.com

http://keinanlab.cb.bscb.cornell.edu/data/xwas/XWAS_v2.0.tar.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/xwas_src.tar.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/xwas_src.tar.gz

3 Genotype Calling

We have implemented a pipeline that performs genotype calling on raw intensity data usingBIRDSUITE (Altshuler et al. 2008). Additionally, if you wish to re-call a dataset that you suspectmay have been genotyped incorrectly, we provide a simple method to compare BIRDSUITE’s callsto your previous calls. If you do not wish you re-call your data or do not have access to the rawintensity data, you may proceed to the next section.

3.1 Prerequisites and Set Up

BIRDSUITE currently supports intensity data from Affymetrix Genome-Wide Human SNP Array6.0 and Affymetrix Genome-Wide Human SNP Array 5.0. Additionally, the pipeline requiresthe following:

Java 1.5 or higherPython 2.5.2 or higherPython package numpy 1.2 or higher (http://www.numpy.org)R 2.4 or higherR package mclust 3 or higher (https://cran.r-project.org/package=mclust)

To set up the pipeline, navigate to $path/genotyping pipeline/bin. For the remainder ofthis section, $path will denote the location of the genotyping pipeline directory. If you are theadmin of your system, run the following command:

sudo easy install --script-dir=./ *py[VERSION].egg

where VERSION is your system’s version of Python. Note that the egg packages are optimizedfor Python 2.5 and 2.7. Contact our team for compatibility with other version of Python. Ifyou are not the admin of your system, run the following command:

easy install --install-dir=INSTALL DIR --script-dir=./ *py[VERSION].egg

where again VERSION is your system’s version of Python and INSTALL DIR is your desired locationfor installing Python packages. Make sure INSTALL DIR is in your $PYTHONPATH environmentvariable.

Next, install the three R packages by running the following commands in order:

R CMD INSTALL -l ./ broadgap.utils 1.0.tar.gz

R CMD INSTALL -l ./ broadgap.cnputils 1.0.tar.gz

R CMD INSTALL -l ./ broadgap.canary 1.0.tar.gz

Next, download and install the Matlab compiled runtime for your appropriate system. Down-load the following compressed file:

For 64-bit systems: http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnxa64.bin.gz

For 32-bit systems: http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnx86.bin.gz

3

http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnxa64.bin.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnxa64.bin.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnx86.bin.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/MCRInstaller.75.glnx86.bin.gz

Decompress the file by using the following command:

gzip -d MCRInstaller.75.[VERSION].bin.gz

with your respective package. Then run the command ./MCRInstaller.75.[VERSION].bin

-console, again with your respective package. When prompted for a destination directory, re-ply with MCR75 glnxa64 for 64-bit systems and MCR75 glnx86 for 32-bit systems.MCRInstaller.75.[VERSION].bin may be deleted after installation.

Download the metadata directory at http://keinanlab.cb.bscb.cornell.edu/data/xwas/

metadata.tar.gz. Place the compressed directory in $path/genotyping pipeline. Decom-press the directory by using the following command:

tar -zxvf metadata.tar.gz

3.2 Parameter File

The parameters for genotyping should be provided in a parameter file with an similar structureto example.params under the folder $path/genotyping pipeline. The following argumentsare required:

1. chipType=CHIP TYPE: Name of Affymetrix chip type used. Acceptable arguments areGenomeWideSNP 6 and GenomeWideSNP 5

2. famFile=FAM FILE: Path to the PLINK format .fam file for your data.

The following arguments are optional:

1. canary.priors=CANARY MODELS FILE: Use a specific models file to aid in clustering com-mon copy-number polymorphism. For example, setting this option to$path/genotyping pipeline/metadata/GenomeWideSNP 6.CEU.canary priors will resultin better clustering for samples of European ancestry. The default is a models file builtusing all 270 HapMap samples. Additional models files can be found in$path/genotyping pipeline/metadata/*priors.

2. canary.allele freq weight=WEIGHT: How much to weight observed CNP frequency fromHapMap to aid in clustering. The recommended value is 32 if samples are of Europeanancestry and one is using the GenomeWideSNP 6.CEU.canary priors models file.

3. genomeBuild=GENOME BUILD: Use the specified genome build. hg18 is the default. Cur-rently hg17 and hg18 are supported.

4. apt probeset summarize.force: Use this option if one or more of the cel files wereprocessed with a non-standard CDF.

5. noLsf: By default, BIRDSUITE uses the Load-Sharing Facility (LSF) to parallelize parts ofthe pipeline. Use this option if you do not have LSF installed on your system, or do notwish to use LSF.

6. celMap=CEL MAP: If you wish, you may provide a cel map file. This map can be used toconvert normally complicated cel names to a more common naming scheme. Each line

4

http://keinanlab.cb.bscb.cornell.edu/data/xwas/metadata.tar.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/metadata.tar.gz

of the file should contain two columns: the cel name, and the new individual name. See$path/genotyping pipeline/example.celMap for an example.

7. threshold=THRESHOLD: Each of BIRDSUITE’s calls are made with a certain confidence (0indicates most confident, 1 indicates least confident). Define a confidence cutoff value forwhich to exclude data from your final dataset.

8. outputName=NAME: Define the file name root for the output files.

3.3 Individuals File

Information about each cel file should be provided in a file with an identical structure toexample.cels under the folder $path/genotyping pipeline. The file should contain a linefor each cel file. Each line should have three columns:

1. Path to the cel file.

2. Gender of the individual. 0 indicates female, 1 indicates male, 2 indicates unknown.

3. Batch of the cel file. Samples that were run on the same chemistry plate belong to thesame batch.

3.4 Usage

Navigate to the directory $path/genotyping pipeline. Place your .params file and .cels inthis directory. Execute the genotyping pipeline using the following command:

python call genotypes.py FILE.params FILE.cels OUTPUT

where FILE is the name of your .params and .cels files and OUTPUT is your desired output direc-tory. The output will be a PLINK formatted dataset (.tped and .tfam) along with many outputfiles from BIRDSUITE detailing the specifics of the genotype calling. The full description of eachoutput file can be found at https://www.broadinstitute.org/birdsuite/output-files.

3.5 Comparing Calls

After genotyping your dataset, you may wish to compare your new calls to a previous set ofgenotype calls. For your convenience, we have included a script which will summarize the dif-ferences between two datasets for you. After navigating to $path/genotyping pipeline, youmay envoke this script using the following command:

python compare calls.py OLD ROOT NEW ROOT

where OLD ROOT is the file root name of your old calls and NEW ROOT is the file root name of yournew calls. Make sure both datasets are in the same directory as compare calls.py. This scriptwill produce three output files. The first file, snps.diff, will have the following columns:

[SNP ID] [NEW ROOT] [OLD ROOT] [number of different calls]

where each row corresponds to a SNP ID which was called differently between the two datasets.NEW ROOT is a 0 if the SNP was missing from the new calls, and a 1 if it was included. OLD ROOT

5

https://www.broadinstitute.org/birdsuite/output-files

follows this format as well. The final column denotes how many individuals this SNP was dif-ferentially called for.

The second output file, included individuals.diff, will have the following columns:

[IID] [number of different calls]

where each row corresponds to an individual which was included in both datasets. The secondcolumn denotes how many SNPs in this individual were differentially called between the twodatasets.

The final output file, excluded individuals.diff, will have the following columns:

[IID] [NEW ROOT] [OLD ROOT]

where each row corresponds to an individual which was excluded from one of the two datasets.NEW ROOT is a 0 if the individual was missing from the new calls, and a 1 if he/she was included.OLD ROOT follows this format as well.

3.6 Visualization

After genotyping your data, you may wish to visualize the clustering methods BIRDSUITE usesto call the genotypes. To do this, we recommend a software called Evoker (Morris et al. 2010).The source code and binaries for Evoker can be easily downloaded at the following link:

https://github.com/wtsi-medical-genomics/evoker/releases

Evoker allows a user to view the intensity plot of any given SNP for all individuals, colorcoded by genotype. It requires the PLINK formatted .bed, .bim, and .fam dataset producedby the pipeline above and [FILE].allele summary, which is an output file from BIRDSUITE.After running our genotyping pieline, both of these input should be easily accessible in theuser-specified output directory. For further details, visit the link above for full documentation.

4 Quality-Control Pipeline

We have implemented a pre-quality control and quality control pipeline that applies standardautosomal GWAS quality control steps as implemented in PLINK (Purcell et al. 2007) andSMARTPCA (Price et al. 2006), as well as procedures that are specific to X. It is recommended(not required) to perform imputation on the dataset using IMPUTE2. A feature of IMPUTE2

(Howie et al. 2009) is to assume that the effective population size on chromosome X is 25% lessthan on the autosomes, and we recommend using this feature. Based on the output of IMPUTE2,we recommend excluding variants with an imputation quality < 0.5. In this version, we haveincorporated the imputation pipeline. Please refer to Chapter 4 for more detail. We have pro-vided an excerpt of data from HapMap (International HapMap Consortium. 2003) project as ourexample.

In the following quality control steps, the dataset is initially split into two datasets, one consistingonly on males and the other on females. Then, the General Quality Control steps are performed

6

https://github.com/wtsi-medical-genomics/evoker/releases

https://hapmap.ncbi.nlm.nih.gov/

separately on each of the two datasets, which are then merged, before additionally applying theX-specific Quality Control. Note: all P -value thresholds in the QC procedure are divided bythe number of SNPs in the dataset (Bonferroni-corrected) when determining exclusion criteria.

4.1 Pre-QC

The XWAS software treats individuals without parental information as “founder”, and thosewith parental information as “non-founder”, regardless of whether or not the parents are includedin the sample. When running the software, “non-founder” will be removed by default. To avoidthis behavior, there are two options: 1) revise the XWAS QC pipeline to add --nonfounders toall XWAS command line; 2) revise your input file and set all parental information to be missing,“0”. NOTE: you have to make sure that all individuals included in the analysis are independent.Parent-child or siblings should be treated so that only one of them is retained.

Before feeding data to the XWAS QC pipeline, we need to pre-process the data so that interde-pendent individuals are removed. Specifically, for parent-child pairs, only parents are retainedand the children are removed. For siblings, only one is arbitrarily chosen and the rest are re-moved. All individuals included in the study are independent, from the standpoint of the knownfamilial relationship (Note that the relationship will be further identified by the IBD-basedanalysis). The final step is to remove all parental information in the *.fam file by setting thecorresponding fields to 0. In this manner, all samples are considered as “founder” in the XWASQC pipeline.

4.2 General Quality Control

1. SNPs (single nucleotide polymorphisms) are filtered if they have a missingness rate abovesome threshold (recommended is 10%), if they have a minor allele frequency (MAF)below some threshold (recommended is 0.005, i.e. 0.5%), or if they are not in Hardy-Weinberg equilibrium in females (the recommended P -value threshold is 0.05). Note: allP -value thresholds in the QC procedure are divided by the number of SNPs in the dataset(Bonferroni-corrected) when determining exclusion criteria.

2. Variants are filtered if their missingness is significantly correlated with phenotype (rec-ommended P -value threshold is 0.05). Note: this step is only applied to case-controlstudies.

3. Individual samples are removed if they are inferred to be related based on the proportionof shared Identity-by-Descent segments (recommended is 10%). Identity-by-Descent iscalculated using the --genome flag in PLINK. To avoid removing too many samples, onlyone individual from each pair of related individuals is removed. *Note: There may besome issues with PLINK relatedness filtering method in removing more individuals thanexpected.

4. Individuals are removed if they have a genotype missingness rate above some threshold(recommended is 10%) or if their reported sex does not match the heterozygosity ratesobserved on X. IMPORTANT: The relatedness filtering analysis based on shared IBDsegment assumes that the samples are from a homogeneous population. If the samples arefrom different populations, this analysis will be problematic.

7

5. Population structure is assessed with the software SMARTPCA from the package EIGENSTRAT(Price et al. 2006) and outlier individuals are removed.

4.3 X-Specific Quality Control

1. Variants on chromosomes other than X are removed, as well as variants in the pseudoau-tosomal regions (PARs) on X.

2. Variants are filtered if they have significantly different MAF between male and femalecontrols (recommended P -value threshold is 0.05).

3. Variants are filtered if they have significantly different missingness between male and femalecontrols (The recommended P -value threshold is 0.05).

4.4 Specifying the QC Parameters

The necessary information should be recorded in a text file (for example, params qc.txt),with the following parameters, where every parameter is listed on a separate line in the format“parameter name [value]”. An example file with default values (example params qc.txt) isgiven in the folder ./example/qc, along with resulting output. This example performs bothpre-qc and qc steps for you.

filename [Name of dataset file (without the extension)]

xwasloc [Location of XWAS executable]

eigstratloc [Location of the SMARTPCA and convertf executables]

exclind [0/1] This parameter allows you to exclude a predefined set of individuals. Tospecify individuals to be removed, set the value to 1 and list individuals in a file namedaccording to the format: $filename exclind.remove, where the $filename variable isthe same as the filename parameter. Otherwise, set the value to 0.

excludexchrPCA [YES/NO] Set value to YES to exclude X data when calculating PCA,or NO to include it.

build [17/18/19] Specify the genome build of the dataset. Currently available optionsare 17, 18 or 19 for hg17, hg18 and hg19 respectively.

alpha [alpha value for statistical significance, i.e. 0.05] This is used in determiningBonferroni-corrected P -value for exclusion criteria

plinkformat [ped/bed] Specify whether data is in ped/map or binary format

maf [Cutoff for removing variants, i.e. 0.005] Variants with minor allele frequency lessthan this value are removed.

missindiv [Cutoff for removing individuals with missing data, i.e. 0.10] Individualswith missing genotype rate greater than cutoff are removed.

missgeno [Cutoff for removing variants with missing data, i.e. 0.10] Variants withmissing genotype rate greater than cutoff are removed.

8

numpc [Number of PCA principal components to include as covariates, i.e. 10]

related [Cutoff for removing individuals based on relatedness, i.e. 0.125, which corre-sponds to the relatedness between first-degree cousins] Individuals with a proportion ofshared IBD segments above the cutoff proportion are considered for removal.

quant [0/1] Specify whether the phenotype is quantitative (1) or binary (0)

Once the quality control variables have been set in the parameter file, the pipeline, includingboth the pre-qc step and general qc step, can be run using the following command.

$path/run QC.sh params qc.txt

The variable $path denotes the location of the shell script run qc.sh. The output dataset filesafter QC are named as $filename.preprocessed final x (with suitable extensions), whichshould now only contain data for X.

In addition to the parameter file, a few select flags and options, including saving logs, inter-mediate files, and debug information, can be directly passed to the QC script. Please see$path/run QC.sh --help and the example for more usage details.

The package provides and makes use of a compiled SMARTPCA executable in the /bin folder.If you encounter errors running the executable, try downloading the source code for EIGENSOFTfrom the the Price Lab at the following URL and compiling locally.

http://www.hsph.harvard.edu/alkes-price/software/

5 Imputation

We implemented a pipeline that performs X imputation with IMPUTE2 (Howie et al. 2009) using1000 Genomes Phase III reference data.

5.1 Data Preparation

The 1000 Genomes Phase III reference files are in build human genome 19 (hg19). Additionally,IMPUTE2 requires all data to be on the positive strand alignment. Before running imputation,it can be beneficial to ensure your data is also in build hg19 and on the positive strand. If youalready know the build and strand alignment of your dataset, feel free to skip this section. To helpwith this, our package includes a script called check genome build and strange alignment.pl.This script is located in XWAS/bin. It uses the X chromosome to perform position based andSNP ID based allele concordance checks. The script takes three arguments:

1. FILE.bim: The PLINK format .bim file for your dataset.

2. 1000GP Phase3 chrX NONPAR.legend: The 1000 Genomes Phase III reference file providedby us. Refer to the section below titled Reference Files Preparation on how to obtain thisfile.

3. OUTPUT.txt: The desired output file you wish to write your results to.

9

http://www.hsph.harvard.edu/alkes-price/software/

The script can be invoked by the following command:

perl check genome build and strange alignment.pl FILE.bim

1000GP Phase3 chrX NONPAR.legend OUTPUT.txt

The script will produce two output files: OUTPUT.txt and OUTPUT.txt.inconsistent.

The first file, OUTPUT.txt, summarizes the results of the concordance checks. It first lists thechromosome used for testing (with the 1000 Genomes reference file, this is chromosome X) andthe number of SNPs which were dropped from testing because they had only one allele. Thenext section summarizes the position-based concordance check. It lists the total number of SNPpositions in FILE.bim, the number of SNP positions in FILE.bim which were also found in thereference file, and the number of times that a given position’s allele matched in FILE.bim andthe reference file (this value is also given as a percentage). If the number of SNP positions foundin the reference file poorly matches the total number of SNP positions in FILE.bim (below 95%find rate), this is an indicator that your dataset is likely not in build hg19. Utilizing the pro-gram liftOver can encode your dataset in hg19. If the percentage of matching alleles is below95%, this is an indicator that your dataset is likely not all on the positive strand as it shouldbe for IMPUTE2. The final section summarizes the SNP ID-based concordance check. It can beinterpreted the same way as the position-based results, only instead of using SNP positions tomatch alleles, it relies on SNP IDs.

Will Rayner at the Wellcome Trust Centre for Human Genetics has provided a number ofuseful scripts which can fix strand alignment. These can be found on his website at http:

//www.well.ox.ac.uk/~wrayner/strand/.

The second output file, OUTPUT.txt.inconsistent, has a row for each SNP which was in-consistent between FILE.bim and the reference file. Each row has two columns, the position orSNP ID of the inconsistent SNP, and a column which indicates whether the inconsistency wasposition based or SNP ID based.

5.2 Reference Files Preparation

The 1000 Genomes Phase III reference files for X can be downloaded from https://mathgen.

stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz. After downloading, type

tar -zxvf 1000GP Phase3 chrX.tgz

gzip -d 1000GP Phase3 chrX NONPAR.hap.gz

to uncompress the downloaded file. The files that are needed for imputation are1000GP Phase3 chrX NONPAR.hap, genetic map chrX nonPAR combined b37.txt and our up-dated 1000GP Phase3 chrX NONPAR.legend.gz file, which can be easily retrieved from http://

keinanlab.cb.bscb.cornell.edu/data/xwas/1000GP_Phase3_chrX_NONPAR.legend.gz. Pleaseunzip it once downloaded. Please move these three files into a common directory that stores allthe reference files for IMPUTE2. In the provided example, the directory is ./imputation pipeline/

imputation reference files. All other files generated by uncompressing 1000GP Phase3 chrX.tgz

will not be used and can be deleted.

10

http://www.well.ox.ac.uk/~wrayner/strand/

http://www.well.ox.ac.uk/~wrayner/strand/

https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz

https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3_chrX.tgz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/1000GP_Phase3_chrX_NONPAR.legend.gz

http://keinanlab.cb.bscb.cornell.edu/data/xwas/1000GP_Phase3_chrX_NONPAR.legend.gz

5.3 Parameter File

The parameters for imputation should be provided in a parameter file with an identical structureto example.par under the folder ./imputation pipeline. The content of file is explained asfollows:

1. The file starts with explanatory comments which have a # sign at the beginning of eachline. This comment is not required.

2. FILE: Name of pre-imputation PLINK files, WITHOUT extension.

3. OUTPUT: Name of the output file defined by the user, WITHOUT extension.

4. NJOBS: Number of parts to break down chrX. These jobs can be run sequentially or inparallel. Default value is 31.

5. BUILD: Build of the data (17 for hg17 and 18 for hg18).

6. FILELOC: Location of the pre-imputation PLINK files.

7. REFLOC: Location of 1000G reference files (1000GP Phase3 chrX NONPAR.hap,genetic map chrX nonPAR combined b37.txt and 1000GP Phase3 chrX NONPAR.legend).

8. TOOLSLOC: Location of imputation tools (./imputation pipeline/imputation tools forthis package).

9. RESLOC1: Intended location to store the output files after step 1 (Please make sure thatthis folder exists).

10. RESLOC2: Intended location to store the output files after step 2 (Please make sure thatthis folder exists).

11. FINALRESLOC: Intended location to store the final results (Please make sure that this folderexists).

12. MAFRULE: Population name of the samples and the corresponding minor allele frequency forfiltering out the sites in the 1000 Genomes reference file with a MAF ≤ this value in thatpopulation. The value must be one of: AFR.MAF (Africa), AMR.MAF (America), EAS.MAF(East Asia), EUR.MAF (Europe), SAS.MAF (South Asia), ALL.MAF (all above popuationstogether), combined with ≤ VAL. For example, if the sample is from European populationand the minor allele frequency threshold is 0.005, the value will be EUR.MAF ≤ 0.005.

NOTE: For file locations, make sure to include the whole path, including the ending “/”.

5.4 Usage

After putting the par file under the same folder as make imputation files.py, please take thefollowing steps to execute the imputation pipeline:

1. In the command line, type python make imputation files.py FILE.par. This will gen-erate all the necessary shell scripts for imputation.

11

2. Type ./FILE preimpute.sh. This will perform the pre-imputation step. The results afterthis step will be located in RESLOC1.

3. There are two options for the imputation step: (1) If you choose to run all of the jobs insequence, directly type ./FILE impute2 run all.sh; (2) If you choose to run the jobs inparallel, type each command in the file FILE impute2 run all.sh in a separate commandwindow. The results after this imputation step will be located in RESLOC2.WARNING: if you run the imputation pipeline in a local machine, running concurrentjobs jobs may be very memory-consuming! It is recommended to run the jobs one afterthe other on a local machine.BENCHMARK: On a 8 cores, 7.8 GB RAM, ubuntu 14.04LTS machine, to run on asample of 198 individuals, 32526 SNPs data that is divided into 30 jobs, on average ittakes approximately 20 minutes to run each job with approximately 1 GB memory.

4. Type ./FILE impute2 cat.sh. This step combines the files from different jobs in theimputation step. The results after this step will be located in FINALRESLOC, which includethe PLINK files after imputation for conducting next-step QC and the info files for reference.

6 Post Imputation Quality Control

If you have run the imputation step with our pipeline, we recommend complete a post imputa-tion quality control step. This step resembles the quality control described above. However, itdoesn’t include pre-qc step. The necessary information should be recorded in a tex file similarto the one used in the quality control step, with the same set of parameters, except the genomebuild of the dataset. After the imputation, field build should be 19. The filename might alsoneed to be changed according to the name of dataset file after IMPUTE2. Once the post imputa-tion quality control variables have been modified, it can be run using the following command.

$path/xwas qc post imputation.sh params qc.txt

7 Association Testing with XWAS

The XWAS software contains new functions for conducting X-wide association studies while keep-ing all previous functions that are provided by the commonly-used PLINK. XWAS provides 3 newC++ files (xoptions.h , xoptions.cpp, xfunctions.cpp), as well as minor changes to sourcecode from PLINK. The newly-added functionality is described below:

--xhelp Shows brief descriptions of all the XWAS-specific functions.

--xwas In order to use any of the options below, this option needs to be called.

--freq-x Outputting the allele frequencies for each polymorphic site in males and femalesseparately.

--freqdiff-x α Tests for significantly different minor allele frequency between maleand female controls, according to P -value threshold α. This option is used within the QCscript to filter out SNPs that have significantly different minor allele frequencies betweenmales and female controls.

12

--strat-sex Sex-stratified test, which first carries out an association test in males andfemales separately and then combines the two test results to produce a final sex-stratifiedsignificance value for each SNP.

--fishers Combines p-values from sex-stratified test (--strat-sex) for males and fe-males using Fisher’s method (This is the default method for combining p-values).

--stouffers Combines p-values from sex-stratified test (--strat-sex) for males andfemales using Stouffer’s method. Our sample-size-based analysis of Stouffer’s methodwithin this software follows that of Willer et al.

--sex-diff Sex-difference test, which tests the difference in the effect size between malesand females for each SNP.

--var-het Variance-Heterogeneity test, which tests for X-linked association by lookingfor higher phenotypic variance in heterozygous females than homozygous females at eachSNP. This method is described in further detail in Ma et al.

--var-het-weight Weighted-Heterogeneity test, which tests for X-linked association bya weighted regression approach to account for the variance inflation. This method isdescribed in further detail in Ma et al.

--var-het-comb Combined test of variance and weighted association by Stouffer’s ap-proach, which combined the variance-based test and the weighted association test intoa single test statistic using the Stouffer’s Z score method. This method is described infurther detail in Ma et al.

--xbeta This option can accompany --strat-sex or --sex-diff to output regressioncoefficients (in addition to the default odds ratios) for each sex.

--clayton-x Test for association on X chromosome using Clayton’s method under theassumption that the allele frequency does not vary with sex. This method is described infurther detail in Clayton.

--run-all This option conveniently runs sex-stratified test (combines p values from sex-stratified test with Fishers and Stouffers methods), sex-difference test, variance-heterogeneitytest, Clayton’s test for association on X chromosome together.

--xepi Runs the epistasis test, which tests SNP× SNP epistasis for qualitative/quantitativephenotypes. This function is modified from the original –epistasis in PLINK. For qualita-tive phenotypes, --xepi is essentially the same as --epistasis. Both use a logistic modeland test the interaction term with a Wald test. For quantitative phenotypes, --epistasisin PLINK uses a linear model with a Wald test, while --xepi uses a linear model andthen tests the interaction with a t test.Use the command: ./bin/xwas –xwas –bfile yourData --xepi, which will send output tothe files:

xwas.xepi.cc NOTE: cc:case/control; for quantitative traits, replaced by qt.xwas.xepi.cc.summary

Note that cc represents case/control, and it will be replaced by qt for quantitative traits.

13

http://bioinformatics.oxfordjournals.org/content/26/17/2190.full

http://www.biomedcentral.com/1471-2164/16/241



http://www.ncbi.nlm.nih.gov/pubmed/18441336

This file records the pairwise p value. The file xwas.xepi.cc.summary records the sum-mary information for each SNP.

For quantitative phenotypes, there will be another output file xwas.xepi.qt.summary.corrthat records the correlation between SNP based interaction test. It is used in the gene-based gene-gene interaction test.

There are four different modes for --xepi to assign which SNPs are tested:

1. ALL × ALL e.g. bin/xwas xwas --bfile yourData --xepi --out result

2. SET1 × SET1 (where set.file contains only 1 set) e.g. bin/xwas --xwas --bfile

yourData --xepi --set set.file --out result

3. SET1 × ALL (where set.file contains only 1 set) e.g. bin/xwas --xwas --bfile

yourData --xepi --set set.file --set-by-all --out result

4. SET1 × SET2 (where set.file contains 2 sets) e.g. ../bin/xwas --xwas --bfile

yourData --xepi --set set.file --out result

--xepi can be combined with the following commands:--set {set.file} There can only be one or two sets in the set file for --xepi. To specifysets of SNPs to be tested, set.file should be in the following format:

SET Ars001rs002END

SET Brs101rs102rs103END

-- set-by-all Specifies the SET1 × ALL mode for --xepi-- covar {c.file} Supports the inclusion of one or more covariates for --xepi.-- xepi1 {1} Sets the threshold for SNP pair p-value, only output tests with p-valuesless than the threshold (default is 1e-4)-- xepi2 {1} Sets the threshold for significant --xepi tests in .summary file. Default is1e-4.

In addition to these new functions, the options below, which have already been implemented inPLINK, can be useful in carrying out X-wide association studies.

--logistic, --linear The options to carry out logistic regression and linear regressionfor binary and quantitative traits respectively. Note that the genotypes of males will follow0/1 coding by default, i.e. a male allele is considered as equivalent to a single female copy.

--xchr-model 2 This option can be used together with --logistic or --linear tocode genotypes of males as 0/2, i.e. males are always considered equivalent to eitherfemale homozygote. When relevant, the above newly-added tests can also consider either0/1 or 0/2 coding using this option.

14

7.1 Examples

For examples on running the functions above, please execute the following:

cd tests

make file

# Get allele frequencies for Males and Females Separately

../bin/xwas --bfile dummy case1 --xwas --freq-x

# Perform the sex-stratified test, using Fisher’s method to combine p-values

../bin/xwas --bfile dummy case1 --xwas --strat-sex --fishers

# Perform the sex-stratified test, using Stouffer’s method to combine p-values

../bin/xwas --bfile dummy case1 --xwas --strat-sex --stouffers

# Perform the sex-difference test

../bin/xwas --bfile dummy quant1 --xwas --sex-diff

# Perform the variance-heterogeneity test (with covariates)

../bin/xwas --bfile dummy quant1 --xwas --var-het --covar dummy quant1.covar

# Perform linear regression using 0/2 coding for males

../bin/xwas --bfile dummy quant1 --xchr-model 2 --linear

The output files will all have the prefix xwas and be in the format of column-delimited tables.For example, one of the output files is xwas.xstrat.logistic. This file contains columns thatdetail the SNP identifier, the Stouffer’s Z-score, and the overall P -value of the test (among otherdata) for each SNP.

To verify that the executable is built correctly one can execute the command make all in thetests folder, which will both create datasets as well as execute test xwas.py which providesa comprehensive test of all of the commands and options available in XWAS. For the user’s con-venience, all of the commands that are executed within the testing sequence are printed so theuser can replicate them on their own data.

8 Gene-based Association Testing on X

The gene-based association testing is based on the VEGAS (Liu et al. 2010) framework. Here,the procedure is modified to utilize the truncated tail strength (Jiang et al. 2011) and truncatedproduct (Zaykin et al. 2002) method to combine individual SNP-level P -values. In order to runthe gene-based test, the following two packages must be installed in R: corpcor and mvtnorm.One way to install such packages is from within R’s command line by typing the following com-mands.

install.packages("corpcor")

install.packages("mvtnorm")

8.1 Automated Gene Based Association Tests

Gene-based testing builds upon the SNP-level analyses shown previously by using the P -valuesobtained for each SNP. For your convenience, we have provided a script that automaticallysearches for the temporary files our XWAS software has created when you run the SNP-levelanalyses, and generates the gene based results for you. The script for the automated gene based

15

testing requires several input files to run:

1. A dataset in binary PLINK format (bed/bim/fam).

2. A file with a list of genes to test. The gene list file should contain one gene per line and thefollowing information in the listed order:

[chromosome #] [gene starting position] [gene end position] [gene name]

3. A file (params genes auto.txt) listing all of the parameters necessary to run the script. Anexample (example params gene test auto.txt) is available in the folder example/gene automated.The parameters required are:



genescriptloc [Location of the provided R script truncgene.R]

genelistname [Location and filename of a list of genes to test]

pvfolder [Location of the association results files]

buffer [Integer specifying the buffer region around genes (in base-pairs)] We determinea buffer region so that we do not exclude SNPs that may be slightly outside the definedgene region

Note that the output file will be in the directory of the association result files.After installing the required R packages and assembling all the required files, the automatedgene-based test can be run using the following command:

$path/gene based test automated.sh params genes auto.txt

The output of the gene-based association test will have the following columns:

[gene] [reps] [tail p value] [prod p value]

The first column states the name of the gene. The second is the number of (adaptively deter-mined) bootstrap replicates. The third column is the P -value for the gene using the truncatedtail strength method. The last column shows the P -value calculated using the truncated productmethod.

8.2 Single Gene Based Association Test

Gene-based testing builds upon the SNP-level analyses shown previously by using the P -valuesobtained for each SNP. The script for each gene based testing requires several input files to run:

1. A dataset in binary PLINK format (bed/bim/fam).

2. A file with a list of genes to test. The gene list file should contain one gene per line and thefollowing information in the listed order:

16

[chromosome #] [gene starting position] [gene end position] [gene name]

Gene positions in this file should correspond to the same build as referenced in the binary PLINK

format files. We recommend using the “UCSC knownCanonical” transcript file for a compre-hensive list of genes on chromosome X. We have placed correctly formatted compressed genelists on chromosome X for human genome builds hg19 and hg38 in the genes directory.

3. A file with association statistics for SNPs, with one SNP per line:

[SNP name] [SNP P-value]

These P -values can be obtained from any of the SNP-based tests mentioned in the previoussection (i.e. the sex-stratified test –strat-sex ). This input file will consist of the two columnsrepresenting the SNP name and P -value from the output files of the SNP-based tests.

4. A file (params genes.txt) listing all of the parameters necessary to run the script. Anexample (example params gene test.txt) is available in the folder /example/gene test. Theparameters required are:



genescriptloc [Location of the provided R script truncgene.R]


assocfile [Location and filename of the association results file]


numindiv [Number of individuals with which to estimate linkage disequilibrium for es-timating dependency between single-SNP tests] We recommend that this number is aminimum of 200.]

output [Location and filename of the output file]

After installing the required R packages and assembling all the required files, the gene-based testcan be run using the following command:

$path/gene based test.sh params genes.txt

The variable $path denotes the location of the shell script gene based test.sh. Note that youneed to run the script in the directory of your data files. For a full example of running thegene-based tests, execute:

/example/gene test/run example gene test.sh params genes.txt

The output of the gene-based association test will have the following columns:

17

[gene] [reps] [tail p value] [prod p value]

The first column states the name of the gene. The second is the number of (adaptively deter-mined) bootstrap replicates. The third column is the P -value for the gene using the truncatedtail strength method. The last column shows the P -value calculated using the truncated productmethod.

9 Gene-based Gene-gene Interaction Testing

The gene-based gene-gene interaction (GGG) tests are based on a framework we previously de-veloped (Ma et al. 2013). The tests combine SNP-based interaction tests between all pairs ofSNPs in two genes to produce a gene-level test for interaction between the two. There are fourGGG tests that extend the following P value combining methods: minimum P value, extendedSimes procedure, truncated tail strength (Jiang et al. 2011), and truncated P value product(Zaykin et al. 2002). Note that currently there is a known issue with the latter two methods. Thetail p value (truncated tail strength) and prod p value (truncated P value product) may beinflated in some cases. Currently, the GGG tests can only deal with quantitative phenotypes.In order to run the GGG tests, the following two packages must be installed in R: corpcor andmvtnorm. Please refer to the previous chapter for installation instruction.

GGG tests are built upon the SNP-level interaction test function --xepi, and use the P valuesobtained for each SNP pair. The script requires several input files to run:

1. A dataset in binary PLINK format (bed/bim/fam)

2. A file with a list of genes to test interaction for each pair of gene. The gene list file shouldcontain one gene per line and the following information in the listed order:

[chromosome #] [gene starting positio] [gene end position] [gene name]

3. A file (params genes.txt) listing all of the parameters is necessary to run the script. An ex-ample (example params gene test.txt) is available in the folder /example/gene gene inter. Theparameters required are:



genescriptloc [Location of the provided R script gene based inter.R]


genelistname2 [Location and filename of a second list of genes (optional)]


covarfile [Name of covariant file (optional)]

output [Location and filename of the output file]

18

http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003321


http://statgen.ncsu.edu/zaykin/tpm/tpmfinal.pdf

Similar to xepi, the gene-based interaction test can be run in a few different modes. By de-fault, it performs interaction tests on all pairs of genes in list1. If given the optional argumentgenelistname2, it will perform SET1 × SET2. To run the GGG test on all X chromosomegenes, unzip a list of the genes from the /genes/ directory using gzip -c > Xgenelist.txt.

After installing the required R packages and assembling all the required files, the gene-basedtest can be run using the following command:

$path/gene based interaction.sh params genes.txt

The variable $path denotes the location of the shell script gene based interaction.sh. For afull example of running the GGG tests, execute:

/example/gene gene inter/run example GGG test.sh example params GGG test.txt

The output of the GGG test will have the the two kinds of files:

1. genepair results.txt, which has the following columns:

[gene1] [gene2] [min P value] [gate P value] [tail p value] [prod p value]

The first and second columns state the names of two genes. The third through sixthcolumns record P values for each method: minimum P value, extended Simes procedure,truncated tail strength, and truncated P value product.

2. gene1name gene2name.qt The SNP-based interaction result from the xepi function foreach pair of genes. The file formats refers to the --xepi function.

10 Troubleshooting

The Keinan Lab is continually developing XWAS and new features/fixes will be distributed often.Strongly recommend to join our mailing-list for updates and announcements by signing up onour website http://keinanlab.cb.bscb.cornell.edu/content/xwas. Please report bugs orask questions by sending an email to [email protected]. We are also collaboratingwith several groups by running desired features of XWAS on their data, and we welcome additionalsuch collaborations. Please email Alon Keinan ([email protected]) regarding collaborationopportunities.

Below are some helpful tips for troubleshooting if XWAS is not running correctly for you:

1. XWAS requires the newer liblapack.so.3 in place of the older liblapack.so.3gf library.Run ldconfig -p | grep liblapack to locate which version is currently installed. InDebian based linux distros, liblapack.so.3 can be installed from the liblapack3 pack-age, while the lapack package provides it in rpm based systems.

2. When combining or comparing datasets, it can be important to know which build yourdataset is in and whether you have agreeing strand alignment for your datasets. Youcan check genome build and strand alignment by using a helpful script that our pack-age includes called check genome build and strange alignment.pl. See the imputationsection of this manual for more details on how to use and interpret this script.

19

http://keinanlab.cb.bscb.cornell.edu/content/xwas

http://keinanlab.cb.bscb.cornell.edu/content/xwas

mailto:[email protected]?subject=XWAS%20Troubleshooting

mailto:[email protected]

3. Our binary package is optimized for Linux. If you are attempting to run XWAS onWindows or Mac OS, please download our source code and alter the Makefile for yourappropriate system.

4. Because our software is built off PLINK, many PLINK features and flags carry over toXWAS. If you wish to run a PLINK test that you do not see detailed in this manual,first try running the test as detailed in PLINK’s documentation. If this does not work asexpected, contact us for more details.

For further questions, please contact [email protected].

11 References

1. Chang D, Gao F, Slavney A, Ma L, Waldman YY, Sams AJ, Billing-Ross P, Madar A,Spritz R, Keinan A. 2014. No eXceptions: Accounting for the X chromosome in GWASreveals X-linked genes implicated in autoimmune diseases. PLoS ONE 9(12): e113684.

2. Clayton D. 2008. Testing for association on X chromosome. Biostatistics 9(4):593-600.

3. Gao F, Chang D, Biddanda A, Ma L, Guo Y, Zhou Z, Keinan A. 2015. XWAS: a toolsetfor chromosome X-wide data analysis and association studies. Journal of Heredity.

4. Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputationmethod for the next generation of genome-wide association studies. PLoS Genetics 5:e1000529.

5. International HapMap Consortium. 2003. The International HapMap Project.Nature,426(6968): 789-96.

6. Jiang B, Zhang X, Zuo Y, Kang G. 2011. A powerful truncated tail strength method fortesting multiple null hypotheses in one dataset. J Theor Biol, 277, 67-73.

7. Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E,Veitch J, Collins PJ, Darvishi K, Lee C, Nizzari MM, Gabriel SB, Purcell S, Daly MJ,Altshuler D. 2008. Integrated genotype calling and association analysis of SNPs, commoncopy number polymorphisms and rare CNVs. Nature Genetics, 40, 1253-1260.

8. Liu JZ, MaRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, AMFS Investigators,Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S. 2010. A versatilegene-based test for genome-wide association studies. Am J Hum Genet, 87, 139-145.

9. Ma L, Hoffman G, Keinan A. 2015. X-inactivation informs variance-based testing forX-linked association of a quantitative trait. BMC Genomics. 16:241.

10. Morris JA, Randall JC, Maller JB, Barrett JC. 2010. Evoker: a visualization tool forgenotype intensity data. Bioinformatics (Oxford, England), 26.14, 1786-1787.

11. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. 2006. Principalcomponents analysis corrects for stratification in genome-wide association studies. NatGenet, 38, 904-909.

20

mailto:[email protected]?subject=XWAS%20Troubleshooting











http://www.nature.com/ng/journal/v40/n10/abs/ng.237.html

http://www.nature.com/ng/journal/v40/n10/abs/ng.237.html



http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1463-y

http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1463-y

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3073232/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3073232/

http://www.nature.com/ng/journal/v38/n8/full/ng1847.html

http://www.nature.com/ng/journal/v38/n8/full/ng1847.html

12. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, SklarP, de Bakker PI, Daly MJ, Sham PC. 2007. PLINK: A Tool Set for Whole-GenomeAssociation and Population-Based Linkage Analyses. Am J Hum Genet, 81, 559-575.

13. Willer CJ, Li Y, Abecasis GR. 2010. METAL : fast and efficient meta-analysis of genomewideassociation scans. Bioinfomatics, 26, 2190-2191

14. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. 2002. Truncated product method forcombining P-values. Genet Epidemiol, 22, 170-185.

21







xwas (version 2.0): a toolset for chromosome x-wide data...

Documents