
Bovine GWAS with Mixed Linear Model Tools

Release 8.7.0

Golden Helix, Inc.

Feb 14, 2019

Contents

Sample and Marker Quality Assurance
• Filtering Samples
• Filtering Markers
• LD Pruning Markers

Cryptic Relatedness

Population Stratification

Preparing Data and Running Analysis
• Preparing Dataset
• Running the Analysis

Single-Locus Mixed Model

Multi-Locus Mixed Model Analysis
• MLMM Step Information and Covariate P-values
• P-Values from Multi-Locus Mixed Model


Updated: January 9, 2019

Level: Fundamentals

Version: 8.7.0 or higher

Product: SVS

The following tutorial is designed to systematically introduce you to a number of techniques for genome-wide association studies. The dataset included in the tutorial is from a Bovine HapMap dataset genotyped using an Illumina Array. For the association study we will be using the Mixed Linear Model tool that is available in SVS.

This tutorial is not meant to replicate all the workflows you might use in a complete analysis, but instead to touch on a sampling of the more typical scenarios you may come across in your own studies.

All phenotype data is simulated.

Requirements

To follow along you will need to download and unzip the following file:

Download: Bovine_GWAS_Tutorial.zip (http://doc.goldenhelix.com/SVS/tutorials/bovine_gwas_mlmm/Bovine_GWAS_Tutorial.zip)

We hope you enjoy the experience and look forward to your feedback.

Sample and Marker Quality Assurance

In the following sections we will filter the data, removing samples with poor call rate and filtering markers for call rate, number of alleles, and minor allele frequency. We will then finish with LD pruning the data in preparation for the cryptic relatedness and population stratification checks.

• From the SVS Welcome screen go to File > Open Project, navigate to the downloaded project, and select the Bovine_GWAS_Tutorial.ghp file. The project should already contain two spreadsheets, Phenotypes and Bovine HapMap Genotypes.

• Select Tools > Current Project’s Options and confirm that the correct genome assembly is selected. The project for this tutorial should already be set to use the Bos taurus (Cow), ARS UCD1.2 (Apr 2018) genome assembly. This option primarily affects the genomic view and which data sources are available in the integrated GenomeBrowse plotting interface.

Note: For more details on creating and using genome assemblies in SVS please see the Creating and Using Genome Assemblies tutorial.

Filtering Samples

• Open the Bovine HapMap Genotypes spreadsheet and choose Genotype > Quality Assurance and Utilities > Filter Samples by Call Rate.

• Select Drop if call rate <= 0.99. The dialog should look like Figure 1-1. Click Run.

Figure 1-1. Filter Samples by Call Rate Dialog

• Four samples should have been inactivated due to poor call rate, and a subset spreadsheet called Subset - Samples with Call Rate > 0.99 should have been created.
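The call rate filter above can be sketched outside SVS as follows. This is a minimal illustration, not the SVS implementation: the genotype encoding (strings, with None for a missing call), the function names, and the toy sample names are all assumptions made for the example.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def filter_samples(samples, threshold=0.99):
    """Keep samples whose call rate is strictly above the threshold,
    mirroring 'Drop if call rate <= 0.99'."""
    return {name: calls for name, calls in samples.items()
            if call_rate(calls) > threshold}


# Toy data (hypothetical sample names, four markers each).
samples = {
    "S1": ["A_A", "A_B", "B_B", "A_A"],   # call rate 1.00
    "S2": ["A_A", None, "B_B", "A_B"],    # call rate 0.75
}
kept = filter_samples(samples)
```

With real data the same comparison would run over tens of thousands of markers per sample, so a sample is dropped only when more than 1% of its calls are missing.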


http://doc.goldenhelix.com/SVS/tutorials/create_genome_assembly/index.html



Filtering Markers

• From the Subset - Samples with Call Rate > 0.99 spreadsheet go to Genotype > Genotype Filtering by Marker.

• Select Drop if call rate < 0.85, Drop if number of alleles > 2, and Drop if Minor Allele Frequency (MAF) < 0.01. The dialog should look like Figure 1-2. Click Run.

Figure 1-2. Filter Markers by Call Rate, number of alleles and MAF.

The result of the tool should be one spreadsheet, Filtering Results, which contains two columns for each filtering criterion: one column with the original dataset’s values for that criterion, and a binary column stating whether each marker met that filter criterion. Additionally, an overall Drop? column is output (at the beginning) that has a 1 for every marker that will be filtered from the original dataset.

In the original genotype spreadsheet you will see that some of the markers in the dataset have now been inactivated. In particular, 3,567 columns were inactivated, leaving 49,323 markers active.
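The three marker criteria can be illustrated with a short sketch. It assumes each genotype is a pair of allele letters (None if missing) and defines MAF as the frequency of the second most common allele; both are assumptions for illustration, and SVS may compute these statistics differently.

```python
from collections import Counter


def marker_stats(calls):
    """calls: list of (allele, allele) tuples, or None for a missing call.
    Returns (call rate, number of distinct alleles, minor allele frequency)."""
    observed = [c for c in calls if c is not None]
    counts = Counter(allele for c in observed for allele in c)
    call_rate = len(observed) / len(calls)
    total = sum(counts.values())
    freqs = sorted(counts.values(), reverse=True)
    # Assumption: MAF is the frequency of the second most common allele,
    # and 0 for a monomorphic (or entirely missing) marker.
    maf = freqs[1] / total if len(freqs) > 1 else 0.0
    return call_rate, len(counts), maf


def drop_marker(calls, min_call_rate=0.85, max_alleles=2, min_maf=0.01):
    """True if the marker fails any of the three tutorial criteria."""
    cr, n_alleles, maf = marker_stats(calls)
    return cr < min_call_rate or n_alleles > max_alleles or maf < min_maf


# Toy markers (hypothetical data).
good = [("A", "B"), ("A", "A"), ("B", "B"), ("A", "B")]
monomorphic = [("A", "A")] * 4
triallelic = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "A")]
low_call = [("A", "B"), None, None, None]
```

A marker is dropped as soon as any one criterion fails, which corresponds to the overall Drop? column described above.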

• Create a subset spreadsheet by going to Select > Subset Active Data.

• Rename the spreadsheet to High Quality Samples and Markers by right-clicking on the spreadsheet node in the project navigator and selecting Rename.



Figure 1-3. Results Spreadsheet for Marker Filtering.

Figure 1-4. Original Genotype Spreadsheet with Inactive Markers.



LD Pruning Markers

Some tests, such as Identity by Descent (IBD) Estimation and Principal Component Analysis (PCA), will obtain better results if the markers used are not in linkage disequilibrium with each other. So, before proceeding to the next sections of the tutorial, we will LD prune the markers.

• Open the High Quality Samples and Markers spreadsheet and go to Genotype > Quality Assurance and Utilities > LD Pruning.

• Leave the default options and click OK.

Figure 1-5. Default Options for LD Pruning.

Note: The options for pruning your markers are specified in the options dialog, with the default options being the most common choices for basic pruning of the data. However, if your marker data is dense in some areas, or you see large blocks of moderately high LD in other areas that you would like reduced, changing these default options may improve your results. See the Determining the best LD Pruning options blog post for assistance in selecting the best options for your data.

This will take a couple of minutes to run. Upon finishing, 5,734 markers will be inactivated, as designated by the grayed-out columns in the spreadsheet. You will be using only the active columns (non-correlated markers) for autosomal chromosomes to perform IBD and PCA, so create a column subset spreadsheet.
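To make the idea of LD pruning concrete, here is a simplified greedy sketch. It is not the SVS algorithm: it uses the squared Pearson correlation of genotype counts (0, 1, 2) as a stand-in for the LD statistic, and the window and threshold values are placeholders rather than the SVS defaults.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype-count vectors,
    using only samples where both calls are non-missing (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy ** 2 / (sxx * syy)


def ld_prune(markers, window=50, threshold=0.5):
    """markers: list of (name, genotype counts) in map order.
    Greedily drops the later marker of any pair within the window
    whose r^2 exceeds the threshold. Returns the names that remain."""
    dropped = set()
    for i, (name_i, gi) in enumerate(markers):
        if name_i in dropped:
            continue
        for name_j, gj in markers[i + 1:i + window]:
            if name_j not in dropped and r_squared(gi, gj) > threshold:
                dropped.add(name_j)
    return [name for name, _ in markers if name not in dropped]


# Toy data: m2 duplicates m1, m3 is only weakly correlated with m1.
markers = [
    ("m1", [0, 1, 2, 0, 1, 2]),
    ("m2", [0, 1, 2, 0, 1, 2]),
    ("m3", [0, 0, 1, 1, 2, 2]),
]
kept = ld_prune(markers)
```

In this toy case m2 is removed because it carries the same information as m1, which is the kind of redundancy that can distort IBD and PCA results.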

• Choose Select > Column > Column Subset Spreadsheet.

• In the Project Navigator, rename this node to Pruned SNP Subset.

• Close all open spreadsheets by going to Window > Close All.

If your dataset contains non-autosomal chromosomes, you will want to inactivate the markers in these chromosomes before proceeding to IBD and PCA analysis.

• To inactivate these markers from the subset spreadsheet (if necessary), choose Select > Activate by Chromosomes, uncheck the non-autosomal chromosomes (e.g. X, Y, XY, Z, MT) and click OK.


http://blog.goldenhelix.com/jbartole/determining-best-ld-pruning-options/

Cryptic Relatedness

Next, find and filter samples determined to be “related” to other samples. Relatedness is often defined as family relatedness, but identity by descent (IBD) estimation can also detect duplicate samples, duplicate samples from one of a pair of genotyping chips but not the other, or sample contamination.

• From the Pruned SNP Subset spreadsheet, choose Genotype > Quality Assurance and Utilities > Identity by Descent Estimation.

• For this exercise, check Output IBS distances ((IBS 2 + 0.5*IBS 1)/# non-missing markers) and Output PI = P(Z=1)/2 + P(Z=2). Uncheck all other options, and click Run.

This will take a few minutes. Upon completion, two spreadsheets are output: IBD Estimate: Estimated PI and IBS Distance ((IBS2 + 0.5*IBS1)/# non-missing markers).

IBD Estimate: Estimated PI gives an N x N table, where N is the number of samples in the dataset. By plotting a heat map of this table we can detect patterns showing relatedness. IBS Distance ((IBS2 + 0.5*IBS1)/# non-missing markers) is also an N x N table; it reflects the Identity by State (IBS) between pairs of samples, and is the recommended kinship matrix to be used with the Mixed Linear Model Analysis tool we will be using later in this tutorial.
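The IBS distance formula named in the output, (IBS2 + 0.5*IBS1)/# non-missing markers, can be worked through for a single pair of samples. The sketch below assumes genotypes are coded as allele counts (0, 1, 2) with None for missing calls, so the number of alleles shared identical by state at a marker is 2 minus the absolute difference; the function name and encoding are illustrative, not part of SVS.

```python
def ibs_distance(g1, g2):
    """(IBS2 + 0.5 * IBS1) / number of markers non-missing in both samples.
    g1, g2: allele counts (0, 1, 2) per marker, None if missing."""
    ibs2 = ibs1 = n = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        n += 1
        shared = 2 - abs(a - b)  # alleles shared identical by state
        if shared == 2:
            ibs2 += 1
        elif shared == 1:
            ibs1 += 1
    return (ibs2 + 0.5 * ibs1) / n if n else float("nan")


# Toy pair: one IBS2 marker, one IBS1 marker, one IBS0 marker, one missing.
sample_a = [0, 1, 2, None]
sample_b = [0, 2, 0, 1]
value = ibs_distance(sample_a, sample_b)  # (1 + 0.5 * 1) / 3
```

Applying this to every pair of samples yields the N x N table described above, with 1 on the diagonal for samples that have at least one called marker.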

• From the IBD Estimate: Estimated PI spreadsheet, choose Plot > Heat Map (Uniform).

You’ll get the plot in Figure 2-1.

By default, the heat map has a three-color scheme calculated automatically. We want to define the color scheme manually, based on a two-color scheme where we look for sample pairs with a PI estimate of 0.25 or greater (PI of 0.25 = second degree relatives, 0.5 = first degree relatives, and 1 = identical twins or duplicate samples).

• Click on the IBD Estimate: Estimated PI node in the Graph Control Interface (upper-left portion of the window).

• On the Color tab choose Manual and then right-click the first option (0), and select Delete.

• Now right click on the first parameter, select Edit, and change it to 0.2. Your plot should look like Figure 2-2.

You can see several rectangular areas in the plot that show a high degree of relatedness in those samples. You can zoom in to see this more clearly by clicking and dragging a red box around one of these rectangular areas (Figure 2-3).

This dataset contains samples from 19 different cattle breeds. Pairs of samples from within the same breed should be more closely related to each other than a sample from one breed and a sample from a different breed. This is confirmed by the fact that the green rectangles you see along the diagonal in Figure 2-2 are areas where both samples are within the same breed.

Also, you might expect that pairs of samples within some breeds would be more highly related than pairs of samples within other breeds. The fact that some rectangles are darker green than others confirms that this is the case for this dataset.
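The search for sample pairs with a PI estimate of 0.25 or greater can also be expressed as a scan over the upper triangle of the N x N table. This is only a sketch with hypothetical sample names and toy values; the 0.25 cutoff is the one quoted in this section.

```python
def related_pairs(names, pi, cutoff=0.25):
    """Return (sample, sample, PI) for every pair at or above the cutoff.
    pi is a symmetric N x N table; only the upper triangle is scanned."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if pi[i][j] >= cutoff:
                pairs.append((names[i], names[j], pi[i][j]))
    return pairs


# Toy table (hypothetical values).
names = ["A", "B", "C"]
pi = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.26],
    [0.0, 0.26, 1.0],
]
flagged = related_pairs(names, pi)
```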



Figure 2-1. Default heat map of IBD PI estimates



Figure 2-2. Heat map with two colors



Figure 2-3. Zoomed in area around several related individuals



You can see a breakdown of the breed proportions for this dataset by opening the phenotype spreadsheet and creating a pie chart from the breed variable.

• Open the Phenotypes spreadsheet.

• Right-click on the Breed column and select Pie Chart.

Figure 2-4. Pie Chart of Breeds


Population Stratification

The next step is to identify samples that depart from the expected homogeneous ethnicity of your study. You can do this by performing principal component analysis on your data and comparing the first two principal components.

There are several ways to perform PCA. Some recommend using the pruned set of SNPs (as was done with IBD), some recommend using a filtered set of SNPs (on minor allele frequency and HWE, for example), and some recommend using the entire SNP set. There are advantages to each. Here we’ll use the pruned set of SNPs we created earlier.

• Open the Pruned SNP Subset spreadsheet and select Genotype > Genotype Principal Component Analysis.

• Under Principal Components, enter 10 for Find up to top ___ components.

• Leave the defaults for the rest of the options and click Run.

Two spreadsheets, the Principal Components (Additive Model) spreadsheet and the PC Eigenvalues (Additive Model) spreadsheet, result from the analysis. To find out how many principal components are required to explain the majority of the population stratification, a PCA plot will be created and the eigenvalues will be visually inspected.

Look at the PC Eigenvalues (Additive Model) spreadsheet. Notice that there is very little change between the seventh, eighth and ninth eigenvalues, implying that about 6 principal components explain the majority of stratification in the SNP data.
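The visual inspection of eigenvalues can be mimicked with a simple heuristic: count the leading components before the eigenvalues level off. The rule below (a plateau begins where every later step changes by less than a chosen relative amount) and the eigenvalues themselves are assumptions for illustration; they are not the values from the tutorial dataset, and SVS does not apply this rule for you.

```python
def components_before_plateau(eigenvalues, min_rel_drop=0.1):
    """Number of leading components before the eigenvalues plateau.
    A plateau starts at index k if every later consecutive drop is
    smaller than min_rel_drop, relative to the larger eigenvalue.
    Assumes eigenvalues are positive and sorted in descending order."""
    for k in range(len(eigenvalues) - 1):
        tail = eigenvalues[k:]
        drops = [(a - b) / a for a, b in zip(tail, tail[1:])]
        if all(d < min_rel_drop for d in drops):
            return k
    return len(eigenvalues)


# Toy eigenvalues (made up for the example): the seventh onward barely change.
toy_eigenvalues = [30.0, 9.0, 6.0, 4.0, 3.0, 2.2, 1.50, 1.45, 1.42, 1.40]
n_components = components_before_plateau(toy_eigenvalues)
```

A heuristic like this is only a starting point; the PCA plot that follows is the better check on whether the chosen number of components matches the visible cluster structure.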

You can visualize the population stratification by plotting the first few principal components against one another. We suspect that the breed structure in the sample will explain the stratification we are seeing. To confirm this assumption, we will color the PCA plot by breed. To be able to color by breed, we first need to join the Phenotypes spreadsheet with the Principal Components spreadsheet.

• Open the Phenotypes spreadsheet and select File > Join or Merge Spreadsheets.

• Select the Principal Components (Additive Model) spreadsheet.

• In the Join or Merge Spreadsheet window select the Current spreadsheet radio button under Spreadsheet as Child of, leave the rest of the parameters at their defaults, and click OK. The combined spreadsheet should look like Figure 3-1.

It is now possible to plot one component against the other and color-code each sample or data point according to its respective breed.

• From the Phenotypes + Principal Components (Additive Model) - Sheet 1 spreadsheet, select Plot > XY Scatter Plots.

The XY Scatter Parameters dialog appears with two list views. The list view on the left is for selecting the column (principal component) to represent the independent or X axis. The list view on the right is for selecting a single or multiple columns (principal components) to represent the dependent or Y axis.

• In the left list box select EV = 31.2575. In the right list box check EV = 9.00414 and click Plot.

If we color each data point according to its respective breed, the clusters become more obvious.



Figure 3-1. Phenotypes added to the Principal Components spreadsheet

• In the Graph Control Interface in the upper-left pane of the Plot Viewer, select the Item EV = 9.00414.

• Select the Color tab and select the By Variable radio button.

• Click Select Variable and select Breed from the list. Click OK.

You can see in the plot (Figure 3-2) that there are about 5 to 6 different clusters of samples, depending on whether you believe the Romagnola breed is on its own or a part of the larger group that includes the Piedmontese breed. This confirms our assumption when examining the eigenvalues that about 6 principal components explain the majority of the stratification in the data.

• When finished, close the Plot Viewer and rename its associated node (under the Phenotypes spreadsheet) in the Project Navigator to PCA Plot.



Figure 3-2. PCA plot colored by Breed


Preparing Data and Running Analysis

Preparing Dataset

To begin analysis we must first join together all the necessary spreadsheets.

• Open the High Quality Samples and Markers spreadsheet and reactivate all markers by going to Select > Activate All.

• This will create a new spreadsheet called High Quality Samples and Markers - Sheet 2. From this spreadsheet go to File > Join or Merge Spreadsheets and select the Phenotypes spreadsheet.

• Set the New dataset name: to Phenotype + High Quality Data and choose Project root under the Spreadsheet as Child of option. Leave the rest of the options as default and click OK.

Running the Analysis

• Open the Phenotype + High Quality Data spreadsheet and left-click the Phenotype1 column label header. This will turn the column magenta, denoting the column as the dependent variable.

• Choose Genotype > Mixed Linear Model Analysis.

Through the quality assurance parts of this tutorial we have learned that the dataset has a lot of cryptic relatedness as well as population stratification due to the breed structure. The Mixed Model approach is a good fit for this type of data: we can adjust for relatedness with the random effects component of the model using a kinship matrix (IBS), and also include additional fixed effects (Breed) using the covariates options.
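In common mixed-model notation (the symbols here are ours and are not taken from the SVS dialogs), the approach described above can be sketched as follows. Here y is the phenotype (Phenotype1), X holds the fixed effects (the intercept, the Breed covariates and the marker being tested), u is the random effect whose covariance is proportional to the kinship matrix K (here the IBS matrix), and the two variance components are the genetic and residual variances. Consult the SVS manual for the exact parameterization the tool uses.

```latex
y = X\beta + u + \varepsilon, \qquad
u \sim N\!\left(0,\ \sigma_g^2 K\right), \qquad
\varepsilon \sim N\!\left(0,\ \sigma_e^2 I\right)
```

Relatedness and breed structure enter the model in two different places: shared ancestry between samples is absorbed by K, while Breed shifts the expected phenotype through the fixed effects in X.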

• On the MLM Parameters tab, check the Mixed Model GWAS group box and within these options select both the Single-locus mixed model GWAS (EMMAX) and Multi-locus mixed model GWAS (MLMM) analysis options.

• Check Use Pre-Computed Kinship Matrix (Cov. Matrix or Random Effects), and then click Select Sheet. Choose the IBS Distance... spreadsheet to be used as the kinship matrix.

Note: For this tutorial we have chosen the IBS matrix for our kinship component, as that is what was recommended by the authors of the EMMAX method. Other analysis tools may recommend using the IBD matrix or even a GRM computed from another source (e.g. GBLUP Genomic Relationship Matrix). The authors of the EMMAX method describe why they believe the IBS matrix is the best choice for their method in a paper available on PubMed here.

• Check Correct for Additional Covariates and then Add Columns. Choose the Breed column from the spreadsheet.


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3092069/


• Under the Additional Outputs tab, make sure the False discovery rate (FDR), Output data for P-P/Q-Q Plots, Output -log10(P), and Allele Frequencies options are checked, and uncheck all other options.

• Once the dialog looks like Figures 4-1 and 4-2, click OK.

Figure 4-1. Mixed Model Options Dialog.

It will take a few minutes to finish, as the Multi-locus portion needs to go through several iterations of the EMMAX algorithm. Once it is finished you should get four resulting outputs from the analysis. We will look at each of these in the following sections.



Figure 4-2. Mixed Model Additional Options Dialog.


Single-Locus Mixed Model

The P-Values from Single-Locus Mixed Model output is from a single EMMAX run that tested each active marker in the dataset. Significant p-values can be plotted in a Manhattan plot, and Q-Q plots can be created from the expected values to test the fit of the model.

• To create a Manhattan Plot, right-click on the -log10(P-Value) column and select Plot Variable in GenomeBrowse.

• In the Plot Tree window, click either of the two -log10(P-Value) plot nodes, and under the Style tab of the Controls window select Style By: Chromosome.

Figure 5-1. Manhattan Plot



You can zoom into the peak visible on chromosome 17 by selecting the zoom tool on the zoom toolbar, then pressing the left mouse button to the left of the peak, dragging the cursor to the right of the peak while holding the button down, and releasing the button to complete the zoom. Repeat this process until you reach the zoom level you prefer.

Once you have zoomed in far enough, the names of the markers become visible in the plot. For this dataset we have two markers (HapMap57042-rs29016514 and ARS-BFGL-BAC-36625) that are highly significant in this region, with many supporting markers around them.

Since both markers behave very similarly, we suspect that they are in high LD with one another. To test this theory we can add an LD plot directly above the p-value plot.

• Click the Plot button in the upper left corner of the GenomeBrowse window.

• On the Add Data Sources dialog that appears, select the Project option on the left and the Bovine HapMap Genotypes spreadsheet in the Navigator Window Notes. Finally, check LD under the Plot Data options.

Figure 5-2. Adding LD Plot to GenomeBrowse Window

• Once the dialog looks like Figure 5-2, click Plot & Close to add the LD Plot to your existing GenomeBrowse window.

As we expected, the LD plot shows a large red area around our markers of interest, indicating that they are in high LD with each other.

• Clicking on the dark red rectangle on the LD Plot below the two markers of interest will show the computed LD value for those markers in the Console window.

• To test the fit of the model, return to the P-Values from Single-Locus Mixed Model spreadsheet and choose Plot > XY Scatter Plots. Select -log10(Expected P) on the left side and -log10(P-value) on the right side, then click Plot.

Figure 5-3. Combined LD and P-value GenomeBrowse plots.

Figure 5-4. R-Squared Value between Markers.

• The y = x line can be added to the plot by clicking the Graph1 node in the Graph Control Interface, then on the Add Item tab checking the f(x) = mx + b item and clicking Add.
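The expected-versus-observed comparison behind a Q-Q plot can be sketched in a few lines. The expected quantile used here, (rank - 0.5)/n under a uniform null distribution, is one common convention and is an assumption; SVS may compute the -log10(Expected P) column with a different formula.

```python
import math


def qq_points(p_values):
    """Pair expected and observed -log10(p), largest values first.
    Expected p for the i-th smallest p-value (i = 1..n) is (i - 0.5) / n."""
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    expected = [-math.log10((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))


# Toy p-values (hypothetical).
points = qq_points([0.1, 0.01])
```

Points that follow the y = x line indicate that the model fits the bulk of the markers, while points rising above the line at the upper right correspond to markers more significant than expected by chance.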



Figure 5-5. Expected vs. Observed -log10 P-values.


Multi-Locus Mixed Model Analysis

We will use the remaining three outputs, P-Values from Multi-Locus Mixed Model, MLMM Step Information and Covariate P-values, and Variance Partition Plot, to examine the Multi-Locus portion of the analysis.

MLMM Step Information and Covariate P-values

For complex traits controlled by several large-effect loci, a single-locus test may not be appropriate, especially in the presence of population structure. Therefore, the MLMM process is available, which is a simple stepwise mixed-model regression that works as follows:

1. Begin with an initial model that includes, as its fixed effects, only the intercept and any additional covariates you may have specified.

2. Using this model, perform an EMMAX scan through all markers (that you have not specified as additional covariates).

3. From the markers scanned above, select the most significant marker and add it to the model as a new fixed-effect covariate (“cofactor”), creating a new model.

4. Repeat (2) and (3) (forward inclusion) until a pre-specified maximum number of forward steps is reached or until forward inclusion must be stopped for some other reason (see note below).

5. For each selected marker in the current model, temporarily remove it from the fixed effects and perform an EMMAX scan over only that marker.

6. Eliminate from the current model the marker that came out as least significant using the above test. A new, smaller model is created.

7. Repeat (5) and (6) (backward elimination) until only one selected marker is left.

The variance components are re-estimated between each forward and backward step, while the same kinship matrix is used throughout the calculations.

Note: The following are reasons why forward inclusion might be stopped early:

• When the pseudo-heritability estimate has become close to zero.

• When all of the variance is explained by the fixed-effect covariate markers (cofactors) currently in the model and any covariates you may have specified.

• When the latest significant marker that has been selected is collinear with the fixed-effect covariate markers (cofactors) already in the model and any covariates you may have specified.

• When the model now requires more active samples for analysis than exist in the spreadsheet.

• When the number of active (non-cofactor) markers is no longer enough to permit analyzing the current model.
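The forward-inclusion part of the procedure (steps 1 to 4) can be illustrated with a deliberately simplified sketch. It swaps the EMMAX scan for ordinary least squares, so there is no kinship matrix and no variance-component re-estimation, and it ranks markers by the reduction in residual sum of squares rather than by a p-value. It omits backward elimination, and the only stopping rules it applies are the step limit and collinearity. All names and data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(v, basis):
    """Remove from v its projection onto each orthonormal basis vector."""
    r = list(v)
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r


def forward_inclusion(y, markers, max_steps=3, tol=1e-9):
    """Greedy forward selection. markers: dict of name -> genotype counts.
    Returns marker names in the order they were added as cofactors."""
    n = len(y)
    basis = [[1.0 / n ** 0.5] * n]   # step 1: intercept-only model
    selected = []
    for _ in range(max_steps):
        ry = residualize(y, basis)
        best, best_gain, best_rx = None, 0.0, None
        for name, x in markers.items():          # step 2: scan all markers
            if name in selected:
                continue
            rx = residualize(x, basis)
            ss = dot(rx, rx)
            if ss < tol:                          # collinear with the model
                continue
            gain = dot(ry, rx) ** 2 / ss          # drop in residual SS
            if gain > best_gain:
                best, best_gain, best_rx = name, gain, rx
        if best is None:
            break
        norm = dot(best_rx, best_rx) ** 0.5       # step 3: add as cofactor
        basis.append([v / norm for v in best_rx])
        selected.append(best)
    return selected


# Toy data: y = 3 * m1 + m2; m3 is constant, hence collinear with the intercept.
markers = {
    "m1": [0, 1, 2, 0, 1, 2],
    "m2": [0, 0, 1, 1, 2, 2],
    "m3": [2, 2, 2, 2, 2, 2],
}
y = [0, 3, 7, 1, 5, 8]
order = forward_inclusion(y, markers)
```

In the toy data m1 is chosen first because it explains most of the phenotype, m2 is chosen once m1 is in the model, and m3 is never eligible because it adds nothing beyond the intercept.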



• Open the MLMM Step Information and Covariate P-values spreadsheet to see the statistics for each of the steps that were run on this dataset.

Figure 6-1. MLMM Step Information and Covariate P-values

The result of this stepwise regression is a series of models. Several model criteria have been explored by the authors of this method to determine how appropriate any of the models is. See Model Criteria for a full description of each criterion.

• Examining the Optimal according to: column for this dataset, you can see that the After Step 1 results are most appropriate according to several model criteria.

• The variance of the phenotype for the model resulting from each step of the analysis is summarized in the Variance Partition Plot, as well as in the MLMM Step Information and Covariate P-values spreadsheet.

P-Values from Multi-Locus Mixed Model

From the MLMM Step Information and Covariate P-values output we learned that the After Step 1 portion of the output is optimal.

• Open the P-Values from Multi-Locus Mixed Model spreadsheet and scroll to the group of columns that begins with P-Value (After Step 1) and ends with Manhattan Category (After Step 1). For this example this group includes columns 10 through 18. If more output options had been selected on the Additional Outputs tab of the analysis dialog, there would have been additional columns for each group.

• Right-click on the -log10(P-Value) column (11) and select Plot Variable in GenomeBrowse.

To color the plot in Manhattan Plot fashion we will use Manhattan Category (After Step 1) (Column 18).

• Click either -log10(P-Value) node in the Plot Tree, and on the Style tab of the Controls dialog pick Manhattan Category (After Step 1).

The difference between using this categorical data and just coloring with a standard Manhattan color scheme is that the SNPs added as covariates to the model (“cofactors”) are colored differently from the rest of the markers in the same chromosome, so that they can easily be differentiated. As seen in Figure 6-4 below, marker Hapmap57042-rs29016514 is the covariate that was added to the Step 1 model.

http://doc.goldenhelix.com/SVS/latest/svsmanual/mixedModelMethods/overview.html#model-criteria

Figure 6-2. Variance Partition Plot

Figure 6-3. After Step 1 Output

Note: You may wish to scroll down to Cofactor in the Style: box and increase the plot point size of Cofactor to 10 or 12 to better distinguish cofactor plot points, since the Cofactor plot point color may otherwise be hard to distinguish from the plot point color for some of the chromosomes.

Figure 6-4. After Step 1 Output colored by Manhattan Category

We now want to examine the results of the analysis to determine if any marker other than Hapmap57042-rs29016514 shows significance, since the object of this method is to look for several large-effect markers working together that affect the trait.

• Open the P-Values from Multi-Locus Mixed Model output, right-click the P-Value (After Step 1) column, and select Sort Ascending.

We see from Figure 6-5 that, aside from our covariate marker, there are several markers that are marginally significant (p-value less than 1e-4), but that none of them pass multiple testing correction based on the False Discovery Rate (FDR) (Column 16).
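To show why a marginally significant raw p-value can fail a false discovery rate correction, here is a sketch of the Benjamini-Hochberg adjustment. Whether the SVS FDR column uses exactly this procedure is an assumption; the p-values are toy numbers, not results from this analysis.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.
    Each p-value is scaled by m / rank, then made monotone from the
    largest rank downward."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (hypothetical).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

With tens of thousands of markers tested, the multiplier m / rank is large for all but the top-ranked markers, which is why a raw p-value near 1e-4 may still have an adjusted value well above the usual 0.05 cutoff.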



Figure 6-5. Marker Significance based on FDR

