
GenSel4, August 2011

Command Line

GenSel can be run from the command line

For example gensel4 (provided the path is set appropriately)

GenSel can be run from the BIGS web interface

GenSel jobs can be submitted to the queue on the HPC using the bigscli command from the unix interface of BIGS

Usage

gensel input_file_name

nohup gensel input_file_name -s status_file_name

Genotypes & Phenotypes

Required for all analyses

trainPhenotypeFileName

markerFileName

Used for analysis

Read by GenSel4

Genotype File Structure

Space delimited unix file (dos2unix to convert)

Header row plus one row for each animal

column for ID then a column for each genotype

One header row: alphanumeric labels for each genotype/locus

One row for each animal: alphanumeric ID followed by all the genotypes

-10, 0 or 10 for AA, AB or BB (no support for missing genotypes)

Ordered by genomic location if no map file

Read in binary format (file names end in .newbin)

Text files are converted to binary in the first analysis

Must be same number of columns in every row
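The layout rules above can be checked before submitting a job. The following is a minimal Python sketch and not part of GenSel; the function name check_genotype_lines is hypothetical, and it assumes only the space-delimited layout described on this slide.

```python
def check_genotype_lines(lines):
    """Check the space-delimited genotype layout described above.

    Returns (marker_names, animal_ids); raises ValueError on a malformed row.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("empty genotype file")
    header, body = rows[0], rows[1:]
    marker_names = header[1:]  # the first header field labels the ID column
    animal_ids = []
    for row_number, row in enumerate(body, start=2):  # counts non-blank lines
        if len(row) != len(header):
            raise ValueError(
                f"row {row_number} has {len(row)} columns, expected {len(header)}"
            )
        for value in row[1:]:
            float(value)  # genotypes must be numeric; raises ValueError otherwise
        animal_ids.append(row[0])
    return marker_names, animal_ids
```

In practice the lines would come from the opened text file, for example check_genotype_lines(open(path)), where path is whatever file is named by markerFileName.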

Example Genotype File

ISU_ID.bt isu_1 nadc_1 isu_2 isu_3 isu_4

ISU_Angus_1 -10 0 10 10 0

AAA00001 -10 10 -10 10 0

Tag_number_ab -10 -10 0 10 0

Casanova_bull -10 10 6.5 -10 10

Disk requirements for 5,000 bovine 50k genotypes in text form are about 1 GB (the same file in binary format is typically half the size)

Species are designated by the first letters of genus and species: bt = Bos taurus; hs = Homo sapiens; oa = Ovis aries; ss = Sus scrofa, etc.

This will later provide functionality for species-specific genome browsing

Phenotype File Structure

Space delimited unix file

Separate phenotype file for each trait

Header row plus one row for each animal with phenotype

Alphanumeric animal ID must be in column 1

Trait value must be in column 2 (label in header)

Remainder of file is arbitrary but defines the model for the trait

Recommend to at least include a column of 1’s for the mean

Columns are headed by alphanumeric labels; all rows have the same number of columns

Columns headed by name ending in $ are class variables

Columns headed by other names are covariates

Columns headed by a name ending in # are ignored

Column headed by rinverse specifies a weighted analysis
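The header conventions above can be sketched as a small classifier. This is a hedged Python illustration, not GenSel code; classify_phenotype_header is a hypothetical name, and it assumes the rinverse label is matched exactly as written.

```python
def classify_phenotype_header(header_line):
    """Assign each phenotype-file column the role implied by its header label."""
    names = header_line.split()
    if len(names) < 2:
        raise ValueError("need an ID column and a trait column")
    roles = {
        "id": names[0],       # column 1: animal ID
        "trait": names[1],    # column 2: trait value
        "class": [],          # labels ending in $
        "covariate": [],      # any other label
        "ignored": [],        # labels ending in #
        "weight": None,       # the rinverse column, if present
    }
    for name in names[2:]:
        if name == "rinverse":
            roles["weight"] = name
        elif name.endswith("$"):
            roles["class"].append(name)
        elif name.endswith("#"):
            roles["ignored"].append(name)
        else:
            roles["covariate"].append(name)
    return roles
```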

Example Phenotype File

Animal IQ mean dob Sex$ Family# rinverse

A_1 100 1 100 male 1 1.0

B_2 95 1 105 female 1 0.9

C.12345 103 1 97 spey 2 .95

Spot 110 1 90 male 2 1.1

rinverse is only proportional (scalar variance factored out)

Covariates must be numbers!

Categorical traits must be numbered from 1 upwards

Trait in column 2 (not required for prediction)

Sensible to at least have the mean

Model does not need to be full rank

GenSel matches IDs

Only records with the same alphanumeric ID in the genotype and phenotype file are available for subsequent analysis

Start of analysis reports the number of animals in the genotype file, phenotype file and matching records
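The matching step can be pictured as an intersection of the two ID columns. The sketch below is an illustration in Python rather than GenSel's own code; match_ids is a hypothetical helper, and it assumes IDs are unique within each file and compared as exact strings.

```python
def match_ids(genotype_ids, phenotype_ids):
    """Return the counts reported at the start of an analysis and the matched IDs."""
    genotype_set = set(genotype_ids)
    # Keep phenotype-file order for the animals that also have genotypes.
    matched = [animal for animal in phenotype_ids if animal in genotype_set]
    counts = {
        "genotype": len(genotype_ids),
        "phenotype": len(phenotype_ids),
        "matched": len(matched),
    }
    return counts, matched
```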

Genotypes & Map Files

GenSel now supports the use of a map file

A map file provides chromosome and basepair position information for at least one build

Can support any number of builds

A map file may provide multiple aliases for marker names

Every marker name from the genotype file must exist somewhere in the map file

Additional marker names can be in the map file.

Map File Structure

Rs_num Ss_num ISU_ID UMD_chr UMD_pos BTA_chr BTA_pos

Rs_001 101 isu_1 1 100000 1 95123

1234 102 isu_2 2 1234567 2 1500000

5678 103 isu_9 2 987654321 2 10000000

910a 104 isu_5 X 0 PAR 2543

newS newS nadc_1 unk 0 unk N/A

Space delimited unix text file

Map File Options

The minimum requirements are

mapFileName

linkageMap (options depend upon your map file), e.g. UMD or BTA for the example on the previous page

This will result in columns of the genotype file being sorted into genomic order to facilitate formation of contiguous marker windows – automatically formed in 1Mb sizes

Options include

addMapInfoToMarkers yes

Results in chromosome and base pair position added to output

outputMarkerHeaderName (options are aliases in your map file)
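The sorting into genomic order and the grouping into 1 Mb windows described above might look roughly like this. It is a Python sketch under stated assumptions, not GenSel's implementation: order_and_window is a hypothetical name, chromosome labels are sorted as text, and windows are assumed to start at position 0 of each chromosome.

```python
def order_and_window(markers, window_bp=1_000_000):
    """markers: list of (name, chromosome, position_bp) tuples.

    Returns the markers sorted by (chromosome, position) and a dict mapping
    (chromosome, window_index) to the marker names falling in that window.
    """
    ordered = sorted(markers, key=lambda m: (str(m[1]), m[2]))
    windows = {}
    for name, chrom, pos in ordered:
        key = (str(chrom), pos // window_bp)  # integer division gives the window index
        windows.setdefault(key, []).append(name)
    return ordered, windows
```

Sorting chromosome labels as text places "10" before "2"; a real tool would need whatever chromosome order the chosen linkageMap implies.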

Filtering Genotypes

4 methods to filter columns of the genotype file for analysis

Two approaches are always available: includeFileName or excludeFileName

These files contain a list of marker names as in the genotype file header that are to be included or excluded

Include takes precedence over exclude

Two other approaches are available if a map file is used: windowIncFileName or windowExclFileName

List of chromosome_names to include/exclude entire chromosome

List of chromosome_name start_bp end_bp
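One reading of "include takes precedence over exclude" is sketched below in Python. The exact GenSel behaviour is not spelled out on the slide, so this is an assumption: when an include list is given, only those markers are kept and the exclude list is ignored. filter_markers is a hypothetical name.

```python
def filter_markers(marker_names, include=None, exclude=None):
    """Filter genotype-file columns by marker name, keeping file order."""
    include = set(include or [])
    exclude = set(exclude or [])
    if include:
        # Assumed precedence: an include list overrides any exclude list.
        return [name for name in marker_names if name in include]
    return [name for name in marker_names if name not in exclude]
```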

Map files & SNP names

Sometimes the genotype file uses one marker name (eg database numeric ID), but the marker output file would benefit from having a different name (eg rs number)

Given a map file, Predict can cross reference the different marker names so you can exchange marker results (.mrkRes) files with other users

Output File Name Conventions

Suppose GenSel is run using gensel4 demo.inp

The root for all output files will be “demo”

All options will produce output to demo.out# where # is the next available integer not already used

The first run produces demo.out1, the next demo.out2 etc

Most other options produce additional files that will have the same root name and the same suffix number as the .out file

demo.LD1, demo.mrkRes1, demo.ghat1, demo.winVar1 etc
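The numbering rule can be illustrated with a few lines of Python. This is only a sketch: next_output_name is a hypothetical helper, and "next available integer" is assumed here to mean the smallest positive integer not already used by a .out file with the same root.

```python
import re


def next_output_name(root, existing_names):
    """Return root.outN for the smallest positive N not already used."""
    pattern = re.compile(re.escape(root) + r"\.out(\d+)")
    used = set()
    for name in existing_names:
        match = pattern.fullmatch(name)
        if match:
            used.add(int(match.group(1)))
    number = 1
    while number in used:
        number += 1
    return f"{root}.out{number}"
```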

Analysis Options

Many calculations are time consuming

Computing window variance

Validating predictive accuracy in test data

Computing PEV and R2

These are only done in some iterations according to the outputFrequency option

Default is 100, so these calculations occur in 1% of iterations

Markov chains use many random numbers

The seed option (default 1234) can be used to alter sampling

Print

analysisType Print

This can be used to get a printout of the X matrix, ordered by map position if a map file is used, for just those animals in the genotype and phenotype file

The output contains the covariates on a 0, 1, 2 scale, before centering, not on the -10, 0, 10 scale used in the marker genotype file
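The relationship between the two scales can be written as a one-line conversion. The Python sketch below assumes the mapping is linear, which is consistent with -10, 0, 10 corresponding to 0, 1, 2; to_allele_count is a hypothetical name.

```python
def to_allele_count(code):
    """Map a -10/0/10 genotype code to the 0/1/2 scale: (code + 10) / 10."""
    return (code + 10) / 10
```

Under the same linear assumption, a fractional code such as the 6.5 in the example genotype file would map to 1.65.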

LD

analysisType LD

This computes the pairwise squared correlation between every pair of markers in the filtered genotype file

Also computes the minor allele frequencies (MAF)

The output file will be very large if you don’t filter it

Only squared correlations exceeding minLDoutput are stored

minLDoutput (default 0.1)
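The two quantities reported by this analysis can be sketched in plain Python. This is an illustration of the standard definitions, not GenSel's code; the function names are hypothetical, genotypes are assumed to be on the 0/1/2 scale, and a monomorphic marker is given a squared correlation of 0 here.

```python
def minor_allele_frequency(counts):
    """counts: 0/1/2 copies of one allele, one value per animal."""
    p = sum(counts) / (2 * len(counts))
    return min(p, 1 - p)


def squared_correlation(x, y):
    """Squared Pearson correlation between two marker columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a monomorphic marker
    return sxy * sxy / (sxx * syy)


def ld_pairs(genotypes, min_ld_output=0.1):
    """genotypes: dict of marker name -> list of counts.

    Returns (marker_a, marker_b, r2) for every pair with r2 above the threshold.
    """
    names = list(genotypes)
    kept = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r2 = squared_correlation(genotypes[names[i]], genotypes[names[j]])
            if r2 > min_ld_output:
                kept.append((names[i], names[j], r2))
    return kept
```

The double loop makes the cost grow with the square of the number of markers, which is why filtering the genotype file first keeps the output manageable.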

StepWise

analysisType StepWise

Computes (unweighted) forward and reverse submodels after first fitting all the fixed effects

R2 is defined as the proportion of sums of squares after the fixed effects

Three options control the model

inputMaxRsquared (default 0.8) will stop the analysis

inputMaxMarkers (default 100) will stop the analysis

alphaValue (default 0.05) controls significance

Bayes

analysisType Bayes

bayesType BayesB

Metropolis-Hastings

Gibbs Sampling

bayesType BayesA (Actually just BayesB with pi=0)

bayesType BayesC

bayesType BayesCPi (Actually BayesC but with pi estimation)

bayesType RBR (Robust Bayesian Regression; really BayesB but with estimation of pi, scale and genetic df)

FindScale options (no, yes) or for BayesCPi (thruPi)

Bayes Priors

Priors and associated degrees of freedom are required for the genetic and residual variance

genVariance (default 1)

degreesFreedomEffectVar (default 4)

resVariance (default 1)

nuRes (default 10)

Better estimates of genVariance and resVariance should be used

From knowledge of heritability and phenotypic sd
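As a worked example of deriving starting values from heritability and phenotypic standard deviation, the Python sketch below uses the usual decomposition of phenotypic variance. It assumes genVariance refers to the total additive genetic variance and that the residual is everything else; variance_priors is a hypothetical helper.

```python
def variance_priors(heritability, phenotypic_sd):
    """Return (genVariance, resVariance) from h2 and the phenotypic sd.

    genetic variance  = h2 * phenotypic variance
    residual variance = phenotypic variance - genetic variance
    """
    if not 0 <= heritability <= 1:
        raise ValueError("heritability must lie between 0 and 1")
    phenotypic_variance = phenotypic_sd ** 2
    gen_variance = heritability * phenotypic_variance
    res_variance = phenotypic_variance - gen_variance
    return gen_variance, res_variance
```

For a trait with heritability 0.25 and phenotypic sd 10, this gives a genetic variance of 25 and a residual variance of 75.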

Bayes Options

All analysisType Bayes jobs have extra options

burnIn is the number of iterations in the chain to discard

Probably doesn’t need to be very many (eg 1,000)

chainLength is the number of iterations in the chain

Typically use 41,000 or more (this includes burnIn)

Mixture models (BayesB, BayesC, BayesCPi, RBR) assume a fraction (1-pi) of markers have an effect and pi have 0 effect

Option is for example probFixed 0.95
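The arithmetic behind these options is simple enough to check by hand. The following Python sketch (chain_summary is a hypothetical name) computes how many iterations remain after burn-in and how many markers are expected to have a nonzero effect under a given probFixed.

```python
def chain_summary(chain_length, burn_in, prob_fixed, num_markers):
    """Return (iterations kept after burn-in, expected markers with an effect)."""
    if not 0 <= burn_in < chain_length:
        raise ValueError("burnIn must be smaller than chainLength")
    kept = chain_length - burn_in  # chainLength includes burnIn
    expected_in_model = (1 - prob_fixed) * num_markers
    return kept, expected_in_model
```

With chainLength 41,000, burnIn 1,000, probFixed 0.95 and 50,000 markers, 40,000 iterations are kept and about 2,500 markers are expected to have an effect.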

Bayes Options

BayesB (and therefore BayesA = B0) used to use a Metropolis-Hastings rather than a Gibbs sampler

MHG did 100 MH iterations

Our fast version used a different proposal distribution and required no more than 10 MH iterations

You can specify numMHIter

Long developed an alternative sampler that does not use MH

You select this option using numMHIter 0

It is faster, the same speed as BayesC

Bayes Options

The 1 Mb windows formed using a map file can be used to compute the variance of the window

This is turned on using windowBV yes

Note the number of markers in each window varies with SNP density along the genome (many markers for chrom unk)

This provides posterior distributions of windows so that the previous Permute and Bootstrap options are no longer needed or supported

In the absence of a map file, the columns in the genotype file are assumed to be consecutive, and the number of markers in a window are defined by the windowWidth option

The default is 5

Automatically get graphs of posteriors and table of variances

Note window Variances typically don’t sum to 100 due to nonzero covariances
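The reason window variances need not add up to the total is that the variance of a sum includes covariance terms. The Python sketch below illustrates this with made-up per-window genomic values; it is not how GenSel computes its output, and window_variance_shares is a hypothetical name. Shares are expressed as a fraction of the variance of the summed values.

```python
from statistics import pvariance


def window_variance_shares(window_values):
    """window_values[w][i] is the value window w contributes to animal i.

    Returns each window's variance as a fraction of the variance of the total.
    """
    totals = [sum(values) for values in zip(*window_values)]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("total genomic values have zero variance")
    return [pvariance(values) / total_variance for values in window_values]
```

Two uncorrelated windows give shares that sum to 1, whereas two perfectly correlated windows give shares of 0.25 each, because half of the total variance sits in the covariance term.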

Predict

analysisType Predict

markerSolFilename defines the name of a .mrkRes file from a previous training analysis

windowWidth defines the number of markers in a consecutive window from which the overlapping window variances are computed

windowBV yes will result in a file full of ghats with a row for each animal and a column for each overlapping window

GenerateData

Randomly chooses 1-probFixed proportion of loci to be QTL

Samples QTL effects and residual effects according to normal distributions with mean 0 and variance determined by varGenotypic and varResidual

Outputs the simulated genotypes and phenotypes

Phenotypes will be categorical if isCategorical yes with as many categories as specified by numCat (default 2)

Categories will be equal sizes unless specified by the option PortOfCat (eg 0.70:0.20) if numCat 3
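A continuous-trait version of this simulation can be sketched as follows. This is a Python illustration under several assumptions, not GenSel's procedure: simulate_phenotypes is a hypothetical name, at least one locus is always chosen as a QTL, and the genetic values are rescaled so that their variance equals var_genotypic (the slide only says the variance is "determined by" varGenotypic).

```python
import random
from statistics import pvariance


def simulate_phenotypes(genotypes, prob_fixed, var_genotypic, var_residual, seed=1234):
    """genotypes: one list of marker covariates per animal.

    Returns a dict with the chosen QTL indices, genetic values and phenotypes.
    """
    rng = random.Random(seed)
    num_loci = len(genotypes[0])
    num_qtl = max(1, round((1 - prob_fixed) * num_loci))
    qtl = sorted(rng.sample(range(num_loci), num_qtl))
    effects = {locus: rng.gauss(0.0, 1.0) for locus in qtl}
    genetic = [sum(row[locus] * a for locus, a in effects.items()) for row in genotypes]
    raw_variance = pvariance(genetic)
    if raw_variance == 0:
        raise ValueError("chosen QTL do not vary among animals")
    # Assumed scaling: force the genetic values to have variance var_genotypic.
    scale = (var_genotypic / raw_variance) ** 0.5
    genetic = [g * scale for g in genetic]
    residual_sd = var_residual ** 0.5
    phenotypes = [g + rng.gauss(0.0, residual_sd) for g in genetic]
    return {"qtl": qtl, "genetic": genetic, "phenotypes": phenotypes}
```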

Validation

There are two options for validation

Validation can be done jointly with the training analysis

trainPhenotypeFileName

testPhenotypeFileName

If no testPhenotypeFileName, training data is used

This will produce ghat, PEV and R2 for validation animals

Validation can be done in a later session from training

This will produce ghat but no PEV or R2

All columns of phenotype file are copied into the ghat file to facilitate downstream analysis

Graphing Posteriors

Various posterior distributions will be output if desired using the key word plotPosteriors yes

Samples used in the graphs are in .mcmcSamples which can be produced without graphing if mcmcSamples yes

Requires that gnuplot is installed on the machine in a location accessible using the defined path

Categorical Options

All analysisType Bayes will do categorical analyses if the option isCategorical yes is used

Categories must start from 1, and be ordered without missing categories
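The category rule can be checked before running a categorical analysis. The Python sketch below is an illustration only; check_categories is a hypothetical name, and it assumes the trait column has already been read as integers.

```python
def check_categories(values):
    """Verify that categories are 1, 2, ..., k with none missing; return k."""
    observed = sorted(set(values))
    expected = list(range(1, len(observed) + 1))
    if observed != expected:
        raise ValueError(f"categories {observed} are not consecutive from 1")
    return len(observed)
```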

Required Libraries

Many routines use matvec libraries

Most matrix and vector computations use Eigen3

GSL is no longer used

Boost is used (only for format statements)

Limited use of STL

Graphics options require gnuplot

Environment must include paths to gnuplot (/opt/local/bin)

R version

We are developing an R version that will allow you to run any or all of the options from R

Also allow you access to variables created during the analysis

Hope to allow you to replace existing procedures with your own for prototyping new methods or features

Planned Developments

Addition of partial least squares (PLS), Bayesian Lasso

Addition of further random factors beyond the genotypes

Using pedigree, genomic or identity variance-covariance matrices

Extension to multiple trait analysis

Implementation using CUDA graphics processors
