GenSel4, August 2011
Command Line
GenSel can be run from the command line
For example gensel4 (provided the path is set appropriately)
GenSel can be run from the BIGS web interface
GenSel jobs can be submitted to the queue on the HPC using the bigscli command from the unix interface of BIGS
Usage:
gensel input_file_name
nohup gensel input_file_name -s status_file_name
Genotypes & Phenotypes
Required for all analyses:
trainPhenotypeFileName
markerFileName
Used for analysis
Read by GenSel4
Genotype File Structure
Space-delimited Unix file (use dos2unix to convert)
Header row plus one row for each animal
Column for ID, then a column for each genotype
One header row
Alphanumeric labels for each genotype/locus
One row for each animal
Alphanumeric ID followed by all the genotypes
-10, 0 or 10 for AA, AB or BB (no support for missing genotypes)
Ordered by genomic location if no map file
Read in binary format (file names end in .newbin)
Text files are converted to binary in the first analysis
Must be the same number of columns in every row
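The column-count rule can be checked before a file is handed to GenSel. The following is a minimal sketch, not part of GenSel itself; the function name check_genotype_text is hypothetical, and the two data rows are taken from the example genotype file in these slides.

```python
def check_genotype_text(text):
    """Return (marker_names, rows) after checking every row matches the header width."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]
    rows = {}
    for fields in body:
        if len(fields) != len(header):
            raise ValueError(
                f"{fields[0]}: expected {len(header)} columns, found {len(fields)}"
            )
        # First field is the animal ID; the rest are genotype codes.
        rows[fields[0]] = [float(x) for x in fields[1:]]
    return header[1:], rows


example = """\
ISU_ID.bt isu_1 nadc_1 isu_2 isu_3 isu_4
ISU_Angus_1 -10 0 10 10 0
AAA00001 -10 10 -10 10 0
"""
markers, genotypes = check_genotype_text(example)
```

A row with too few or too many fields raises ValueError naming the animal, which is usually enough to locate the offending line.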
Example Genotype File
ISU_ID.bt isu_1 nadc_1 isu_2 isu_3 isu_4
ISU_Angus_1 -10 0 10 10 0
AAA00001 -10 10 -10 10 0
Tag_number_ab -10 -10 0 10 0
Casanova_bull -10 10 6.5 -10 10
Disk requirements for 5,000 bovine 50k genotypes in text form are about 1 GB (the same file in binary format is typically half the size)
Species are designated by the first letters of genus and species: bt = Bos taurus; hs = Homo sapiens; oa = Ovis aries; ss = Sus scrofa, etc.
This will later provide functionality for species-specific genome browsing
Phenotype File Structure
Space delimited unix file
Separate phenotype file for each trait
Header row plus one row for each animal with phenotype
Alphanumeric animal ID must be in column 1
Trait value must be in column 2 (label in header)
Remainder of file is arbitrary but defines the model for the trait
Recommend at least including a column of 1’s for the mean
Columns are headed by alphanumeric labels; all rows have the same number of columns
Columns headed by name ending in $ are class variables
Columns headed by other names are covariates
Columns ending in # are ignored
Column headed by rinverse specifies a weighted analysis
Example Phenotype File
Animal IQ mean dob Sex$ Family# rinverse
A_1 100 1 100 male 1 1.0
B_2 95 1 105 female 1 0.9
C.12345 103 1 97 spey 2 .95
Spot 110 1 90 male 2 1.1
rinverse is only proportional (scalar variance factored out)
Covariates must be numbers!
Categorical traits must be numbered from 1 upwards
Trait in column 2 (not required for prediction)
Sensible to at least have the mean
Model does not need to be full rank
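The header conventions can be summarised in a few lines of code. This is an illustrative sketch of the rules as stated on the slide, not GenSel's own parser; classify_columns is a hypothetical name, and treating rinverse as case-sensitive is an assumption.

```python
def classify_columns(header):
    """Map each phenotype-file column name to its role under the header rules."""
    roles = {}
    for position, name in enumerate(header.split()):
        if position == 0:
            roles[name] = "animal ID"
        elif position == 1:
            roles[name] = "trait"
        elif name.endswith("#"):
            roles[name] = "ignored"
        elif name.endswith("$"):
            roles[name] = "class variable"
        elif name == "rinverse":  # assumption: matched exactly as written
            roles[name] = "weight"
        else:
            roles[name] = "covariate"
    return roles


roles = classify_columns("Animal IQ mean dob Sex$ Family# rinverse")
```

For the example header, mean and dob come out as covariates, Sex$ as a class variable, Family# as ignored and rinverse as the weight column.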
GenSel matches IDs
Only records with the same alphanumeric ID in the genotype and phenotype file are available for subsequent analysis
Start of analysis reports the number of animals in the genotype file, phenotype file and matching records
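The matching step amounts to an intersection of the two ID lists. A minimal sketch follows; match_ids is a hypothetical helper, the phenotype ID list is made up for illustration, and exact string comparison of IDs is assumed.

```python
def match_ids(genotype_ids, phenotype_ids):
    """Return the IDs present in both files, kept in genotype-file order."""
    phenotyped = set(phenotype_ids)
    return [animal for animal in genotype_ids if animal in phenotyped]


genotype_ids = ["ISU_Angus_1", "AAA00001", "Tag_number_ab", "Casanova_bull"]
phenotype_ids = ["A_1", "AAA00001", "Casanova_bull", "Spot"]  # hypothetical list
matched = match_ids(genotype_ids, phenotype_ids)

# The three counts reported at the start of an analysis would correspond to:
counts = (len(genotype_ids), len(phenotype_ids), len(matched))
```

Comparing these counts against expectations is a quick way to spot ID formatting differences between the two files.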
Genotypes & Map Files
GenSel now supports the use of a map file
A map file provides chromosome and basepair position information for at least one build
Can support any number of builds
A map file may provide multiple aliases for marker names
Every marker name from the genotype file must exist somewhere in the map file
Additional marker names can be in the map file.
Map File Structure
Rs_num Ss_num ISU_ID UMD_chr UMD_pos BTA_chr BTA_pos
Rs_001 101 isu_1 1 100000 1 95123
1234 102 isu_2 2 1234567 2 1500000
5678 103 isu_9 2 987654321 2 10000000
910a 104 isu_5 X 0 PAR 2543
newS newS nadc_1 unk 0 unk N/A
Space delimited unix text file
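To illustrate how aliases and builds relate, the sketch below indexes every alias in a map file against the position columns of one build. It is not GenSel code: load_map is a hypothetical name, the data are the slide's example as it appears to be laid out, and the rule that build columns end in _chr and _pos while all other columns are aliases is an assumption drawn from that example.

```python
MAP_TEXT = """\
Rs_num Ss_num ISU_ID UMD_chr UMD_pos BTA_chr BTA_pos
Rs_001 101 isu_1 1 100000 1 95123
1234 102 isu_2 2 1234567 2 1500000
5678 103 isu_9 2 987654321 2 10000000
910a 104 isu_5 X 0 PAR 2543
newS newS nadc_1 unk 0 unk N/A
"""


def load_map(text, build):
    """Index every alias by name, giving (chromosome, position) for one build."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    chr_col = header.index(build + "_chr")
    pos_col = header.index(build + "_pos")
    # Assumption: any column not ending in _chr or _pos holds marker aliases.
    alias_cols = [i for i, name in enumerate(header)
                  if not name.endswith(("_chr", "_pos"))]
    lookup = {}
    for fields in rows:
        for i in alias_cols:
            lookup[fields[i]] = (fields[chr_col], fields[pos_col])
    return lookup


umd = load_map(MAP_TEXT, "UMD")
bta = load_map(MAP_TEXT, "BTA")
```

Positions are kept as strings because the example contains non-numeric entries such as unk and N/A.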
Map File Options
The minimum requirements are:
mapFileName
linkageMap (options depend upon your map file), eg UMD or BTA for my example on the previous page
This will result in columns of the genotype file being sorted into genomic order to facilitate formation of contiguous marker windows – automatically formed in 1Mb sizes
Options include:
addMapInfoToMarkers yes
Results in chromosome and base pair position added to output
outputMarkerHeaderName (options are aliases in your map file)
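The formation of 1 Mb windows from mapped markers can be pictured as follows. This is a sketch under stated assumptions rather than GenSel's algorithm: assign_windows and the marker names are hypothetical, and windows are assumed to start at position 0 of each chromosome.

```python
from collections import defaultdict

WINDOW_BP = 1_000_000  # 1 Mb windows, as stated on the slide


def assign_windows(markers):
    """Group (name, chromosome, position) markers by (chromosome, window index)."""
    windows = defaultdict(list)
    # Sorting by chromosome then position puts markers into genomic order.
    for name, chrom, pos in sorted(markers, key=lambda m: (m[1], m[2])):
        windows[(chrom, pos // WINDOW_BP)].append(name)
    return dict(windows)


# Hypothetical markers, deliberately listed out of genomic order.
markers = [("m3", "2", 1500000), ("m1", "1", 100000), ("m2", "2", 1234567)]
windows = assign_windows(markers)
```

Here m2 and m3 fall in the same window on chromosome 2 because both lie between 1,000,000 and 1,999,999 bp.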
Filtering Genotypes
4 methods to filter columns of the genotype file for analysis
Two approaches are always available:
includeFileName or excludeFileName
These files contain a list of marker names as in the genotype file header that are to be included or excluded
Include takes precedence over exclude
Two other approaches are available if a map file is used:
windowIncFileName or windowExclFileName
List of chromosome_names to include/exclude entire chromosome
List of chromosome_name start_bp end_bp
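The two line formats of a window include/exclude file can be handled as sketched below. The function names are hypothetical, the sample regions are made up, and treating start_bp and end_bp as inclusive bounds is an assumption.

```python
def parse_regions(text):
    """Parse lines of 'chrom' or 'chrom start_bp end_bp' into (chrom, start, end)."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:
            regions.append((fields[0], None, None))  # whole chromosome
        elif len(fields) == 3:
            regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def in_regions(chrom, pos, regions):
    """True if the marker is on a listed chromosome or inside a listed interval."""
    for r_chrom, start, end in regions:
        # Assumption: interval bounds are inclusive.
        if chrom == r_chrom and (start is None or start <= pos <= end):
            return True
    return False


regions = parse_regions("X\n2 1000000 2000000\n")
```

An include file would keep markers for which in_regions is true; an exclude file would drop them.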
Map files & SNP names
Sometimes the genotype file uses one marker name (eg database numeric ID), but the marker output file would benefit from having a different name (eg rs number)
Given a map file, Predict can cross reference the different marker names so you can exchange marker results (.mrkRes) files with other users
Output File Name Conventions
Suppose GenSel is run using gensel4 demo.inp
The root for all output files will be “demo”
All options will produce output to demo.out# where # is the next available integer not already used
The first run produces demo.out1, the next demo.out2 etc
Most other options produce additional files that will have the same root name and the same suffix number as the .out file
demo.LD1, demo.mrkRes1, demo.ghat1, demo.winVar1 etc
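The suffix rule can be mimicked when scripting around GenSel output. The sketch below is only an illustration: next_suffix is a hypothetical helper, and reading "next available integer" as the smallest positive integer not yet used is an assumption.

```python
import re


def next_suffix(root, existing_names):
    """Smallest positive integer N for which root.outN is not already present."""
    pattern = re.compile(re.escape(root) + r"\.out(\d+)$")
    used = set()
    for name in existing_names:
        match = pattern.match(name)
        if match:
            used.add(int(match.group(1)))
    n = 1
    while n in used:
        n += 1
    return n


first = next_suffix("demo", [])
second = next_suffix("demo", ["demo.out1", "demo.mrkRes1"])
```

In practice existing_names could come from os.listdir on the directory where the job is run.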
Analysis Options
Many calculations are time consuming:
Computing window variance
Validating predictive accuracy in test data
Computing PEV and R2
These are only done in some iterations according to the outputFrequency option
Default is 100, so these calculations occur in 1% of iterations
Markov chains use many random numbers
The seed option (default 1234) can be used to alter sampling
analysisType Print
This can be used to get a printout of the X matrix, ordered by map position if a map file is used, for just those animals in both the genotype and phenotype files
The output contains the covariates on a 0, 1, 2 scale, before centering, not on the -10, 0, 10 scale used in the marker genotype file
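The relationship between the two scales can be written as a one-line recoding. This sketch assumes a simple linear mapping in which -10, 0 and 10 become 0, 1 and 2; to_allele_count is a hypothetical name, and the slides do not state the formula GenSel uses internally.

```python
def to_allele_count(code):
    """Map the -10/0/10 genotype coding onto the 0/1/2 covariate scale."""
    # Assumption: a linear rescaling, so -10 -> 0, 0 -> 1, 10 -> 2.
    return (code + 10) / 10


row = [-10, 0, 10, 10, 0]  # genotypes of ISU_Angus_1 in the example file
counts = [to_allele_count(g) for g in row]
```

Under the same mapping, a fractional code in the genotype file would give a fractional covariate.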
LD
analysisType LD
This computes the pairwise squared correlation between every pair of markers in the filtered genotype file
Also computes the minor allele frequencies (MAF)
The output file will be very large if you don’t filter it
Only squared correlations exceeding minLDoutput are stored
minLDoutput (default 0.1)
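The two quantities reported by the LD analysis can be sketched in plain Python. This illustrates the standard definitions (squared Pearson correlation and minor allele frequency) on 0/1/2 allele counts; the function names and toy columns are hypothetical, and GenSel's handling of monomorphic markers is not stated in the slides.

```python
def maf(counts):
    """Minor allele frequency from 0/1/2 allele counts."""
    p = sum(counts) / (2 * len(counts))
    return min(p, 1 - p)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # placeholder: correlation is undefined for a monomorphic marker
    return sxy * sxy / (sxx * syy)


MIN_LD_OUTPUT = 0.1  # default threshold from the slide
columns = {"m1": [0, 1, 2, 1], "m2": [2, 1, 0, 1], "m3": [1, 1, 1, 1]}
names = sorted(columns)
stored = {
    (a, b): r_squared(columns[a], columns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if r_squared(columns[a], columns[b]) > MIN_LD_OUTPUT
}
```

Only the m1/m2 pair is kept here, because m3 does not vary and so has no correlation with anything.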
StepWise
analysisType StepWise
Computes (unweighted) forward and reverse submodels after first fitting all the fixed effects
R2 is defined as the proportion of sums of squares after the fixed effects
Three options control the model:
inputMaxRsquared (default 0.8) will stop the analysis
inputMaxMarkers (default 100) will stop the analysis
alphaValue (default 0.05) controls significance
Bayes
analysisType Bayes
bayesType BayesB
Metropolis-Hastings
Gibbs Sampling
bayesType BayesA (Actually just BayesB with pi=0)
bayesType BayesC
bayesType BayesCPi (Actually BayesC but with pi estimation)
bayesType RBR (Robust Bayesian Regression)
Really BayesB but with estimation of pi, scale and (genetic) df
FindScale options (no, yes) or for BayesCPi (thruPi)
Bayes Priors
Priors and associated degrees of freedom are required for the genetic and residual variance
genVariance (default 1)
degreesFreedomEffectVar (default 4)
resVariance (default 1)
nuRes (default 10)
Better estimates of genVariance and resVariance should be used
From knowledge of heritability and phenotypic sd
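Starting values can be worked out from heritability and the phenotypic standard deviation using the usual variance partition. The sketch below assumes genVariance refers to the additive genetic variance of the trait and resVariance to the remainder; the function name and the numbers are purely illustrative.

```python
def variance_priors(heritability, phenotypic_sd):
    """Split phenotypic variance into genetic and residual parts."""
    phenotypic_var = phenotypic_sd ** 2
    gen_variance = heritability * phenotypic_var
    res_variance = phenotypic_var - gen_variance
    return gen_variance, res_variance


# Hypothetical trait: heritability 0.25, phenotypic sd 20
gen_variance, res_variance = variance_priors(heritability=0.25, phenotypic_sd=20.0)
```

These two numbers would then be supplied as genVariance and resVariance in place of the defaults of 1.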
Bayes Options
All analysisType Bayes jobs have extra options
burnIn is the number of iterations in the chain to discard
Probably doesn’t need to be very many (eg 1,000)
chainLength is the number of iterations in the chain
Typically use 41,000 or more (this includes burnIn)
Mixture models (BayesB, BayesC, BayesCPi, RBR) assume a fraction (1-pi) of markers have an effect and pi have 0 effect
Option is for example probFixed 0.95
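A little arithmetic makes these options concrete. The chain settings below are the values quoted on the slide; the marker count of 50,000 is a hypothetical figure chosen to resemble a bovine 50k panel.

```python
chain_length = 41_000  # includes burn-in
burn_in = 1_000
prob_fixed = 0.95      # pi: fraction of markers with zero effect
n_markers = 50_000     # hypothetical panel size

# Iterations that contribute to posterior summaries.
retained_iterations = chain_length - burn_in

# Expected number of markers with a nonzero effect in any one iteration.
expected_nonzero = (1 - prob_fixed) * n_markers
```

With these settings 40,000 iterations are retained and about 2,500 markers are expected to be in the model at a time.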
Bayes Options
BayesB (and therefore BayesA = B0) used to use a Metropolis-Hastings rather than a Gibbs sampler
MHG did 100 MH iterations
Our fast version used a different proposal distribution and required no more than 10 MH iterations
You can specify numMHIter
Long developed an alternative sampler that does not use MH
You select this option using numMHIter 0
It is faster: the same speed as BayesC
Bayes Options
The 1 Mb windows formed using a map file can be used to compute the variance of the window
This is turned on using windowBV yes
Note the number of markers in each window varies with SNP density along the genome (many markers for chrom unk)
This provides posterior distributions of windows so that the previous Permute and Bootstrap options are no longer needed or supported
In the absence of a map file, the columns in the genotype file are assumed to be consecutive, and the number of markers in a window is defined by the windowWidth option
The default is 5
Automatically get graphs of posteriors and table of variances
Note window Variances typically don’t sum to 100 due to nonzero covariances
Predict
analysisType Predict
markerSolFilename defines the name of a .mrkRes file from a previous training analysis
windowWidth defines the number of markers in a consecutive window from which the overlapping window variances are computed
windowBV yes will result in a file full of ghats with a row for each animal and a column for each overlapping window
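One way to picture the per-window ghats is as a sliding sum of covariate times marker effect. The sketch below is an interpretation, not GenSel's implementation: window_ghats is a hypothetical name, windows are assumed to advance one marker at a time, covariates are taken on the 0/1/2 scale without centering, and the effects are invented.

```python
def window_ghats(covariates, effects, window_width):
    """One value per sliding window: sum of covariate * effect over its markers."""
    # Assumption: consecutive windows overlap and start one marker apart.
    n_windows = len(effects) - window_width + 1
    return [
        sum(covariates[j] * effects[j] for j in range(start, start + window_width))
        for start in range(n_windows)
    ]


effects = [0.5, 0.0, -1.0, 0.25]  # hypothetical marker effects
animal = [2, 1, 0, 2]             # one animal's covariates
ghats = window_ghats(animal, effects, window_width=2)
```

Repeating this for every animal would give one row per animal and one column per window.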
GenerateData
Randomly chooses 1-probFixed proportion of loci to be QTL
Samples QTL effects and residual effects according to normal distributions with mean 0 and variance determined by varGenotypic and varResidual
Outputs the simulated genotypes and phenotypes
Phenotypes will be categorical if isCategorical yes with as many categories as specified by numCat (default 2)
Categories will be equal sizes unless specified by the option PortOfCat (eg 0.70:0.20) if numCat 3
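The category proportions can be derived from the two options as sketched here. The slide's example gives two values for three categories, so this sketch assumes PortOfCat lists the first numCat - 1 proportions and the last category takes the remainder; category_proportions is a hypothetical name.

```python
def category_proportions(num_cat=2, port_of_cat=None):
    """Return one proportion per category; equal sizes unless port_of_cat is given."""
    if port_of_cat is None:
        return [1 / num_cat] * num_cat
    given = [float(p) for p in port_of_cat.split(":")]
    # Assumption: the final category's share is implied by the others.
    if len(given) != num_cat - 1:
        raise ValueError(f"expected {num_cat - 1} proportions, found {len(given)}")
    return given + [1 - sum(given)]


default_split = category_proportions()
custom_split = category_proportions(num_cat=3, port_of_cat="0.70:0.20")
```

For the slide's example the third category is left with roughly 10% of the records.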
Validation
There are two options for validation
Validation can be done jointly with the training analysis
trainPhenotypeFileName
testPhenotypeFileName
If no testPhenotypeFileName, training data is used
This will produce ghat, PEV and R2 for validation animals
Validation can be done in a later session from training
This will produce ghat but no PEV or R2
All columns of phenotype file are copied into the ghat file to facilitate downstream analysis
Graphing Posteriors
Various posterior distributions will be output if desired using the key word plotPosteriors yes
Samples used in the graphs are in .mcmcSamples which can be produced without graphing if mcmcSamples yes
Requires that gnuplot is installed on the machine in a location accessible using the defined path
Categorical Options
All analysisType Bayes will do categorical analyses if the option isCategorical yes is used
Categories must start from 1, and be ordered without missing categories
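The coding rule for categorical traits is easy to verify before running an analysis. This is a minimal sketch with a hypothetical function name; it assumes the trait values have already been read as integers.

```python
def check_categories(values):
    """Return the number of categories, or raise if the codes are not 1..k."""
    observed = sorted(set(values))
    expected = list(range(1, len(observed) + 1))
    if observed != expected:
        raise ValueError(f"categories {observed} should be exactly {expected}")
    return len(observed)


n_categories = check_categories([1, 2, 2, 3, 1])
```

Codes that start at 0 or skip a value raise ValueError, so the file can be recoded first.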
Required Libraries
Many routines use matvec libraries
Most matrix and vector computations use Eigen3
GSL is no longer used
Boost is used (only for format statements)
Limited use of STL
Graphics options require gnuplot
Environment must include paths to gnuplot (/opt/local/bin)
R version
We are developing an R version that will allow you to run any or all of the options from R
Also allow you access to variables created during the analysis
Hope to allow you to replace existing procedures with your own for prototyping new methods or features
Planned Developments
Addition of partial least squares (PLS), Bayesian Lasso
Addition of further random factors beyond the genotypes
Using pedigree, genomic or identity variance-covariance matrices
Extension to multiple trait analysis
Implementation using CUDA graphics processors