GenSel4 August 2011

Upload: lily-clark

Post on 18-Dec-2015

GenSel4 August 2011

Command Line

GenSel can be run from the command line, for example gensel4 (provided the path is set appropriately)

GenSel can be run from the BIGS web interface

GenSel jobs can be submitted to the queue on the HPC using the bigscli command from the unix interface of BIGS

Usage: gensel input_file_name

nohup gensel input_file_name -s status_file_name

Genotypes & Phenotypes

Required for all analyses:

trainPhenotypeFileName

markerFileName

Used for analysis

Read by GenSel4

Genotype File Structure

Space delimited unix file (use dos2unix to convert)

Header row plus one row for each animal

Column for ID, then a column for each genotype

One header row: alphanumeric labels for each genotype/locus

One row for each animal: alphanumeric ID followed by all the genotypes

-10, 0 or 10 for AA, AB or BB (no support for missing genotypes)

Ordered by genomic location if no map file

Read in binary format (file names end in .newbin)

Text files are converted to binary in the first analysis

Must be same number of columns in every row
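
The structural rules above can be checked before a file is handed to GenSel. The following is a minimal sketch, not part of GenSel: the function name check_genotype_text is hypothetical, and it only tests the rules stated on this slide (one header row, an ID in column 1, the same number of columns in every row). Values are read as numbers without forcing them to be -10, 0 or 10.

```python
def check_genotype_text(text):
    """Parse space-delimited genotype text; return (marker_names, {id: genotypes}).

    Hypothetical helper: raises ValueError if a row has a different
    number of columns than the header row.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    markers = header[1:]  # the first header field labels the ID column
    animals = {}
    for lineno, ln in enumerate(lines[1:], start=2):
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(
                f"line {lineno}: expected {len(header)} columns, got {len(fields)}"
            )
        animals[fields[0]] = [float(v) for v in fields[1:]]
    return markers, animals


sample = """ISU_ID.bt isu_1 nadc_1 isu_2 isu_3 isu_4
ISU_Angus_1 -10 0 10 10 0
AAA00001 -10 10 -10 10 0
"""
markers, animals = check_genotype_text(sample)
print(len(markers), "markers,", len(animals), "animals")
```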

Example Genotype File

ISU_ID.bt      isu_1  nadc_1  isu_2  isu_3  isu_4
ISU_Angus_1    -10    0       10     10     0
AAA00001       -10    10      -10    10     0
Tag_number_ab  -10    -10     0      10     0
Casanova_bull  -10    10      6.5    -10    10

Disk requirements for 5,000 bovine 50k genotypes in text form are about 1 GB (the same file in binary format is typically half the size)

Species are designated by the first letters of genus and species: bt = Bos taurus; hs = Homo sapiens; oa = Ovis aries; ss = Sus scrofa, etc.

This will later provide functionality for species-specific genome browsing

Phenotype File Structure

Space delimited unix file

Separate phenotype file for each trait

Header row plus one row for each animal with phenotype

Alphanumeric animal ID must be in column 1

Trait value must be in column 2 (label in header)

Remainder of file is arbitrary but defines the model for the trait

Recommended to include at least a column of 1's for the mean

Columns are headed by alphanumeric labels; all rows must have the same number of columns

Columns headed by name ending in $ are class variables

Columns headed by other names are covariates

Columns ending in # are ignored

Column headed by rinverse specifies a weighted analysis
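
The header conventions above decide how each column enters the model. As a sketch of that reading (the function classify_columns and the role names are hypothetical, not GenSel identifiers), a header row could be sorted into roles like this:

```python
def classify_columns(header_line):
    """Assign a role to each phenotype-file column from its header label."""
    names = header_line.split()
    roles = {
        "id": names[0],      # column 1: alphanumeric animal ID
        "trait": names[1],   # column 2: trait value
        "class": [],         # labels ending in $
        "covariate": [],     # any other label
        "ignored": [],       # labels ending in #
        "weight": None,      # the rinverse column, if present
    }
    for name in names[2:]:
        if name == "rinverse":
            roles["weight"] = name
        elif name.endswith("#"):
            roles["ignored"].append(name)
        elif name.endswith("$"):
            roles["class"].append(name)
        else:
            roles["covariate"].append(name)
    return roles


roles = classify_columns("Animal IQ mean dob Sex$ Family# rinverse")
print(roles)
```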

Example Phenotype File

Animal   IQ   mean  dob  Sex$    Family#  rinverse
A_1      100  1     100  male    1        1.0
B_2      95   1     105  female  1        0.9
C.12345  103  1     97   spey    2        .95
Spot     110  1     90   male    2        1.1

rinverse is only proportional (scalar variance factored out)

Covariates must be numbers!

Categorical traits must be numbered from 1 upwards

Trait in column 2 (not required for prediction)

Sensible to at least have the mean

Model does not need to be full rank

GenSel matches IDs

Only records with the same alphanumeric ID in the genotype and phenotype file are available for subsequent analysis

Start of analysis reports the number of animals in the genotype file, phenotype file and matching records
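
The matching step amounts to intersecting the two ID lists. A minimal sketch follows; the function match_ids and the keys of the returned counts are hypothetical names chosen for illustration, not GenSel output labels.

```python
def match_ids(genotype_ids, phenotype_ids):
    """Return counts for both files and the phenotype IDs that also have genotypes."""
    geno = set(genotype_ids)
    matched = [animal for animal in phenotype_ids if animal in geno]
    counts = {
        "genotype": len(geno),
        "phenotype": len(set(phenotype_ids)),
        "matched": len(matched),
    }
    return counts, matched


counts, matched = match_ids(
    ["ISU_Angus_1", "AAA00001", "Casanova_bull"],
    ["AAA00001", "Spot", "Casanova_bull"],
)
print(counts, matched)
```

IDs are compared as exact strings here, so differences in case or stray whitespace would prevent a match.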

Genotypes & Map Files

GenSel now supports the use of a map file

A map file provides chromosome and basepair position information for at least one build

Can support any number of builds

A map file may provide multiple aliases for marker names

Every marker name from the genotype file must exist somewhere in the map file

Additional marker names can be in the map file.

Map File Structure

Rs_num  Ss_num  ISU_ID  UMD_chr  UMD_pos    BTA_chr  BTA_pos
Rs_001  101     isu_1   1        100000     1        95123
1234    102     isu_2   2        1234567    2        1500000
5678    103     isu_9   2        987654321  2        10000000
910a    104     isu_5   X        0          PAR      2543
newS    newS    nadc_1  unk      0          unk      N/A

Space delimited unix text file

Map File Options

The minimum requirements are:

mapFileName

linkageMap (options depend upon your map file), e.g. UMD or BTA for the example on the previous page

This will result in columns of the genotype file being sorted into genomic order to facilitate formation of contiguous marker windows, which are formed automatically in 1 Mb sizes

Options include:

addMapInfoToMarkers yes

Results in chromosome and base pair position added to output

outputMarkerHeaderName (options are aliases in your map file)

Filtering Genotypes

4 methods to filter columns of the genotype file for analysis

Two approaches are always available: includeFileName or excludeFileName

These files contain a list of marker names as in the genotype file header that are to be included or excluded

Include takes precedence over exclude

Two other approaches are available if a map file is used: windowIncFileName or windowExclFileName

List of chromosome_names to include/exclude entire chromosome

List of chromosome_name start_bp end_bp
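
For the two name-based filters, one way to read "include takes precedence over exclude" is sketched below. This is an assumption about the behaviour, not GenSel's code: when an include list is supplied only those markers are kept and any exclude list is disregarded, otherwise the excluded markers are dropped. The function filter_markers is a hypothetical name.

```python
def filter_markers(header_markers, include=None, exclude=None):
    """Filter marker names, keeping the column order of the genotype header."""
    if include:
        keep = set(include)
        # Include wins: a marker on both lists is still kept.
        return [m for m in header_markers if m in keep]
    drop = set(exclude or [])
    return [m for m in header_markers if m not in drop]


header = ["isu_1", "nadc_1", "isu_2", "isu_3", "isu_4"]
print(filter_markers(header, exclude=["nadc_1"]))
print(filter_markers(header, include=["isu_2", "nadc_1"], exclude=["nadc_1"]))
```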

Map files & SNP names

Sometimes the genotype file uses one marker name (eg database numeric ID), but the marker output file would benefit from having a different name (eg rs number)

Given a map file, Predict can cross reference the different marker names so you can exchange marker results (.mrkRes) files with other users

Output File Name Conventions

Suppose GenSel is run using gensel4 demo.inp

The root for all output files will be “demo”

All options will produce output to demo.out# where # is the next available integer not already used

The first run produces demo.out1, the next demo.out2 etc

Most other options produce additional files that will have the same root name and the same suffix number as the .out file

demo.LD1, demo.mrkRes1, demo.ghat1, demo.winVar1 etc
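
The numbering rule can be previewed before a run. The sketch below assumes "next available" means the smallest positive integer with no existing .out file for that root; the slide does not say whether gaps are reused, so treat that as an assumption. The function next_output_suffix is a hypothetical helper, not part of GenSel.

```python
import os
import re
import tempfile


def next_output_suffix(input_file, directory="."):
    """Return (root, n) where root.out<n> is the next unused output name."""
    root = os.path.splitext(os.path.basename(input_file))[0]
    pattern = re.compile(re.escape(root) + r"\.out(\d+)$")
    used = set()
    for name in os.listdir(directory):
        found = pattern.match(name)
        if found:
            used.add(int(found.group(1)))
    n = 1
    while n in used:
        n += 1
    return root, n


# Demonstration in a scratch directory that already holds two earlier runs.
with tempfile.TemporaryDirectory() as scratch:
    for earlier in ("demo.out1", "demo.out2", "demo.mrkRes1"):
        open(os.path.join(scratch, earlier), "w").close()
    result = next_output_suffix("demo.inp", scratch)
print(result)
```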

Analysis Options

Many calculations are time consuming:

Computing window variance

Validating predictive accuracy in test data

Computing PEV and R2

These are only done in some iterations according to the outputFrequency option

Default is 100, so these calculations occur in 1% of iterations

Markov chains use many random numbers

The seed option (default 1234) can be used to alter sampling

Print

analysisType Print

This can be used to get a printout of the X matrix, ordered by map position if a map file is used, for just those animals in the genotype and phenotype file

The output contains the covariates on a 0, 1, 2 scale, before centering, not on the -10, 0 , 10 scale used in the marker genotype file
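
The relation between the two scales can be written as a one-line recoding. The sketch assumes the mapping is linear, so that -10, 0 and 10 become 0, 1 and 2; the slide names the two scales but does not spell out the formula, and to_covariate is a hypothetical name.

```python
def to_covariate(genotype_code):
    """Map a genotype code on the -10..10 scale to the 0..2 covariate scale."""
    return (genotype_code + 10) / 10


for code in (-10, 0, 10):
    print(code, "->", to_covariate(code))
```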

LD

analysisType LD

This computes the pairwise squared correlation between every pair of markers in the filtered genotype file

Also computes the minor allele frequencies (MAF)

The output file will be very large if you don't filter it

Only squared correlations exceeding minLDoutput are stored

minLDoutput (default 0.1)
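
The quantities on this slide can be illustrated in a few lines. This is a sketch of the standard definitions, not GenSel's implementation: the function names are hypothetical, genotypes are assumed to be allele counts on the 0, 1, 2 scale, and a marker with no variation is given a squared correlation of 0 here because the correlation is undefined.

```python
from itertools import combinations
from statistics import mean


def squared_correlation(x, y):
    """Squared Pearson correlation between two genotype columns."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: treated as 0 in this sketch
    return sxy * sxy / (sxx * syy)


def minor_allele_frequency(counts):
    """MAF from allele counts (0, 1 or 2 copies per animal)."""
    p = sum(counts) / (2 * len(counts))
    return min(p, 1 - p)


def ld_pairs(genotypes, min_ld_output=0.1):
    """Keep only marker pairs whose squared correlation exceeds the threshold."""
    kept = {}
    for (name1, g1), (name2, g2) in combinations(genotypes.items(), 2):
        r2 = squared_correlation(g1, g2)
        if r2 > min_ld_output:
            kept[(name1, name2)] = r2
    return kept


genotypes = {"m1": [0, 1, 2, 2], "m2": [0, 1, 2, 2], "m3": [1, 1, 1, 1]}
ld = ld_pairs(genotypes)
print(ld, minor_allele_frequency(genotypes["m1"]))
```

With m markers there are m(m-1)/2 pairs, which is why the threshold matters for output size.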

StepWise

analysisType StepWise

Computes (unweighted) forward and reverse submodels after first fitting all the fixed effects

R2 is defined as the proportion of sums of squares after the fixed effects

Three options control the model:

inputMaxRsquared (default 0.8) will stop the analysis

inputMaxMarkers (default 100) will stop the analysis

alphaValue (default 0.05) controls significance

Bayes

analysisType Bayes

bayesType BayesB

Metropolis-Hastings

Gibbs Sampling

bayesType BayesA (actually just BayesB with pi = 0)

bayesType BayesC

bayesType BayesCPi (Actually BayesC but with pi estimation)

bayesType RBR (Robust Bayesian Regression) Really Bayes B but with pi, Scale and df (genetic) estimation

FindScale options (no, yes) or for BayesCPi (thruPi)

Bayes Priors

Priors and associated degrees of freedom are required for the genetic and residual variance

genVariance (default 1)

degreesFreedomEffectVar (default 4)

resVariance (default 1)

nuRes (default 10)

Better estimates of genVariance and resVariance should be used

From knowledge of heritability and phenotypic sd
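
One common way to turn a heritability and a phenotypic standard deviation into starting values is to split the phenotypic variance, as sketched below. The function variance_priors is a hypothetical helper, and the split assumes the phenotypic variance is simply the sum of genetic and residual variance.

```python
def variance_priors(heritability, phenotypic_sd):
    """Return (genetic variance, residual variance) from h2 and phenotypic sd."""
    phenotypic_variance = phenotypic_sd ** 2
    genetic = heritability * phenotypic_variance
    residual = phenotypic_variance - genetic
    return genetic, residual


# Example: h2 = 0.25 and a phenotypic sd of 20 units.
gen_variance, res_variance = variance_priors(0.25, 20.0)
print("genVariance", gen_variance)
print("resVariance", res_variance)
```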

Bayes Options

All analysisType Bayes jobs have extra options

burnIn is the number of iterations in the chain to discard

Probably doesn’t need to be very many (eg 1,000)

chainLength is the number of iterations in the chain

Typically use 41,000 or more (this includes burnIn)

Mixture models (BayesB, BayesC, BayesCPi, RBR) assume a fraction (1-pi) of markers have an effect and pi have 0 effect

Option is for example probFixed 0.95
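
A short worked example shows what these settings imply. It is plain arithmetic on the values quoted above; chain_summary is a hypothetical helper, and the 50,000 markers is an illustrative figure taken from the 50k genotypes mentioned earlier.

```python
def chain_summary(chain_length, burn_in, prob_fixed, num_markers):
    """Return (iterations kept after burn-in, prior expected number of markers with an effect)."""
    retained = chain_length - burn_in
    expected_with_effect = (1 - prob_fixed) * num_markers
    return retained, expected_with_effect


retained, expected_with_effect = chain_summary(41000, 1000, 0.95, 50000)
print(retained, "iterations kept after burn-in")
print(round(expected_with_effect), "markers expected to have an effect a priori")
```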

Bayes Options

BayesB (and therefore BayesA = B0) used to use a Metropolis-Hastings rather than a Gibbs sampler

MHG (Meuwissen, Hayes and Goddard) did 100 MH iterations

Our fast version used a different proposal distribution and required no more than 10 MH iterations

You can specify numMHIter

Long developed an alternative sampler that does not use MH

You select this option using numMHIter 0

It is faster, running at the same speed as BayesC

Bayes Options

The 1 Mb windows formed using a map file can be used to compute the variance of the window

This is turned on using windowBV yes

Note the number of markers in each window varies with SNP density along the genome (many markers for chrom unk)

This provides posterior distributions of windows so that the previous Permute and Bootstrap options are no longer needed or supported

In the absence of a map file, the columns in the genotype file are assumed to be consecutive, and the number of markers in a window is defined by the windowWidth option

The default is 5

Automatically get graphs of posteriors and table of variances

Note: window variances typically don't sum to 100 because of nonzero covariances between windows

Predict

analysisType Predict

markerSolFilename defines the name of a .mrkRes file from a previous training analysis

windowWidth defines the number of markers in a consecutive window from which the overlapping window variances are computed

windowBV yes will result in a file full of ghats with a row for each animal and a column for each overlapping window

GenerateData

Randomly chooses 1-probFixed proportion of loci to be QTL

Samples QTL effects and residual effects according to normal distributions with mean 0 and variance determined by varGenotypic and varResidual

Outputs the simulated genotypes and phenotypes

Phenotypes will be categorical if isCategorical yes with as many categories as specified by numCat (default 2)

Categories will be of equal size unless proportions are specified with the option PortOfCat (e.g. 0.70:0.20 when numCat is 3)
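
The sampling steps described above can be sketched as follows. This is an illustration under stated assumptions, not GenSel's GenerateData routine: the function generate_data is hypothetical, the standard deviations of the QTL and residual effects are passed in directly (how GenSel derives them from varGenotypic and varResidual is not given here), genotypes are supplied rather than simulated, and the categorical step is omitted.

```python
import random


def generate_data(genotypes, prob_fixed, effect_sd, residual_sd, seed=1234):
    """Return (marker effects, phenotypes) for a list of genotype rows."""
    rng = random.Random(seed)
    num_markers = len(genotypes[0])
    effects = []
    for _ in range(num_markers):
        # Each locus is a QTL with probability 1 - prob_fixed.
        is_qtl = rng.random() >= prob_fixed
        effects.append(rng.gauss(0.0, effect_sd) if is_qtl else 0.0)
    phenotypes = []
    for row in genotypes:
        genetic_value = sum(g * a for g, a in zip(row, effects))
        phenotypes.append(genetic_value + rng.gauss(0.0, residual_sd))
    return effects, phenotypes


rows = [[0, 1, 2, 1], [2, 2, 0, 1], [1, 0, 1, 2]]
effects, phenotypes = generate_data(rows, prob_fixed=0.5, effect_sd=1.0, residual_sd=1.0)
print(effects)
print(phenotypes)
```

Fixing the seed makes the simulated data reproducible from run to run.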

Validation

There are two options for validation

Validation can be done jointly with the training analysis

trainPhenotypeFileName

testPhenotypeFileName

If no testPhenotypeFileName, training data is used

This will produce ghat, PEV and R2 for validation animals

Validation can be done in a later session from training

This will produce ghat but no PEV or R2

All columns of phenotype file are copied into the ghat file to facilitate downstream analysis

Graphing Posteriors

Various posterior distributions will be output if desired using the key word plotPosteriors yes

Samples used in the graphs are in .mcmcSamples which can be produced without graphing if mcmcSamples yes

Requires that gnuplot is installed on the machine in a location accessible using the defined path

Categorical Options

All analysisType Bayes will do categorical analyses if the option isCategorical yes is used

Categories must start from 1, and be ordered without missing categories

Required Libraries

Many routines use matvec libraries

Most matrix and vector computations use Eigen3

GSL is no longer used

Boost is used (only for format statements)

Limited use of STL

Graphics options require gnuplot

Environment must include paths to gnuplot (/opt/local/bin)

R version

We are developing an R version that will allow you to run any or all of the options from R

Also allow you access to variables created during the analysis

Hope to allow you to replace existing procedures with your own for prototyping new methods or features

Planned Developments

Addition of partial least squares (PLS), Bayesian Lasso

Addition of further random factors beyond the genotypes

Using pedigree, genomic or identity variance-covariance matrices

Extension to multiple trait analysis

Implementation using CUDA graphics processors