
October 2008 BMI Chair Talk © Brian S. Yandell 1

networking in biochemistry: building a mouse model of diabetes

Brian S. Yandell, UW-Madison
October 2008

www.stat.wisc.edu/~yandell

Real knowledge is to know the extent of one’s ignorance. Confucius (on a bench in Seattle)


outline

1. how did I get here?
2. what problems caught my eye?
3. what have I done, anyway?
4. how do I work in teams?
5. what challenges remain?


how did I get here?
• Biostatistics, School of Public Health, UC-Berkeley 1981
– RA/TA with EL Scott, J Neyman, CL Chiang, S Selvin
– PhD 1981
• non-parametric inference for hazard rates (Kjell A Doksum)
– Annals of Statistics (1983) 50 citations to date (2 in 2008)
• research evolution
– early career focus on survival analysis
– shift to non-parametric regression (1984-99)
– shift to statistical genomics (1991--)
• joined Biometry Program at UW-Madison in 1982
– attracted by chance to blend statistics, computing and biology
– valued balance of mathematical theory against practice
– enjoyed developing methodology driven by collaboration


Yandell “Lab” Projects
• Bayesian QTL Model Selection
– R software development (Whipple Neely)
– collaboration with UAB & Jackson Labs
– data analysis of SCD1, ins10
• meta-analysis for fine mapping Sorcs1
– Chr 19 QTL introgressed as congenic lines
– combined analysis across congenic lines to increase power
• QTL-based causal biochemical networks
– algorithm development (Elias Chaibub)
– data analysis with Christine Ferrara, Duke U


(figure: collaboration network)
Rosetta: Schadt, Zhang, Zhu
UAB: Allison, Yi
stat/hort: Yandell
BMI: Kendziorski, Broman, Craven
Jax: Churchill, von Smith
Duke: Newgaard, Ferrara
biochem: Attie, Keller, Zhu


Pareto diagram of QTL effects
(figure: additive effect against rank order of QTL, ranks 0 to 30; the five largest are marked as major QTL on linkage map; labels from left to right: major QTL, minor QTL, polygenes (modifiers))


problems of single QTL approach
• wrong model: biased view
– fool yourself: bad guess at locations, effects
– detect ghost QTL between linked loci
– miss epistasis completely
• low power
• bad science
– use best tools for the job
– maximize scarce research resources
– leverage already big investment in experiment


advantages of multiple QTL approach
• improve statistical power, precision
– increase number of QTL detected
– better estimates of loci: less bias, smaller intervals
• improve inference of complex genetic architecture
– patterns and individual elements of epistasis
– appropriate estimates of means, variances, covariances
• asymptotically unbiased, efficient
– assess relative contributions of different QTL
• improve estimates of genotypic values
– less bias (more accurate) and smaller variance (more precise)
– mean squared error = MSE = (bias)^2 + variance
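
The MSE identity on this slide can be checked numerically. The sketch below is only an illustration in Python: the shrinkage estimator, the true value 2.0, and the noise SD are hypothetical choices, not quantities from the talk.

```python
import random
import statistics

def mse_decomposition(estimates, truth):
    """Return (mse, bias, variance) for repeated estimates of a known truth."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - truth
    # Population variance (divide by N) so that the identity holds exactly.
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return mse, bias, variance

rng = random.Random(1)
truth = 2.0
# Hypothetical estimator: shrinks toward zero (biased) and adds noise.
estimates = [0.8 * truth + rng.gauss(0.0, 0.5) for _ in range(10_000)]
mse, bias, variance = mse_decomposition(estimates, truth)
print(f"MSE = {mse:.4f}, bias^2 + variance = {bias ** 2 + variance:.4f}")
```

The two printed numbers agree up to rounding, which is why an estimator with slightly more bias can still win if its variance drops enough.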


QTL mapping idea
• observe phenotype y, marker genotypes m
• genetic architecture γ identifies model
– number and location of QTL
– gene action and epistasis (pairwise interactions)
• missing data: genotypes q at loci λ may be unknown
– pr(q | m, λ, γ)
– form of genotype model well known
• phenotype y depends on genotype q
– pr(y | q, µ, γ)
– often linear model in q
– possible interactions among QTL (epistasis)


how does phenotype y improve guess of QTL genotypes q?
(figure: bp, scale 90 to 120, by genotype at flanking markers D4Mit41 and D4Mit214: AA/AA, AB/AA, AA/AB, AB/AB)
what are probabilities for genotype q between markers?
recombinants AA:AB
all 1:1 if ignore y
and if we use y?
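
The question on this slide can be worked through with a small calculation. The sketch below assumes a backcross (genotypes AA or AB), no crossover interference (Haldane map function), and normally distributed phenotypes; the marker distances, genotype means, and SD are hypothetical values chosen for illustration.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prior_genotype(m_left, m_right, d_left, d_right):
    """pr(q | m) in a backcross for a locus between two flanking markers."""
    r1, r2 = haldane_r(d_left), haldane_r(d_right)
    weights = {}
    for q in ("AA", "AB"):
        p_left = (1 - r1) if q == m_left else r1
        p_right = (1 - r2) if q == m_right else r2
        weights[q] = p_left * p_right
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}

def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_genotype(y, prior, means, sd):
    """pr(q | y, m) is proportional to pr(y | q) * pr(q | m)."""
    weights = {q: prior[q] * normal_pdf(y, means[q], sd) for q in prior}
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}

# Recombinant individual: AA at the left marker, AB at the right marker,
# with the putative QTL midway between them (5 cM from each).
prior = prior_genotype("AA", "AB", d_left=5.0, d_right=5.0)
means = {"AA": 100.0, "AB": 110.0}  # hypothetical genotype means
post = posterior_genotype(y=112.0, prior=prior, means=means, sd=5.0)
print("ignoring y:", prior)
print("using y:   ", post)
```

Ignoring y, the recombinant is 1:1 for AA:AB at the midpoint; a phenotype close to the AB mean shifts most of the probability to AB.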


Gibbs sampler for loci indicators

• QTL at pseudomarkers
• loci indicators γ
– γ = 1 if QTL present
– γ = 0 if no QTL present
• Gibbs sampler on loci indicators
– relatively easy to incorporate epistasis
– Yi et al. (2005, 2007 Genetics)
• (earlier work of Yi, Ina Hoeschele)

$$\mu_q = \beta_0 + \gamma_1\beta_1(q_1) + \gamma_2\beta_2(q_2) + \gamma_{12}\beta_{12}(q_1,q_2), \qquad \gamma_k = 0, 1$$
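
To make the idea of sampling loci indicators concrete, here is a minimal Python sketch. It is not the R/qtlbim algorithm: it assumes fully observed genotypes coded -1/+1, main effects only, a known error variance, a normal prior on each effect, and simulated data. All names and settings are hypothetical.

```python
import math
import random

def gibbs_indicators(y, X, n_iter=1000, burn_in=200, prior_incl=0.5,
                     tau2=1.0, sigma2=1.0, seed=0):
    """Gibbs sampler for indicators gamma_k in
    y_i = sum_k gamma_k * beta_k * X[i][k] + e_i, e_i ~ N(0, sigma2),
    with beta_k ~ N(0, tau2) and gamma_k ~ Bernoulli(prior_incl).
    Returns the posterior inclusion frequency of each locus."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    beta = [0.0] * p      # beta_k is 0 whenever gamma_k is 0
    resid = list(y)       # residuals given the current effects
    xx = [sum(X[i][k] ** 2 for i in range(n)) for k in range(p)]
    counts = [0] * p
    for it in range(n_iter):
        for k in range(p):
            # Remove locus k from the fit, leaving partial residuals.
            for i in range(n):
                resid[i] += beta[k] * X[i][k]
            xr = sum(X[i][k] * resid[i] for i in range(n))
            v = 1.0 / (xx[k] / sigma2 + 1.0 / tau2)  # conditional variance of beta_k
            m = v * xr / sigma2                      # conditional mean of beta_k
            # Log Bayes factor for gamma_k = 1 vs 0 with beta_k integrated out.
            log_bf = 0.5 * math.log(v / tau2) + 0.5 * m * m / v
            log_odds = math.log(prior_incl / (1.0 - prior_incl)) + log_bf
            if log_odds >= 0:
                p_incl = 1.0 / (1.0 + math.exp(-log_odds))
            else:
                p_incl = math.exp(log_odds) / (1.0 + math.exp(log_odds))
            if rng.random() < p_incl:
                beta[k] = rng.gauss(m, math.sqrt(v))
                if it >= burn_in:
                    counts[k] += 1
            else:
                beta[k] = 0.0
            # Put locus k back into the fit with its updated effect.
            for i in range(n):
                resid[i] -= beta[k] * X[i][k]
    return [c / (n_iter - burn_in) for c in counts]

# Simulated example: four unlinked loci, only the first affects the trait.
rng = random.Random(42)
X = [[rng.choice((-1.0, 1.0)) for _ in range(4)] for _ in range(100)]
y = [1.0 * row[0] + rng.gauss(0.0, 1.0) for row in X]
incl = gibbs_indicators(y, X)
print([round(f, 2) for f in incl])
```

Each indicator is drawn from its full conditional with the effect integrated out, so the model dimension never changes and no reversible-jump step is needed. Epistasis would add indicator/effect pairs for interaction columns in the same way.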


likelihood and posterior

• likelihood relates “known” data (y,m,q) to unknown values of interest (µ,λ,γ)
– pr(y,q|m,µ,λ,γ) = pr(y|q,µ,γ) pr(q|m,λ,γ)
– mix over unknown genotypes (q)
• posterior turns likelihood into a distribution
– weight likelihood by priors
– rescale to sum to 1.0
– posterior = likelihood * prior / constant


Bayes theorem for QTLs

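posterior = likelihood * prior / constant

posterior for (q, µ, λ, γ) = phenotype likelihood * [prior for (q, µ, λ, γ)] / constant

$$\mathrm{pr}(q,\mu,\lambda,\gamma \mid y,m) = \frac{\mathrm{pr}(y \mid q,\mu,\gamma)\,[\mathrm{pr}(q \mid m,\lambda,\gamma)\,\mathrm{pr}(\mu \mid \gamma)\,\mathrm{pr}(\lambda \mid m,\gamma)\,\mathrm{pr}(\gamma)]}{\mathrm{pr}(y \mid m)}$$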


why use a Bayesian approach?
• first, do both classical and Bayesian
– always nice to have a separate validation
– each approach has its strengths and weaknesses
• classical approach works quite well
– selects large effect QTL easily
– directly builds on regression ideas for model selection
• Bayesian approach is comprehensive
– samples most probable genetic architectures
– formalizes model selection within one framework
– readily (!) extends to more complicated problems


Markov chain sampling
• construct Markov chain around posterior
– posterior is stable distribution of Markov chain
– use MC samples to estimate posterior
• sample QTL model unknowns from full conditionals
– update unknowns one at a time or in batches

$$(q,\mu,\lambda,\gamma) \sim \mathrm{pr}(q,\mu,\lambda,\gamma \mid y,m)$$

$$(q,\mu,\lambda,\gamma)_1 \to (q,\mu,\lambda,\gamma)_2 \to \cdots \to (q,\mu,\lambda,\gamma)_N$$


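Bayes posterior vs. maximum likelihood
• LOD: classical Log ODds
– maximize likelihood over effects µ
– R/qtl scanone/scantwo: method = “em”
• LPD: Bayesian Log Posterior Density
– average posterior over effects µ
– R/qtl scanone/scantwo: method = “imp”

$$\mathrm{LOD}(\lambda) = \log_{10}\{\max_{\mu}\,\mathrm{pr}(y \mid m,\mu,\lambda)\} + c$$

$$\mathrm{LPD}(\lambda) = \log_{10}\{\mathrm{pr}(\lambda \mid m) \int \mathrm{pr}(y \mid m,\mu,\lambda)\,\mathrm{pr}(\mu)\,d\mu\} + C$$

likelihood mixes over missing QTL genotypes:

$$\mathrm{pr}(y \mid m,\mu,\lambda) = \sum_q \mathrm{pr}(y \mid q,\mu)\,\mathrm{pr}(q \mid m,\lambda)$$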
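
The difference between maximizing and averaging over effects can be seen at a single locus. The Python sketch below is a simplification, not what R/qtl computes: genotypes are assumed fully observed (coded -1/+1), the error SD is known, the effect prior is normal with a hypothetical SD, the integral is a Riemann sum on a grid, and both scores are expressed relative to the no-QTL model.

```python
import math
import random

def log_lik(y, x, beta, sigma=1.0):
    """Normal log-likelihood of y = beta * x + e with known error SD sigma."""
    rss = sum((yi - beta * xi) ** 2 for yi, xi in zip(y, x))
    return -0.5 * rss / sigma ** 2 - len(y) * math.log(sigma * math.sqrt(2 * math.pi))

def lod_and_lpd(y, x, prior_sd=1.0):
    """LOD-like and LPD-like scores at one locus, relative to beta = 0."""
    grid = [i / 100.0 for i in range(-300, 301)]  # candidate effects
    step = 0.01
    ll0 = log_lik(y, x, 0.0)
    # Likelihood ratios against the no-QTL model at each grid point.
    ratios = [math.exp(log_lik(y, x, b) - ll0) for b in grid]
    prior = [math.exp(-0.5 * (b / prior_sd) ** 2) / (prior_sd * math.sqrt(2 * math.pi))
             for b in grid]
    lod = math.log10(max(ratios))                                       # maximize over the effect
    lpd = math.log10(sum(r * p for r, p in zip(ratios, prior)) * step)  # average over the prior
    return lod, lpd

rng = random.Random(3)
x = [rng.choice((-1.0, 1.0)) for _ in range(100)]
y = [0.8 * xi + rng.gauss(0.0, 1.0) for xi in x]
lod, lpd = lod_and_lpd(y, x)
print(f"LOD = {lod:.2f}, LPD = {lpd:.2f}")
```

Averaging can never exceed the maximum, so on this common scale the LPD-like score sits below the LOD-like score; the gap acts as a penalty for uncertainty in the effect.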


LOD & LPD: 1 QTL
n.ind = 100, 10 cM marker spacing


marginal LOD or LPD
• what is contribution of a QTL adjusting for all others?
– improvement in LPD due to QTL at locus λ
– contribution due to main effects, epistasis, GxE?
• how does adjusted LPD differ from unadjusted LPD?
– raised by removing variance due to unlinked QTL
– raised or lowered due to bias of linked QTL
– analogous to Type III adjusted ANOVA tests
• can ask these same questions using classical LOD
– see Broman’s newer tools for multiple QTL inference


1-QTL LOD vs. marginal LPD

1-QTL LOD


hyper data: scanone


what is best estimate of QTL?
• find most probable pattern
– 1,4,6,15,6:15 has posterior of 3.4%
• estimate locus across all nested patterns
– Exact pattern seen ~100/3000 samples
– Nested pattern seen ~2000/3000 samples
• estimate 95% confidence interval using quantiles

> best <- qb.best(qbHyper)
> summary(best)$best
    chrom locus locus.LCL locus.UCL     n.qtl
247     1  69.9  24.44875   95.7985 0.8026667
245     4  29.5  14.20000   74.3000 0.8800000
248     6  59.0  13.83333   66.7000 0.7096667
246    15  19.5  13.10000   55.7000 0.8450000
> plot(best)

Manichaikul et al. 2008 Genetics (in review)


what patterns are “near” the best?

• size & shade ~ posterior
• distance between patterns
– sum of squared attenuation
– match loci between patterns
– squared attenuation = (1-2r)^2
– sq.atten in scale of LOD & LPD
• multidimensional scaling
– MDS projects distance onto 2-D
– think mileage between cities
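
Squared attenuation is easy to compute once a map function is fixed. The Python sketch below assumes the Haldane map function (the slide does not say which one is used), under which (1-2r)^2 reduces to exp(-4d) for a distance d in Morgans.

```python
import math

def squared_attenuation(d_cm):
    """(1 - 2r)^2 for two loci d_cm centimorgans apart, Haldane map function."""
    r = 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))  # recombination fraction
    return (1.0 - 2.0 * r) ** 2

for d in (0, 10, 25, 50, 100):
    print(f"{d:>3} cM: squared attenuation = {squared_attenuation(d):.3f}")
```

The value is 1 for coincident loci and decays toward 0 for unlinked loci, which matches how a LOD or LPD signal fades with distance from a QTL.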


Software for Bayesian QTLs
R/qtlbim: www.qtlbim.org
• Properties
– cross-compatible with R/qtl
– new MCMC algorithms
• Gibbs with loci indicators; no reversible jump
– epistasis, fixed & random covariates, GxE
– extensive graphics
• Software history
– initially designed (Satagopan, Yandell 1996)
– major revision and extension (Gaffney 2001)
– R/bim to CRAN (Wu, Gaffney, Jin, Yandell 2003)
– R/qtlbim to CRAN (Yi, Yandell et al. 2006)
• Publications
– Yi et al. (2005); Yandell et al. (2007); Yi et al. (2007ab)


(figure: glucose and insulin panels, courtesy AD Attie)
BTBR mouse is insulin resistant
B6 is not
make both obese…


studying diabetes in an F2
• mouse model: segregating panel from inbred lines
– B6.ob x BTBR.ob → F1 → F2
– selected mice with ob/ob alleles at leptin gene (Chr 6)
– sacrificed at 14 weeks, tissues preserved
• physiological study (Stoehr et al. 2000 Diabetes)
– mapped body weight, insulin, glucose at various ages
• gene expression studies
– RT-PCR for a few mRNA on 108 F2 mice liver tissues
• (Lan et al. 2003 Diabetes; Lan et al. 2003 Genetics)
– Affymetrix microarrays on 60 F2 mice liver tissues
• U47 A & B chips, RMA normalization
• design: selective phenotyping (Jin et al. 2004 Genetics)


log10(ins10), Chr 19
(figure legend: black = all, blue = male, red = female, purple = sex-adjusted; solid = 512 mice, dashed = 311 mice)


Sorcs1 study in mice:
11 sub-congenic strains
marker regression meta-analysis
within-strain permutations
Clee, Yandell et al., Nature Genetics 2006


we were lucky!
BTBR background needed to see SORCS1
epistatic interaction of chr 19 and 8
…discovered much later
(figure: interaction plot for D19Mit58 and D8Mit289; y-axis: logins10; x-axis: D8Mit289 genotype AA, AB, BB; one line per D19Mit58 genotype AA, AB, BB)
(figure: interaction plot for D19Mit58 and D17Mit180; y-axis: logins10; x-axis: D17Mit180 genotype AA, AB, BB; one line per D19Mit58 genotype AA, AB, BB)


Sorcs1 gene & SNPs


Sorcs1 study in humans

Goodarzi et al., Diabetes 2007


2M observations
30,000 traits
60 mice


experimental context
• B6 x BTBR obese mouse cross
– model for diabetes and obesity
– 500+ mice from intercross (F2)
– collaboration with Rosetta/Merck
• genotypes
– 5K SNP Affymetrix mouse chip
– care in curating genotypes! (map version, errors, …)
• phenotypes
– clinical phenotypes (>100 / mouse)
– gene expression traits (>40,000 / mouse / 4-6 tissues)
– other molecular traits (proteomic, miRNA, metabolomic)


QTL mapping thousands of gene expression traits
Lan, Chen et al., PLoS Genetics 2006


(figure legend: red = trans, blue = cis; QTLs on chr n; gray scale for variance)


Chaibub Neto et al. (2008) Genetics


causal phenotype networks
• goal: mimic biochemical pathways with directed (causal) networks
• problem: association (correlation) does not imply causation
• resolution: bring in driving causes
– genotypes (at conception)
– processes earlier in time


Causal vs Reactive? (Elias Chaibub, Brian Yandell)
y1 causes y2: y1 ~ g1 and y2 ~ g2*y1
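
Why genotypes help orient an edge can be illustrated with a simulation. The Python sketch below is not the Chaibub Neto and Yandell algorithm; it swaps in a simple partial-correlation check on hypothetical data in which y1 causes y2, g1 drives y1, and g2 drives y2 (additive effects only, no g2:y1 interaction).

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def partial_corr(a, b, given):
    """Correlation of a and b after adjusting both for `given`."""
    r_ab, r_ag, r_bg = corr(a, b), corr(a, given), corr(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))

rng = random.Random(7)
n = 2000
# F2-like genotype codes 0, 1, 2 in 1:2:1 proportions (hypothetical loci).
g1 = [rng.choice((0.0, 1.0, 1.0, 2.0)) for _ in range(n)]
g2 = [rng.choice((0.0, 1.0, 1.0, 2.0)) for _ in range(n)]
y1 = [a + rng.gauss(0.0, 1.0) for a in g1]                        # y1 ~ g1
y2 = [b + u + rng.gauss(0.0, 1.0) for b, u in zip(g2, y1)]        # y2 ~ g2 + y1

print("corr(g1, y2)      =", round(corr(g1, y2), 3))
print("pcor(g1, y2 | y1) =", round(partial_corr(g1, y2, y1), 3))
print("pcor(g1, y1 | y2) =", round(partial_corr(g1, y1, y2), 3))
```

Under the causal direction y1 to y2, the QTL for y1 is associated with y2 only through y1, so adjusting for y1 removes the association while adjusting for y2 does not. If y2 caused y1 instead, the pattern would flip.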


Ferrara et al.


inferring phenotype networks

• build in prior pathway knowledge (PPI, TF)
– co-map correlated traits
• Banerjee, Yandell, Yi (2008 Genetics)
– pathways induce correlation structure
• ramp up to 100s, 1000s of phenotypes?
– danger of mixing unrelated pathways
– want closely linked upstream (causal) drivers


(figure: collaboration network)
Rosetta: Schadt, Zhang, Zhu
UAB: Allison, Yi
stat/hort: Yandell
BMI: Kendziorski, Broman, Craven
Jax: Churchill, von Smith
Duke: Newgaard, Ferrara
biochem: Attie, Keller, Zhu


why build Web eQTL tools?

• common storage/maintenance of data
– one well-curated copy
– central repository
– reduce errors, ensure analysis on same data
• automate commonly used methods
– biologist gets immediate feedback
– statistician can focus on new methods
– codify standard choices


how does one build tools?
• no one solution for all situations
• use existing tools wherever possible
– new tools take time and care to build!
– downloaded databases must be updated regularly
• human component is key
– need informatics expertise
– need continual dialog with biologists
• build bridges (interfaces) between tools
– Web interface uses PHP
– commands are created dynamically for R
• continually rethink & redesign organization


steps in using Web tools

• user enters data on Web page
• PHP tool interprets user data
• PHP builds R script
• R run on script
– creates plots, summaries, warnings
• PHP grabs results & displays on page
• user examines, saves
• user modifies data and reruns
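
The "build R script" step can be sketched as a template filled from validated form fields. The talk's tools use PHP; the sketch below uses Python only for illustration, and the function name, file name pattern, output file, and step size are hypothetical. The R calls it emits (read.cross, calc.genoprob, sim.geno, scanone) are standard R/qtl functions.

```python
import re

ALLOWED_METHODS = {"em", "imp", "hk"}

def build_r_script(cross_file, trait, method="em"):
    """Return R source text for a one-dimensional R/qtl genome scan."""
    # Validate everything that came from the Web form before pasting it into code.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+\.csv", cross_file):
        raise ValueError("unexpected file name")
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_.]*", trait):
        raise ValueError("unexpected trait name")
    if method not in ALLOWED_METHODS:
        raise ValueError("unsupported method")
    lines = [
        "library(qtl)",
        f'cross <- read.cross(format = "csv", file = "{cross_file}")',
        "cross <- calc.genoprob(cross, step = 2)",
    ]
    if method == "imp":
        # Multiple imputation needs simulated genotypes.
        lines.append("cross <- sim.geno(cross, step = 2, n.draws = 16)")
    lines += [
        f'out <- scanone(cross, pheno.col = "{trait}", method = "{method}")',
        'png("lod.png")',
        "plot(out)",
        "dev.off()",
        "print(summary(out))",
    ]
    return "\n".join(lines) + "\n"

script = build_r_script("cross.csv", "ins10", method="imp")
print(script)
```

The generated text would then be handed to R in batch mode and the plot and summary read back for display. Whitelisting the inputs matters because the script is assembled from user-supplied strings.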


raw data or fancy results?

• raw data flexible but slow
– LOD profiles for 100 (1000) traits?
• fancy results from sophisticated analysis
– IM, MIM, BIM, MOM analysis
– too complicated to put in biologists’ hands?
• methods are unrefined, state-of-the-art research tools
• use of methods involves many subtle choices
– batch computation over weeks
• compute once, save, display many times


LOD profiles: many traits


1.5 LOD interval approximate 95% CI
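
A 1.5-LOD support interval can be read directly off a LOD profile. The Python sketch below is a simple grid version with a made-up profile; it widens the interval to the first grid position on each side where the LOD has dropped by at least 1.5 from the peak, which is a conservative choice and not necessarily what any particular package does.

```python
def lod_support_interval(positions, lods, drop=1.5):
    """Return (left, peak, right) positions of the LOD support interval."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak] - drop
    left = peak
    # Walk left until the profile is at or below the threshold (or the end).
    while left > 0 and lods[left] > threshold:
        left -= 1
    right = peak
    # Walk right in the same way.
    while right < len(lods) - 1 and lods[right] > threshold:
        right += 1
    return positions[left], positions[peak], positions[right]

positions = [0, 10, 20, 30, 40, 50]       # cM along one chromosome
lods = [0.5, 2.0, 4.0, 5.0, 3.8, 1.0]     # hypothetical LOD profile
print(lod_support_interval(positions, lods))
```

With the profile above, the peak is at 30 cM and the interval runs from 10 to 50 cM.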


(figure legend: red = trans, blue = cis; QTLs on chr n; gray scale for variance)


what challenges remain?
• from eQTL to candidate pathways
– statistical issues
• networks, correlated traits
• better model selection approaches
– biological evidence (Weiss 2007 Genetics)
• Mouse to human to mouse
• KOs, etc.
• upgrade informatics environment
– harden local code (R, Python, PHP, …)
– build on other high throughput systems
• Swertz, Jansen (2007); Stein (2008) Nat Rev Gen


many thanks
Karl Broman
Jackson Labs: Gary Churchill, Hao Wu, Randy von Smith
U AL Birmingham: David Allison, Nengjun Yi, Tapan Mehta, Samprit Banerjee, Ram Venkataraman, Daniel Shriner
Michael Newton, Hyuna Yang
Daniel Sorensen, Daniel Gianola
Liang Li
my students: Jaya Satagopan, Fei Zou, Patrick Gaffney, Chunfang Jin, Elias Chaibub Neto, W Whipple Neely, Jee Young Moon
Tom Osborn, David Butruille, Marcio Ferrera, Josh Udahl, Pablo Quijada
Alan Attie, Jonathan Stoehr, Hong Lan, Susie Clee, Jessica Byers, Mark Keller
USDA Hatch, NIH/NIDDK (Attie), NIH/R01 (Yi, Broman)
