October 2008 BMI Chair Talk © Brian S. Yandell 1
networking in biochemistry: building a mouse model of diabetes
Brian S. Yandell, UW-Madison
October 2008
www.stat.wisc.edu/~yandell
Real knowledge is to know the extent of one’s ignorance. Confucius (on a bench in Seattle)

outline
1. how did I get here?
2. what problems caught my eye?
3. what have I done, anyway?
4. how do I work in teams?
5. what challenges remain?

how did I get here?
• Biostatistics, School of Public Health, UC-Berkeley 1981
  – RA/TA with EL Scott, J Neyman, CL Chiang, S Selvin
  – PhD 1981
    • non-parametric inference for hazard rates (Kjell A Doksum)
  – Annals of Statistics (1983): 50 citations to date (2 in 2008)
• research evolution
  – early career focus on survival analysis
  – shift to non-parametric regression (1984-99)
  – shift to statistical genomics (1991--)
• joined Biometry Program at UW-Madison in 1982
  – attracted by chance to blend statistics, computing and biology
  – valued balance of mathematical theory against practice
  – enjoyed developing methodology driven by collaboration

Yandell “Lab” Projects
• Bayesian QTL Model Selection
  – R software development (Whipple Neely)
  – collaboration with UAB & Jackson Labs
  – data analysis of SCD1, ins10
• meta-analysis for fine mapping Sorcs1
  – Chr 19 QTL introgressed as congenic lines
  – combined analysis across lines to increase power
• QTL-based causal biochemical networks
  – algorithm development (Elias Chaibub)
  – data analysis with Christine Ferrara, Duke U

Rosetta: Schadt, Zhang, Zhu
UAB: Allison, Yi
stat/hort: Yandell
BMI: Kendziorski, Broman, Craven
Jax: Churchill, von Smith
Duke: Newgaard, Ferrara
biochem: Attie, Keller, Zhu

Pareto diagram of QTL effects
[figure: additive effect plotted against rank order of QTL; major QTL (numbered 1 to 5 on the linkage map), then minor QTL, then polygenes (modifiers)]

problems of single QTL approach
• wrong model: biased view
  – fool yourself: bad guess at locations, effects
  – detect ghost QTL between linked loci
  – miss epistasis completely
• low power
• bad science
  – use best tools for the job
  – maximize scarce research resources
  – leverage already big investment in experiment

advantages of multiple QTL approach
• improve statistical power, precision
  – increase number of QTL detected
  – better estimates of loci: less bias, smaller intervals
• improve inference of complex genetic architecture
  – patterns and individual elements of epistasis
  – appropriate estimates of means, variances, covariances
    • asymptotically unbiased, efficient
  – assess relative contributions of different QTL
• improve estimates of genotypic values
  – less bias (more accurate) and smaller variance (more precise)
  – mean squared error = MSE = (bias)^2 + variance
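
The decomposition MSE = (bias)^2 + variance can be checked numerically. The sketch below is only an illustration: the shrunken estimator, the true value and the sample sizes are made-up assumptions, not quantities from the talk.

```python
import random
import statistics


def mse_decomposition(estimates, truth):
    """Return (mse, bias_squared, variance) for repeated estimates of a known truth."""
    mean_est = statistics.fmean(estimates)
    bias_sq = (mean_est - truth) ** 2
    # Population variance (divide by n) so that the identity holds exactly.
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    return mse, bias_sq, variance


random.seed(1)
truth = 2.0  # hypothetical true genotypic value
# Hypothetical biased estimator: 0.8 times the mean of 20 noisy observations.
estimates = [
    0.8 * statistics.fmean([random.gauss(truth, 1.0) for _ in range(20)])
    for _ in range(2000)
]
mse, bias_sq, variance = mse_decomposition(estimates, truth)
print(f"MSE={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The shrinkage lowers variance at the cost of bias; the printed MSE is the sum of the two pieces.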

QTL mapping idea
• observe phenotype y, marker genotypes m
• genetic architecture γ identifies model
  – number and location of QTL
  – gene action and epistasis (pairwise interactions)
• missing data: genotypes q at loci λ may be unknown
  – pr(q | m, λ, γ)
  – form of genotype model well known
• phenotype y depends on genotype q
  – pr(y | q, µ, γ)
  – often linear model in q
  – possible interactions among QTL (epistasis)
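
The genotype model pr(q | m, λ) has a closed form in simple crosses. A minimal sketch, assuming a backcross (genotypes AA or AB), one putative QTL between two fully typed flanking markers, and the Haldane map function (no crossover interference). The function names and distances are illustrative, not taken from R/qtl or R/qtlbim.

```python
import math


def haldane_r(distance_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def transition(a, b, r):
    """Probability of genotype b at one locus given genotype a at a linked locus."""
    return 1.0 - r if a == b else r


def genotype_probs(left, right, d_left_cM, d_right_cM):
    """pr(q | flanking marker genotypes) for a backcross QTL between two markers."""
    r1 = haldane_r(d_left_cM)   # left marker to QTL
    r2 = haldane_r(d_right_cM)  # QTL to right marker
    weights = {
        q: transition(left, q, r1) * transition(q, right, r2)
        for q in ("AA", "AB")
    }
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}


# Non-recombinant flanking markers: the QTL genotype is nearly known.
print(genotype_probs("AA", "AA", 5.0, 5.0))
# Recombinant flanking markers, QTL at the midpoint: AA and AB are equally likely.
print(genotype_probs("AA", "AB", 5.0, 5.0))
```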

how does phenotype y improve guess of QTL genotypes q?
[figure: bp (roughly 90 to 120) plotted against genotype at markers D4Mit41 and D4Mit214, with classes AA/AA, AB/AA, AA/AB, AB/AB]
what are probabilities for genotype q between markers?
recombinants AA:AB
all 1:1 if ignore y
and if we use y?
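
One way to see the answer: weight the marker-based genotype probabilities by the phenotype density, pr(q | y, m) ∝ pr(y | q) pr(q | m). The sketch below assumes a normal phenotype model; the genotype means and standard deviation are invented for illustration and are not estimates from the data shown.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_genotype(y, prior, means, sd):
    """pr(q | y, m), proportional to pr(y | q) * pr(q | m)."""
    weights = {q: prior[q] * normal_pdf(y, means[q], sd) for q in prior}
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}


prior = {"AA": 0.5, "AB": 0.5}      # recombinant between markers: 1:1 if y is ignored
means = {"AA": 100.0, "AB": 110.0}  # hypothetical genotype means for bp
post = posterior_genotype(115.0, prior, means, sd=5.0)
print(post)  # a high phenotype shifts probability toward AB
```

An individual whose phenotype sits halfway between the two genotype means stays at 1:1; the further y is from that midpoint, the more the phenotype sharpens the guess.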

Gibbs sampler for loci indicators
• QTL at pseudomarkers
• loci indicators γ
  – γ = 1 if QTL present
  – γ = 0 if no QTL present
• Gibbs sampler on loci indicators
  – relatively easy to incorporate epistasis
  – Yi et al. (2005, 2007 Genetics)
    • (earlier work of Yi, Ina Hoeschele)

$$\mu_q = \beta_0 + \gamma_1 \beta_1(q_1) + \gamma_2 \beta_2(q_2) + \gamma_{12} \beta_{12}(q_1, q_2), \qquad \gamma_k = 0, 1$$

likelihood and posterior
• likelihood relates “known” data (y, m, q) to unknown values of interest (µ, λ, γ)
  – pr(y, q | m, µ, λ, γ) = pr(y | q, µ, γ) pr(q | m, λ, γ)
  – mix over unknown genotypes (q)
• posterior turns likelihood into a distribution
  – weight likelihood by priors
  – rescale to sum to 1.0
  – posterior = likelihood * prior / constant
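
Bayes theorem for QTLs

\begin{align*}
\text{posterior} &= \frac{\text{likelihood} \times \text{prior}}{\text{constant}} \\
\text{posterior for } q, \mu, \lambda, \gamma &= \frac{\text{phenotype likelihood} \times [\text{prior for } q, \mu, \lambda, \gamma]}{\text{constant}} \\
\operatorname{pr}(q, \mu, \lambda, \gamma \mid y, m) &= \frac{\operatorname{pr}(y \mid q, \mu, \gamma) \times [\operatorname{pr}(q \mid m, \lambda, \gamma)\, \operatorname{pr}(\mu \mid \gamma)\, \operatorname{pr}(\lambda \mid m, \gamma)\, \operatorname{pr}(\gamma)]}{\operatorname{pr}(y \mid m)}
\end{align*}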

why use a Bayesian approach?
• first, do both classical and Bayesian
  – always nice to have a separate validation
  – each approach has its strengths and weaknesses
• classical approach works quite well
  – selects large effect QTL easily
  – directly builds on regression ideas for model selection
• Bayesian approach is comprehensive
  – samples most probable genetic architectures
  – formalizes model selection within one framework
  – readily (!) extends to more complicated problems

Markov chain sampling
• construct Markov chain around posterior
  – posterior is stable distribution of Markov chain
  – use MC samples to estimate posterior
• sample QTL model unknowns from full conditionals
  – update unknowns one at a time or in batches

\begin{gather*}
(\lambda, q, \mu, \gamma) \sim \operatorname{pr}(\lambda, q, \mu, \gamma \mid y, m) \\
(\lambda, q, \mu, \gamma)_1 \to (\lambda, q, \mu, \gamma)_2 \to \cdots \to (\lambda, q, \mu, \gamma)_N
\end{gather*}
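
The idea of updating unknowns one at a time from their full conditionals can be shown on a toy target. The sketch below is not the QTL sampler: it assumes a standard bivariate normal with correlation rho, where each full conditional is itself normal, and uses the draws to estimate a feature of the target.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    draws = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # full conditional x | y
        y = rng.gauss(rho * x, cond_sd)  # full conditional y | x
        if step >= burn_in:
            draws.append((x, y))
    return draws


def sample_correlation(draws):
    """Pearson correlation of the sampled (x, y) pairs."""
    n = len(draws)
    mx = sum(x for x, _ in draws) / n
    my = sum(y for _, y in draws) / n
    sxy = sum((x - mx) * (y - my) for x, y in draws)
    sxx = sum((x - mx) ** 2 for x, _ in draws)
    syy = sum((y - my) ** 2 for _, y in draws)
    return sxy / (sxx * syy) ** 0.5


draws = gibbs_bivariate_normal(rho=0.6, n_samples=20000)
corr = sample_correlation(draws)
print(f"estimated correlation from the chain: {corr:.3f}")
```

Successive draws are dependent, but after burn-in their long-run distribution is the target, which is what lets MC samples estimate the posterior.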
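
Bayes posterior vs. maximum likelihood
• LOD: classical Log ODds
  – maximize likelihood over effects µ
  – R/qtl scanone/scantwo: method = "em"
• LPD: Bayesian Log Posterior Density
  – average posterior over effects µ
  – R/qtl scanone/scantwo: method = "imp"

\begin{align*}
\operatorname{LOD}(\lambda) &= \log_{10}\{\max_{\mu} \operatorname{pr}(y \mid m, \mu, \lambda)\} + c \\
\operatorname{LPD}(\lambda) &= \log_{10}\{\operatorname{pr}(\lambda \mid m) \int \operatorname{pr}(y \mid m, \mu, \lambda)\, \operatorname{pr}(\mu)\, d\mu\} + C
\end{align*}

likelihood mixes over missing QTL genotypes:

$$\operatorname{pr}(y \mid m, \mu, \lambda) = \sum_q \operatorname{pr}(y \mid q, \mu)\, \operatorname{pr}(q \mid m, \lambda)$$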
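
For intuition about the LOD, consider the special case where the locus sits on a fully typed marker, so no mixing over q is needed. Under a normal phenotype model, maximizing the likelihood over genotype means reduces to comparing residual sums of squares: LOD = (n/2) log10(RSS0/RSS1). The sketch below implements that marker-regression case only; it is not the EM interval mapping that scanone performs between markers, and the data are invented.

```python
import math


def lod_marker(phenotypes, genotypes):
    """LOD score at a fully typed marker under a normal model (marker regression)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)  # no-QTL model

    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    rss1 = 0.0  # one mean per genotype class
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss1 += sum((y - group_mean) ** 2 for y in ys)

    return (n / 2.0) * math.log10(rss0 / rss1)


# Hypothetical phenotypes for six backcross individuals.
y = [1.0, 1.2, 0.9, 2.1, 1.9, 2.0]
g = ["AA", "AA", "AA", "AB", "AB", "AB"]
lod = lod_marker(y, g)
print(f"LOD = {lod:.2f}")
```

The LPD replaces the maximization with an average over the prior on effects, which is why it needs sampling or numerical integration rather than a closed form like this one.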

LOD & LPD: 1 QTL
n.ind = 100, 10 cM marker spacing

marginal LOD or LPD
• what is contribution of a QTL adjusting for all others?
  – improvement in LPD due to QTL at locus λ
  – contribution due to main effects, epistasis, GxE?
• how does adjusted LPD differ from unadjusted LPD?
  – raised by removing variance due to unlinked QTL
  – raised or lowered due to bias of linked QTL
  – analogous to Type III adjusted ANOVA tests
• can ask these same questions using classical LOD
  – see Broman’s newer tools for multiple QTL inference

1-QTL LOD vs. marginal LPD
hyper data: scanone

what is best estimate of QTL?
• find most probable pattern
  – 1,4,6,15,6:15 has posterior of 3.4%
• estimate locus across all nested patterns
  – Exact pattern seen ~100/3000 samples
  – Nested pattern seen ~2000/3000 samples
• estimate 95% confidence interval using quantiles

> best <- qb.best(qbHyper)
> summary(best)$best
    chrom locus locus.LCL locus.UCL     n.qtl
247     1  69.9  24.44875   95.7985 0.8026667
245     4  29.5  14.20000   74.3000 0.8800000
248     6  59.0  13.83333   66.7000 0.7096667
246    15  19.5  13.10000   55.7000 0.8450000
> plot(best)

Manichaikul et al. 2008 Genetics (in review)
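
Interval estimates like locus.LCL and locus.UCL above can be obtained from the sampled loci by taking quantiles. A minimal sketch of that idea, using simulated stand-in samples rather than real qtlbim output; how qb.best computes its limits internally is not shown in the talk, so the index rule here is an assumption.

```python
import math
import random


def quantile_interval(samples, level=0.95):
    """Central interval from sorted samples, widened outward to whole order statistics."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = math.floor((1.0 - level) / 2.0 * (n - 1))
    hi_idx = math.ceil((1.0 + level) / 2.0 * (n - 1))
    return ordered[lo_idx], ordered[hi_idx]


rng = random.Random(42)
# Hypothetical MCMC draws of a QTL position (cM) on one chromosome.
locus_samples = [rng.gauss(30.0, 10.0) for _ in range(3000)]
lcl, ucl = quantile_interval(locus_samples)
print(f"locus interval: {lcl:.1f} to {ucl:.1f} cM")
```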

what patterns are “near” the best?
• size & shade ~ posterior
• distance between patterns
  – sum of squared attenuation
  – match loci between patterns
  – squared attenuation = (1-2r)^2
  – sq.atten in scale of LOD & LPD
• multidimensional scaling
  – MDS projects distance onto 2-D
  – think mileage between cities

Software for Bayesian QTLs
R/qtlbim: www.qtlbim.org
• Properties
  – cross-compatible with R/qtl
  – new MCMC algorithms
    • Gibbs with loci indicators; no reversible jump
  – epistasis, fixed & random covariates, GxE
  – extensive graphics
• Software history
  – initially designed (Satagopan, Yandell 1996)
  – major revision and extension (Gaffney 2001)
  – R/bim to CRAN (Wu, Gaffney, Jin, Yandell 2003)
  – R/qtlbim to CRAN (Yi, Yandell et al. 2006)
• Publications
  – Yi et al. (2005); Yandell et al. (2007); Yi et al. (2007ab)

[figure: glucose and insulin panels (courtesy AD Attie)]
BTBR mouse is insulin resistant
B6 is not
make both obese…

studying diabetes in an F2
• mouse model: segregating panel from inbred lines
  – B6.ob x BTBR.ob → F1 → F2
  – selected mice with ob/ob alleles at leptin gene (Chr 6)
  – sacrificed at 14 weeks, tissues preserved
• physiological study (Stoehr et al. 2000 Diabetes)
  – mapped body weight, insulin, glucose at various ages
• gene expression studies
  – RT-PCR for a few mRNA on 108 F2 mice liver tissues
    • (Lan et al. 2003 Diabetes; Lan et al. 2003 Genetics)
  – Affymetrix microarrays on 60 F2 mice liver tissues
    • U47 A & B chips, RMA normalization
    • design: selective phenotyping (Jin et al. 2004 Genetics)

log10(ins10), Chr 19
[figure legend: black = all, blue = male, red = female, purple = sex-adjusted; solid = 512 mice, dashed = 311 mice]

Sorcs1 study in mice:
11 sub-congenic strains
marker regression meta-analysis
within-strain permutations
Nature Genetics 2006, Clee, Yandell et al.

we were lucky!
BTBR background needed to see SORCS1
epistatic interaction of chr 19 and 8
…discovered much later
[figures: interaction plot for D19Mit58 and D8Mit289, and interaction plot for D19Mit58 and D17Mit180; each shows log ins10 against genotype (AA, AB, BB) at the second marker, with one line per D19Mit58 genotype]
Sorcs1 gene & SNPs

Sorcs1 study in humans
Diabetes 2007, Goodarzi et al.

2M observations
30,000 traits
60 mice

experimental context
• B6 x BTBR obese mouse cross
  – model for diabetes and obesity
  – 500+ mice from intercross (F2)
  – collaboration with Rosetta/Merck
• genotypes
  – 5K SNP Affymetrix mouse chip
  – care in curating genotypes! (map version, errors, …)
• phenotypes
  – clinical phenotypes (>100 / mouse)
  – gene expression traits (>40,000 / mouse / 4-6 tissues)
  – other molecular traits (proteomic, miRNA, metabolomic)

QTL mapping thousands of gene expression traits
PLoS Genetics 2006, Lan, Chen et al.

[figure legend: red = trans, blue = cis; QTLs on chr n; gray scale for variance]

Chaibub Neto et al. (2008) Genetics

causal phenotype networks
• goal: mimic biochemical pathways with directed (causal) networks
• problem: association (correlation) does not imply causation
• resolution: bring in driving causes
  – genotypes (at conception)
  – processes earlier in time

Causal vs Reactive? (Elias Chaibub, Brian Yandell)
y1 causes y2: y1 ~ g1 and y2 ~ g2*y1
Ferrara et al.

inferring phenotype networks
• build in prior pathway knowledge (PPI, TF)
  – co-map correlated traits
    • Banerjee, Yandell, Yi (2008 Genetics)
  – pathways induce correlation structure
• ramp up to 100s, 1000s of phenotypes?
  – danger of mixing unrelated pathways
  – want closely linked upstream (causal) drivers

Rosetta: Schadt, Zhang, Zhu
UAB: Allison, Yi
stat/hort: Yandell
BMI: Kendziorski, Broman, Craven
Jax: Churchill, von Smith
Duke: Newgaard, Ferrara
biochem: Attie, Keller, Zhu

why build Web eQTL tools?
• common storage/maintenance of data
  – one well-curated copy
  – central repository
  – reduce errors, ensure analysis on same data
• automate commonly used methods
  – biologist gets immediate feedback
  – statistician can focus on new methods
  – codify standard choices

how does one build tools?
• no one solution for all situations
• use existing tools wherever possible
  – new tools take time and care to build!
  – downloaded databases must be updated regularly
• human component is key
  – need informatics expertise
  – need continual dialog with biologists
• build bridges (interfaces) between tools
  – Web interface uses PHP
  – commands are created dynamically for R
• continually rethink & redesign organization

steps in using Web tools
• user enters data on Web page
• PHP tool interprets user data
• PHP builds R script
• R is run on the script
  – creates plots, summaries, warnings
• PHP grabs results & displays on page
• user examines, saves
• user modifies data and reruns
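
The "build an R script from user input" step might look roughly like the sketch below. It is written in Python as a stand-in for the PHP layer described above, and the function name, the chromosome whitelist and the output file name are all hypothetical. The generated R code uses R/qtl's bundled hyper data set with calc.genoprob and scanone; the script is only built as text here, not executed.

```python
# Hypothetical whitelist: mouse autosomes 1-19 plus X.
ALLOWED_CHROMOSOMES = {str(i) for i in range(1, 20)} | {"X"}
PLOT_FILE = "scan.png"  # fixed server-side name, never taken from user input


def build_r_script(chromosome, pheno_col=1):
    """Return R code (as text) for a one-dimensional R/qtl genome scan."""
    # Validate user input before it is pasted into code that R will run.
    if chromosome not in ALLOWED_CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chromosome!r}")
    if not isinstance(pheno_col, int) or pheno_col < 1:
        raise ValueError("pheno_col must be a positive integer")
    lines = [
        "library(qtl)",
        "data(hyper)",
        "cross <- calc.genoprob(hyper, step = 2)",
        f'out <- scanone(cross, chr = "{chromosome}", '
        f'pheno.col = {pheno_col}, method = "em")',
        f'png("{PLOT_FILE}")',
        "plot(out)",
        "dev.off()",
        "print(summary(out))",
    ]
    return "\n".join(lines) + "\n"


script = build_r_script("4")
print(script)
```

Checking input against a whitelist before generating commands matters in this design, since anything the user types ends up inside code that the server runs.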

raw data or fancy results?
• raw data flexible but slow
  – LOD profiles for 100 (1000) traits?
• fancy results from sophisticated analysis
  – IM, MIM, BIM, MOM analysis
  – too complicated to put in biologists’ hands?
    • methods are unrefined, state-of-art, research tools
    • use of methods involved many subtle choices
  – batch computation over weeks
    • compute once, save, display many times
LOD profiles: many traits

1.5 LOD interval: approximate 95% CI
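
A 1.5 LOD support interval can be read off a LOD profile by walking outward from the peak until the curve drops more than 1.5 below the maximum. The sketch below assumes a single chromosome scanned on a grid; the positions and LOD values are invented, and software may treat the interval endpoints differently.

```python
def lod_support_interval(positions, lods, drop=1.5):
    """Return (left, peak, right) positions where LOD stays within `drop` of the peak."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= threshold:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= threshold:
        right += 1
    return positions[left], positions[peak], positions[right]


positions = [0, 10, 20, 30, 40, 50, 60]        # cM, hypothetical grid
lods = [0.5, 1.2, 3.4, 4.6, 3.8, 2.0, 0.7]     # hypothetical LOD profile
interval = lod_support_interval(positions, lods)
print(interval)
```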

[figure legend: red = trans, blue = cis; QTLs on chr n; gray scale for variance]

what challenges remain?
• from eQTL to candidate pathways
  – statistical issues
    • networks, correlated traits
    • better model selection approaches
  – biological evidence (Weiss 2007 Genetics)
    • Mouse to human to mouse
    • KOs, etc.
• upgrade informatics environment
  – harden local code (R, Python, PHP, …)
  – build on other high throughput systems
    • Swertz, Jansen (2007); Stein (2008) Nat Rev Gen

many thanks
Karl Broman
Jackson Labs: Gary Churchill, Hao Wu, Randy von Smith
U AL Birmingham: David Allison, Nengjun Yi, Tapan Mehta, Samprit Banerjee, Ram Venkataraman, Daniel Shriner
Michael Newton, Hyuna Yang
Daniel Sorensen, Daniel Gianola
Liang Li
my students: Jaya Satagopan, Fei Zou, Patrick Gaffney, Chunfang Jin, Elias Chaibub Neto, W Whipple Neely, Jee Young Moon
USDA Hatch, NIH/NIDDK (Attie), NIH/R01 (Yi, Broman)
Tom Osborn, David Butruille, Marcio Ferrera, Josh Udahl, Pablo Quijada
Alan Attie, Jonathan Stoehr, Hong Lan, Susie Clee, Jessica Byers, Mark Keller