computational science & engineering seminar, 16 oct 2013

Post on 25-Jan-2015

  • 1. Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R
    Ben Bolker, McMaster University, Mathematics & Statistics and Biology
    15 October 2013
    Section bar: Definitions, Statistics, Computation, Software, Conclusions, References
    Footer: Ben Bolker, Mixed model software

  • 2. Outline
    1 Definitions and context
    2 Statistical challenges
    3 Computational challenges
    4 Software engineering
    5 Conclusions

  • 4. (Generalized) linear mixed models
    (G)LMMs: a statistical modeling framework incorporating:
    - Linear combinations of categorical and continuous predictors, and interactions
    - Response distributions in the exponential family (binomial, Poisson, and extensions)
    - Any smooth, monotonic link function (e.g. logistic, exponential models)
    - Flexible combinations of blocking factors (clustering; random effects)
    Applications in ecology, neurobiology, behaviour, epidemiology, real estate, ...

  • 8. Examples
    - ecology: survival, predation, etc. (experimental plots)
    - genomics: presence/absence of polymorphisms, gene expression (individuals)
    - educational assessment: student scores (students × teachers)
    - psychology/sensometrics: decisions, responses to stimuli (individuals)
    - epidemiology: disease prevalence (postal codes, provinces, countries)

  • 13. Technical definition
    $Y_i \sim \mathrm{Distr}(g^{-1}(\eta_i), \phi)$
      (conditional distribution of the response; $g^{-1}$: inverse link function; $\phi$: scale parameter)
    $\eta = X\beta + Zb$
      (linear predictor; $X\beta$: fixed effects; $Zb$: random effects)
    $b \sim \mathrm{MVN}(0, \Sigma(\theta))$
      (conditional modes; $\Sigma(\theta)$: variance-covariance matrix)

  • 15. Estimation
    Maximum likelihood estimation:
    $L(Y_i \mid \theta, \beta) = \int L(Y_i \mid \beta, b) \, L(b \mid \Sigma(\theta)) \, db$
      (left-hand side: likelihood; first factor: data | random effects; second factor: random effects)
    - deterministic, trading precision against computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) ...
    - Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)

  • 16. Estimation: example (McKeon et al., 2012)
    [Figure: log-odds of predation for the effects Added symbiont, Crab vs. Shrimp, and Symbiont, as estimated by GLM (fixed), GLM (pooled), PQL, Laplace, and AGQ.]

  • 17. Inference
    - Standard inferential tools: mostly asymptotic or uncontrolled approximations
    - Solutions are computational and/or Bayesian: parametric bootstrap, MCMC
    - Good news: different problems for small vs large data
    [Figure: inferred p value against true p value; panels H2S, Anoxia, Osm, Cu.]

  • 21. Problems of big data
    How big is big?
    - Airline data: 12G
    - (G)LMM works on moderately large problems, e.g. student evaluations (75K total, 3K students, 1K profs)
    - Fairly clever linear algebra
    Possible improvements?
    - Chunking/parallelization
    - Out-of-memory operation

  • 22. Sparse matrix algorithms
    - repeated decomposition of large, sparse matrices (especially Z)
    - fill-reducing permutation to improve sparsity pattern
    - further improvements possible: better matrix representation, parallelization?

  • 23. Bounded optimization
    - Parameterize variance-covariance matrix $\Sigma(\theta)$
    - Positive definite or only semi-definite?
    - Disadvantages of transforming to unconstrain (Pinheiro and Bates, 1996)
    - (Disadvantages of boundary solutions)
    [Figure: deviance profiles; panels "raw" and "log".]

  • 25. Language tradeoffs
    - high-level/convenience: R
    - low-level/performance: C++
    - new wave? Julia
    - multi-language friction: mostly escaped in R/C++ case, at the price of complexity

  • 26. Getting it right vs. getting it written
    - the curse of neophilia: Superiority
    - many versions: nlme, lme4(a,b,Eigen)
    "The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful ..." (Patrick Burns)

  • 27. Sociological issues
    - Wide user base: "As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004)
    - What if there is no good answer? "do no harm" vs. "better me than someone else"
    - Diagnostics and warning messages
    - End users
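
The Estimation slide (slide 15) lists the Laplace approximation and adaptive Gauss-Hermite quadrature as deterministic ways to evaluate the integral over the random effects. The following is a minimal sketch of that idea, not the code used by lme4 or any other package mentioned in the talk. It is written in Python rather than R, and it assumes the simplest possible case: one cluster of Bernoulli responses with a logit link, a fixed intercept `beta`, and a single random intercept b ~ N(0, sigma^2). All function names, the data `y`, and the parameter values are hypothetical and chosen only for illustration.

```python
import math

# Gauss-Hermite nodes and weights for the weight function exp(-x^2);
# only the 1-point and 3-point rules are tabulated in this sketch.
GH_RULES = {
    1: ([0.0], [math.sqrt(math.pi)]),
    3: ([-math.sqrt(1.5), 0.0, math.sqrt(1.5)],
        [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]),
}

def log_joint(b, y, beta, sigma):
    """log of L(y | beta, b) * L(b | sigma) for a random-intercept logistic model."""
    eta = beta + b
    loglik = sum(y) * eta - len(y) * math.log1p(math.exp(eta))
    logprior = -0.5 * (b / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    return loglik + logprior

def mode_and_curvature(y, beta, sigma, iters=50):
    """Newton iterations for the conditional mode of b.

    Returns the mode and the negative second derivative of log_joint there.
    """
    n, s, b = len(y), sum(y), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-(beta + b)))
        grad = s - n * p - b / sigma ** 2
        neg_hess = n * p * (1 - p) + 1 / sigma ** 2
        b += grad / neg_hess
    p = 1.0 / (1.0 + math.exp(-(beta + b)))
    return b, n * p * (1 - p) + 1 / sigma ** 2

def laplace(y, beta, sigma):
    """Laplace approximation: Gaussian integral matched at the conditional mode."""
    b_hat, neg_hess = mode_and_curvature(y, beta, sigma)
    return math.exp(log_joint(b_hat, y, beta, sigma)) * math.sqrt(2 * math.pi / neg_hess)

def agq(y, beta, sigma, n_nodes):
    """Adaptive Gauss-Hermite quadrature: nodes centred and scaled at the mode."""
    b_hat, neg_hess = mode_and_curvature(y, beta, sigma)
    scale = 1 / math.sqrt(neg_hess)
    nodes, weights = GH_RULES[n_nodes]
    total = 0.0
    for x, w in zip(nodes, weights):
        b = b_hat + math.sqrt(2) * scale * x
        total += w * math.exp(x * x + log_joint(b, y, beta, sigma))
    return math.sqrt(2) * scale * total

def brute_force(y, beta, sigma, half_width=10.0, steps=20000):
    """Reference value: trapezoid rule on a fine grid over +/- half_width * sigma."""
    lo = -half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for k in range(steps + 1):
        wt = 0.5 if k in (0, steps) else 1.0
        total += wt * math.exp(log_joint(lo + k * h, y, beta, sigma))
    return total * h

# Hypothetical cluster: three successes out of three trials, which makes the
# integrand skewed and the Laplace approximation visibly imperfect.
y, beta, sigma = [1, 1, 1], 0.0, 2.0
laplace_val = laplace(y, beta, sigma)
agq1 = agq(y, beta, sigma, 1)
agq3 = agq(y, beta, sigma, 3)
ref = brute_force(y, beta, sigma)
print("Laplace:", laplace_val)
print("AGQ, 1 node:", agq1)
print("AGQ, 3 nodes:", agq3)
print("fine-grid reference:", ref)
```

With a single node the adaptive rule reduces to the Laplace approximation, so the two methods sit on the same precision-versus-cost scale named on the slide: each extra node costs more evaluations of the integrand per cluster. In this toy case the 3-node rule lands closer to the fine-grid reference than Laplace does.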