Dirichlet process Bayesian clustering with the R package PReMiuM
Dr Silvia Liverani, Brunel University London
July 2015
Silvia Liverani (Brunel University London) Profile Regression 1 / 18
Outline
- Motivation
- Method
- R package PReMiuM
- Examples
Many collaborators
- John Molitor (University of Oregon)
- Sylvia Richardson (Medical Research Council Biostatistics Unit)
- Michail Papathomas (University of St Andrews)
- David Hastie
- Aurore Lavigne (University of Lille 3, France)
- Lucy Leigh (University of Newcastle, Australia)
- ...
Motivation
Multicollinearity
- The goal of epidemiological studies is to investigate the joint effect of different covariates / risk factors on a phenotype...
- ... but highly correlated risk factors create collinearity problems!

Example: Researchers are interested in determining whether a relationship exists between blood pressure (y = BP, in mm Hg) and
- weight (x1 = Weight, in kg)
- body surface area (x2 = BSA, in sq m)
- duration of hypertension (x3 = Dur, in years)
- basal pulse (x4 = Pulse, in beats per minute)
- stress index (x5 = Stress)
[Figure: scatter plot with axes Weight and BSA. BP = y, Weight = x1, BSA = x2]
- Highly correlated risk factors create collinearity problems, causing instability in model estimation

Model          β̂1     SE(β̂1)   β̂2       SE(β̂2)
y ∼ x1         2.64   0.30     –        –
y ∼ x2         –      –        3.34     1.33
y ∼ x1 + x2    6.58   0.53     -20.44   2.28

- Effect 1: the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
- Effect 2: the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
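Both effects can be reproduced on simulated data. The sketch below does not use the blood pressure data from the slides. It assumes two synthetic predictors, with x2 built as a noisy copy of x1, and fits ordinary least squares with NumPy. The helper `ols` and all constants are illustrative choices.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)    # assumption: x2 is almost a copy of x1
y = x1 + x2 + rng.normal(size=n)      # both predictors have a true effect of 1

beta_single, se_single = ols(x1, y)                       # y ~ x1
beta_joint, se_joint = ols(np.column_stack([x1, x2]), y)  # y ~ x1 + x2

print("y ~ x1      : beta1 = %.2f (SE %.2f)" % (beta_single[1], se_single[1]))
print("y ~ x1 + x2 : beta1 = %.2f (SE %.2f)" % (beta_joint[1], se_joint[1]))
```

Comparing `beta_single[1]` with `beta_joint[1]` shows the coefficient of x1 shifting once x2 enters the model (Effect 1). Comparing `se_single[1]` with `se_joint[1]` shows the loss of precision (Effect 2).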
Profile Regression
Issues caused by
- correlated risk factors
- interacting risk factors

Profile regression
- partitions the multi-dimensional risk surface into groups having similar risks
- investigates the joint effects of multiple risk factors
- jointly models the covariate patterns and health outcomes
- is a flexible but tractable Bayesian model
Profile Regression
Notation
For individual i
- y_i: outcome of interest
- x_i = (x_i1, ..., x_iP): covariate profile
- w_i: fixed effects
- z_i = c: the allocation variable, indicating the cluster to which individual i belongs
Statistical Framework
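- Joint covariate and response model:

$$f(x_i, y_i \mid \phi, \theta, \psi, \beta) = \sum_{c} \psi_c \, f(x_i \mid z_i = c, \phi_c) \, f(y_i \mid z_i = c, \theta_c, \beta, w_i)$$

- For example, for discrete covariates:

$$f(x_i \mid z_i = c, \phi_c) = \prod_{j=1}^{J} \phi_{z_i, j, x_{i,j}}$$

- For example, for a Bernoulli response:

$$\operatorname{logit}\{p(y_i = 1 \mid \theta_c, \beta, w_i)\} = \theta_c + \beta^T w_i$$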
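To make the mixture concrete, here is a minimal sketch of the joint density for one individual with discrete covariates and a Bernoulli response. It is not PReMiuM code. It assumes a finite number of clusters C, and the function name `joint_density`, the array shapes and the parameter values are illustrative.

```python
import numpy as np
from itertools import product

def joint_density(x, y, w, psi, phi, theta, beta):
    """
    f(x_i, y_i) = sum_c psi_c * f(x_i | z_i = c, phi_c) * f(y_i | z_i = c, theta_c, beta, w_i)

    x     : (J,) integer covariate levels
    y     : 0 or 1
    w     : (Q,) fixed effects
    psi   : (C,) mixture weights
    phi   : (C, J, K) level probabilities per cluster and covariate
    theta : (C,) cluster-specific intercepts
    beta  : (Q,) global fixed-effect coefficients
    """
    x = np.asarray(x)
    J = len(x)
    # Discrete covariates: product over j of phi[c, j, x_j], for every cluster c
    fx = phi[:, np.arange(J), x].prod(axis=1)
    # Bernoulli response: logit p = theta_c + beta^T w
    p = 1.0 / (1.0 + np.exp(-(theta + w @ beta)))
    fy = p if y == 1 else 1.0 - p
    return float(np.sum(psi * fx * fy))

# Hypothetical parameters: C = 2 clusters, J = 2 covariates, K = 2 levels, Q = 1 fixed effect
psi = np.array([0.6, 0.4])
phi = np.array([[[0.9, 0.1], [0.8, 0.2]],
                [[0.2, 0.8], [0.3, 0.7]]])
theta = np.array([-1.0, 1.0])
beta = np.array([0.5])
w = np.array([1.0])

# Sanity check: the density sums to 1 over all (x, y) combinations
total = sum(joint_density(x, y, w, psi, phi, theta, beta)
            for x in product(range(2), repeat=2) for y in (0, 1))
print(total)
```

The sum over all covariate patterns and both outcomes should equal 1 up to floating-point error, which is a quick way to check that the three factors are combined correctly.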
- Prior model for the mixture weights ψ_c
  - stick-breaking priors (constructive definition of the Dirichlet process):

$$P(Z_i = c \mid \psi) = \psi_c, \qquad \psi_1 = V_1, \qquad \psi_c = V_c \prod_{l<c} (1 - V_l), \qquad V_c \sim \mathrm{Beta}(1, \alpha)$$

- The larger the concentration parameter α, the more evenly distributed the resulting distribution.
- The smaller the concentration parameter α, the more sparsely distributed the resulting distribution, with all but a few components having a probability near zero.
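The stick-breaking construction is easy to simulate. The sketch below truncates the process at a finite number of components, which is an assumption made for illustration (the Dirichlet process itself has infinitely many). The function names `weights_from_sticks` and `sample_weights` are hypothetical.

```python
import numpy as np

def weights_from_sticks(v):
    """psi_1 = V_1 and psi_c = V_c * prod_{l<c} (1 - V_l)."""
    v = np.asarray(v, dtype=float)
    # Length of stick remaining before each break: 1, (1-V_1), (1-V_1)(1-V_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

def sample_weights(alpha, n_components, rng):
    """Draw V_c ~ Beta(1, alpha) and convert to truncated mixture weights."""
    return weights_from_sticks(rng.beta(1.0, alpha, size=n_components))

rng = np.random.default_rng(1)
for alpha in (0.5, 5.0, 50.0):
    psi = sample_weights(alpha, n_components=200, rng=rng)
    # Print the five largest weights for each concentration parameter
    print(alpha, np.sort(psi)[::-1][:5].round(3))
```

Printing the largest weights for several values of α gives a feel for how the concentration parameter spreads or concentrates the mass. The truncated weights sum to slightly less than 1, with the shortfall being the stick that is left unbroken.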
R package PReMiuM
Implementation: R package PReMiuM
We have implemented profile regression in C++ within the R package PReMiuM.

- Binary, binomial, categorical, Normal, Poisson and survival outcomes
- Allows for spatial correlation
- Fixed effects (global parameters), including a spatial CAR term
- Normal and/or discrete covariates
- Dependent or independent slice sampling (Kalli et al., 2011) or truncated Dirichlet process model (Ishwaran and James, 2001)
- Fixed alpha or update alpha, or use the Pitman-Yor process prior
- Handles missing data
- Allows users to run predictive scenarios
- Performs post-processing
- Contains plotting functions

Currently working on:
- Quantile profile regression
- Enriched Dirichlet processes
Applications and features of the model
Example: Simulated data
The profiles are given by
- y: outcome, Bernoulli
- x: 5 covariates, all discrete with 3 levels
- w: 2 fixed effects, continuous or discrete
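Data with this structure can be simulated directly from the profile regression model. The sketch below is not the simulation used in the talk. The number of clusters, the weights `psi`, the intercepts `theta`, the coefficients `beta` and the sample size are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_clusters, n_cov, n_levels = 500, 3, 5, 3

# Hypothetical cluster-level parameters
psi = np.array([0.5, 0.3, 0.2])                                    # mixture weights
phi = rng.dirichlet(np.ones(n_levels), size=(n_clusters, n_cov))   # shape (3, 5, 3)
theta = np.array([-2.0, 0.0, 2.0])                                 # cluster intercepts
beta = np.array([0.5, -0.5])                                       # fixed-effect coefficients

# Allocation variable z_i
z = rng.choice(n_clusters, size=n, p=psi)

# Covariates: x[i, j] ~ Categorical(phi[z_i, j, :]) by inverse-CDF sampling
thresholds = phi[z].cumsum(axis=2)[..., :-1]   # first K-1 cumulative probabilities
u = rng.random((n, n_cov, 1))
x = (u > thresholds).sum(axis=2)               # levels coded 0, 1, 2

# Fixed effects: one continuous and one binary
w = np.column_stack([rng.normal(size=n), rng.integers(0, 2, size=n)])

# Bernoulli outcome with logit p = theta_z + beta^T w
p = 1.0 / (1.0 + np.exp(-(theta[z] + w @ beta)))
y = rng.binomial(1, p)
```

Each row of `x` is one covariate profile. Individuals in the same cluster share both the level probabilities in `phi` and the baseline risk `theta`, which is what allows the model to recover groups with similar profiles and similar risks.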
Survival response with censoring: sleep study
[Figure: survival curves. x-axis: Time (0 to 5); y-axis: Survival (0.0 to 1.0)]
Variable selection
Spatially correlated response: deprivation in London
[Figure: map of London. Legend intervals: [−39.1, −7.73], (−7.73, −2.04], (−2.04, 2.47], (2.47, 7.88], (7.88, 31.1]]
References
- Liverani, S., Hastie, D. I., Azizi, L., Papathomas, M. and Richardson, S. (2015) PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Journal of Statistical Software, 64(7), 1-30.
- Molitor, J. T., Papathomas, M., Jerrett, M. and Richardson, S. (2010) Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics, 11, 484-498.
- Molitor, J., Brown, I. J., Papathomas, M., Molitor, N., Liverani, S., Chan, Q., Richardson, S., Van Horn, L., Daviglus, M. L., Stamler, J. and Elliott, P. (2014) Blood pressure differences associated with DASH-like lower sodium compared with typical American higher sodium nutrient profile: INTERMAP USA. Hypertension.
- Hastie, D. I., Liverani, S. and Richardson, S. (2015) Sampling from Dirichlet process mixture models with unknown concentration parameter: Mixing issues in large data implementations. To appear in Statistics and Computing.
- Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P. (2011) Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non-smokers. Environmental Health Perspectives, 119, 84-91.
- Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S. (2012) Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology.