
Likelihood-Based Finite Mixture Models for Ordinal Data

Daniel Fernandez
Fundació Sant Joan de Déu ⇒ Universitat Politècnica de Catalunya
df.martinez@pssjd.org

Seminari del Servei d'Estadística Aplicada & Grup de Recerca Advanced Stochastic Modelling
Universitat Autònoma de Barcelona, Feb 27th, 2020

Acknowledgments

- Servei d'Estadística Aplicada & Grup de Recerca Advanced Stochastic Modelling, Universitat Autònoma de Barcelona
- School of Mathematics and Statistics, Victoria University of Wellington, New Zealand:
  Prof. Richard Arnold, Prof. Emer. Shirley Pledger, A/Prof. Ivy Liu

Outline

1. Background.
   - Ordinal data. Motivation.
   - Stereotype model. Definition and standard models.
   - Stereotype model including clustering.
2. Model fitting.
3. Example.
   - Level of depression data.
   - Ordinal data visualization: spaced mosaic plots and fuzziness.
4. Bayesian inference approach: RJMCMC.
5. Summary.

1. Ordinal data

The response variable has ordinal categorical scales. Ordinal data are widely used in areas such as marketing, social, medical, and ecological science.

- Pain scale.
- Likert scale: "strongly disagree", "disagree", "agree", or "strongly agree" in a survey.
- The Braun-Blanquet cover-abundance scale is very common in vegetation analysis.
- The degree of dissimilarity among the different levels of the scale is not necessarily always the same.

1. Ordinal Data and Goal

- Data are represented as a matrix Y with dimensions n × m (n could be questions, m could be subjects), where

  yij ∈ {1, ..., q},  i = 1, ..., n,  j = 1, ..., m,  with q categories.

- Note: no covariates available.
- For example, questionnaires to assess levels of depression:
  - n = 13 questions (rows).
  - m = 151 individuals (columns).
  - q = 4 categories: 1 to 4, with higher scores indicating higher levels of depression.


- Goals:
  - Can we group patients/questions together?
  - Which questions or patients tend to be linked with higher values of the ordinal response?

1. Motivation

- Minimal research on clustering methods focused on ordinal data.
- Most current methods are based on mathematical techniques (e.g. distance-based algorithms) ⇒ neither statistical inference nor model selection.
- Recent work (Fernandez et al., 2016): fuzzy biclustering via finite mixture models for ordinal data ⇒ statistical inference and model selection.


1. Motivation

Source: David Sontag, NYU

- Clusters may overlap.
- Some clusters may be "wider" than others.
- Distances can be deceiving!
- Try a probabilistic model:
  - allows overlaps
  - allows clusters of different size
  - allows a soft/fuzzy clustering

1. Motivation

Figure: Hard clustering vs. fuzzy clustering.

1. Model-based clustering

- Model-based clustering: clustering via statistical models, typically finite mixture models (FMM).
- Finite mixture models: a way of clustering that reduces dimensionality and identifies patterns related to the heterogeneity of the data (e.g. rows/columns with a similar effect on the response).

1. Model-based clustering

Figure: Mixture of component densities; the red line is what we see.

1. Model-based clustering

- Our research: model-based clustering for ordinal data, with components within the FMM ⇒ stereotype model.

1. Stereotype Model. Formulation

- Stereotype model (Anderson, J. A., 1984):

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + (φk β′)x,  k = 2, ..., q

  These are the q − 1 log odds for categories k and 1, with the first category as baseline.
- β: the coefficient vector on the covariates is assumed to be the same for all categories.
- φk: "score" for the response category k.

1. Stereotype Model. Formulation

- Stereotype model:

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + (φk β′)x,  k = 2, ..., q

- Nothing in the stereotype model so far treats the response as ordinal.
- Including an increasing order constraint (Anderson, J. A., 1984),

  0 = φ1 ≤ φ2 ≤ ... ≤ φq = 1,

  captures the ordinal nature of the outcomes.
- The model has received more attention after Agresti (2010, Ch. 4) discussed it in his book.
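To make the formulation concrete, here is a minimal R sketch (not from the talk) of the stereotype response probabilities; all parameter values are invented for illustration.

```r
## Minimal sketch (illustrative, not fitted values): response probabilities
## P[y = k | x] of the stereotype model for a single linear predictor
## eta = beta'x, assuming mu[1] = 0 and 0 = phi[1] <= ... <= phi[q] = 1.
stereotype_probs <- function(mu, phi, eta) {
  num <- exp(mu + phi * eta)   # unnormalised P[y = k], k = 1, ..., q
  num / sum(num)               # normalise so the q probabilities sum to 1
}

mu  <- c(0, 0.5, -0.2, -1.0)   # hypothetical intercepts, q = 4
phi <- c(0, 0.35, 0.85, 1)     # monotone scores
stereotype_probs(mu, phi, eta = 1.2)
```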


1. Stereotype Model. Scores φk Interpretation

The fitted score parameters φk determine the spacing among the categories.

Level of depression data: φ1 = 0, φ2 = 0.347, φ3 = 0.852, φ4 = 1.

1. Stereotype Model. Scores φk Interpretation

Stereotype model for categories a and b:

log( P[yij = a | x] / P[yij = b | x] ) = log( (P[yij = a | x] / P[yij = 1 | x]) / (P[yij = b | x] / P[yij = 1 | x]) )
                                       = (µa − µb) + (φa − φb)β′x,

with 0 = φ1 ≤ ... ≤ φa ≤ ... ≤ φb ≤ ... ≤ φq = 1.

- The larger the difference (φa − φb) ⇒ the more the odds of a and b are influenced by x.
- If φa = φb ⇒ the logit is the constant µa − µb ⇒ the covariates x do not distinguish between a and b ⇒ we could collapse categories a and b in our data.
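A small R illustration of this interpretation, using the fitted depression scores quoted later in the talk; the µ values here are made up.

```r
## log(P[y = a | x] / P[y = b | x]) = (mu_a - mu_b) + (phi_a - phi_b) * beta'x
phi <- c(0, 0.347, 0.852, 1)                 # fitted scores from the talk
mu  <- c(0, 0.4, 0.1, -0.3)                  # hypothetical intercepts
log_odds <- function(a, b, eta) (mu[a] - mu[b]) + (phi[a] - phi[b]) * eta
## phi[4] - phi[3] = 0.148, so the odds of category 4 vs 3 move little with eta:
log_odds(4, 3, eta = 0); log_odds(4, 3, eta = 2)
```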

1. Stereotype Model. Software

- STATA module SOREG (Lunt, 2001).
- R package ordinalgmifs (Archer et al., 2014).
- R package VGAM (Yee, 2008): not able to impose the monotone constraint on the score parameters.
- R package clustord (Fernandez and Ryan, soon on CRAN): https://github.com/vuw-clustering/clustord

1. Stereotype Model. Main effects

- Stereotype model:

  log( P[yij = k | x] / P[yij = 1 | x] ) = µk + φk β′x,  k = 2, ..., q

- Build up β′x considering the row and column effects of the yij (Fernandez et al., 2016).

1. Stereotype Model. Main effects

- Main effects model:

  log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),  k = 2, ..., q,  i = 1, ..., n,  j = 1, ..., m

- αi: interpreted as the effect of the rows.
- βj: interpreted as the effect of the columns.
- Identifiability constraints: ∑i αi = ∑j βj = 0, µ1 = 0, and 0 = φ1 ≤ φ2 ≤ ... ≤ φq = 1.
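As a sketch of how the main effects model generates cell probabilities, the following base-R function (all parameter values invented) builds P[yij = k] for every cell under the identifiability constraints above.

```r
## Sketch: P[y_ij = k] under the main effects model mu_k + phi_k (alpha_i + beta_j).
main_effects_probs <- function(mu, phi, alpha, beta) {
  alpha <- alpha - mean(alpha)              # enforce sum_i alpha_i = 0
  beta  <- beta  - mean(beta)               # enforce sum_j beta_j  = 0
  n <- length(alpha); m <- length(beta); q <- length(mu)
  P <- array(NA_real_, dim = c(n, m, q))
  for (i in seq_len(n)) for (j in seq_len(m)) {
    num <- exp(mu + phi * (alpha[i] + beta[j]))
    P[i, j, ] <- num / sum(num)             # q probabilities per cell
  }
  P
}

P <- main_effects_probs(mu = c(0, 0.5, -0.2, -1), phi = c(0, 0.35, 0.85, 1),
                        alpha = rnorm(5), beta = rnorm(8))
range(apply(P, c(1, 2), sum))               # every cell's probabilities sum to 1
```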

1. Model-based clustering

- The main effects model has 2q + n + m − 5 independent parameters:

  log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),  k = 2, ..., q,  i = 1, ..., n,  j = 1, ..., m

- Avoid αi + βj, which overspecifies the data structure ⇒ clustering via finite mixture models in order to reduce dimensionality (McLachlan, G. and Peel, D., 2000).

1. Model-based clustering - Column clustering

For example, column clustering: we change from the main effects model

log( P[yij = k] / P[yij = 1] ) = µk + φk(αi + βj),  j = 1, ..., m

to

log( P[yij = k | j ∈ c] / P[yij = 1 | j ∈ c] ) = µk + φk(αi + βc),  c = 1, ..., C < m,

where βc is interpreted as the effect of the column cluster c.


1. Model-based clustering. Biclustering

- General formulation of model-based clustering (biclustering):

  log( P[yij = k | i ∈ r, j ∈ c] / P[yij = 1 | i ∈ r, j ∈ c] ) = µk + φk(αr + βc),  k = 2, ..., q

- αr: interpreted as the effect of the row cluster r.
- βc: interpreted as the effect of the column cluster c.
- Constraints: α1 = β1 = 0 (or ∑r αr = ∑c βc = 0) and 0 = φ1 ≤ φ2 ≤ ... ≤ φq = 1.
- The formulation is similar to a latent class model.
- Further, αr + βc can be extended to αr + βc + γrc.
- The model provides a simultaneous fuzzy clustering of the rows and columns.


1. Model-based clustering - Column clustering

The main effects stereotype model has likelihood

L(Ω | {yij}) = ∏_{i=1}^{n} ∏_{j=1}^{m} ∏_{k=1}^{q} ( P[yij = k] )^{I[yij = k]},

and with the column clustering model this turns into

L(Ω | {yij}) = ∏_{j=1}^{m} [ ∑_{c=1}^{C} κc ∏_{i=1}^{n} ∏_{k=1}^{q} ( P[yic = k] )^{I[yij = k]} ],

where κc is the proportion of columns in column group c.
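A hedged base-R sketch of this column-clustering log-likelihood; theta is assumed to be a precomputed n × C × q array with theta[i, c, k] = P[yij = k | j ∈ c], and a log-sum-exp trick guards against underflow.

```r
## Column-clustering mixture log-likelihood for an n x m matrix Y of
## categories 1..q, given theta[i, c, k] and mixing proportions kappa.
col_cluster_loglik <- function(Y, theta, kappa) {
  n <- nrow(Y); C <- length(kappa)
  ll <- 0
  for (j in seq_len(ncol(Y))) {
    ## log( kappa_c * prod_i P[y_ij | column cluster c] ) for each c
    lc <- log(kappa)
    for (c in seq_len(C))
      lc[c] <- lc[c] + sum(log(theta[cbind(seq_len(n), c, Y[, j])]))
    ll <- ll + max(lc) + log(sum(exp(lc - max(lc))))   # log-sum-exp over c
  }
  ll
}
```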

1. Model-based clustering

- Problem: missing information.
- We do not know the actual membership of the columns (rows), nor the number of column (row) clusters.

2. Model fitting

- EM algorithm for finding the ML solution for the parameters of models with missing information (the actual unknown cluster memberships of the rows and columns).
- Information criteria (AIC, BIC, ...).
- Comprehensive simulation study (4500 scenarios) testing 12 information criteria (Fernandez and Arnold, 2016).

2. Model fitting

Table: Information criteria summary table

Criterion | Definition                                                | Proposed for      | Depending on
AIC       | −2ℓ + 2K                                                  | Regression models | Number of parameters
AICc      | AIC + 2K(K + 1)/(n − K − 1)                               | Regression models | Number of parameters and sample size
AICu      | AICc + n log( n/(n − K − 1) )                             | Regression models | Number of parameters and sample size
CAIC      | −2ℓ + K(1 + log(n))                                       | Regression models | Number of parameters and sample size
BIC       | −2ℓ + K log(n)                                            | Regression models | Number of parameters and sample size
AIC3      | −2ℓ + 3K                                                  | Clustering        | Number of parameters
CLC       | −2ℓ + 2EN(R)                                              | Clustering        | Entropy
NEC(R)    | EN(R) / (ℓ(R) − ℓ(1))                                     | Clustering        | Entropy
ICL-BIC   | BIC + 2EN(R)                                              | Clustering        | Number of parameters, sample size and entropy
AWE       | −2ℓc + 2K(3/2 + log(n))                                   | Clustering        | Number of parameters, sample size and entropy
L         | −ℓ − (K/2) ∑R log(nπR/12) − (R/2) log(n/12) − R(K + 1)/2  | Clustering        | Number of parameters, sample size and mixing proportions

Notes: n represents the sample size, K the number of parameters, R the number of clusters, πR the mixing cluster proportions, ℓ the log-likelihood, and EN(·) the entropy function.
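For reference, several of these criteria translate directly into one-line R helpers (a sketch; ell is the maximised log-likelihood, K the number of parameters, n the sample size, and z the fuzzy posterior membership matrix).

```r
aic     <- function(ell, K)       -2 * ell + 2 * K
aicc    <- function(ell, K, n)    aic(ell, K) + 2 * K * (K + 1) / (n - K - 1)
bic     <- function(ell, K, n)    -2 * ell + K * log(n)
aic3    <- function(ell, K)       -2 * ell + 3 * K
entropy <- function(z)            -sum(z * log(pmax(z, 1e-12)))  # EN(R)
icl_bic <- function(ell, K, n, z) bic(ell, K, n) + 2 * entropy(z)
```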

2. Model fitting. Simulation study

- Simulated data with a known true number of row clusters.
- General results: percentage of cases in which each criterion determines the true number of row clusters (Fit).

2. Model fitting. One-dimensional Clustering

Table: Top 5. Overall results. One-dimensional clustering

Criterion | Overall | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5
AIC       | 93.8%   | 91.4%      | 97.6%      | 88.0%      | 92.9%      | 99.1%
AICc      | 89.8%   | 90.2%      | 94.8%      | 74.7%      | 91.1%      | 98.2%
AICu      | 82.4%   | 79.0%      | 80.0%      | 66.7%      | 88.0%      | 98.2%
AIC3      | 67.7%   | 61.7%      | 65.6%      | 56.7%      | 56.4%      | 98.2%
BIC       | 43.7%   | 41.2%      | 39.1%      | 40.0%      | 39.6%      | 58.7%

2. Model fitting. Biclustering

Table: Top 5. Overall results. Biclustering

Criterion | Overall | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 | Scenario 5
AIC       | 86.1%   | 89.2%      | 82.3%      | 80.5%      | 85.5%      | 92.8%
AICc      | 85.6%   | 89.2%      | 81.5%      | 80.0%      | 84.5%      | 92.8%
AICu      | 84.2%   | 84.8%      | 80.7%      | 79.3%      | 83.3%      | 92.8%
AIC3      | 71.2%   | 75.8%      | 65.5%      | 64.7%      | 66.5%      | 83.3%
BIC       | 36.5%   | 34.5%      | 35.2%      | 33.5%      | 32.3%      | 47.2%

2. Model fitting

- Conclusion of the comprehensive simulation study (4500 scenarios, 12 information criteria; Fernandez and Arnold, 2016) ⇒ AIC is the best criterion.

2. Model fitting

- Two possible Bayesian approaches:
  - "Fixed" dimension: Metropolis-Hastings and Gibbs sampler.
  - Variable dimension: Reversible Jump MCMC (RJMCMC; Green, P. J., 1995).
- RJMCMC ⇒ the number of components (the dimension) is itself a parameter.
- Convergence diagnostic: Castelloe and Zimmerman's method.

2. Model fitting packages

- Model-based clustering for ordinal data:
  - R package clustord (Fernandez and Ryan, soon on CRAN): https://github.com/vuw-clustering/clustord
- Model-based clustering for mixed-type data:
  - R package clustMD (McParland and Gormley, 2017): https://cran.r-project.org/web/packages/clustMD/clustMD.pdf
- Model-based clustering for Gaussian data:
  - R package mclust (Scrucca, 2019): https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html

3. Example

Level of Depression Data Set

3. Example. Level of Depression Dataset

- Patients admitted for deliberate self-harm at the medical departments of 3 major hospitals in Eastern Norway.
- Questionnaire designed to assess the level of depression.
- 13 questions (rows), 151 patients (columns).
- Ordinal data: 4 categories, from 1 (lower level) to 4 (higher level).
- For instance, "Sadness":

  yij = 1  I do not feel sad
        2  I feel sad most of the time
        3  I am sad all the time
        4  I am so sad or unhappy that I can't stand it

- Possible research questions:
  - Can we group patients/questions together?
  - Which questions or patients are similar?
  - Which questions or patients tend to be linked with higher values of the ordinal response?

3. Results Model Fitting - EM algorithm

Table: Level of Depression. Model Fitting (1/3)

Model          | Predictor              | R | C | npar | AIC    | AICc   | BIC    | ICL-BIC
Null effects   | µk + φk                | 1 | 1 |  5   | 441.63 | 441.81 | 460.71 | 460.71
Row effects    | µk + φkαi              | n | 1 | 16   | 428.81 | 430.52 | 489.89 | 489.89
Column effects | µk + φkβj              | 1 | m | 32   | 463.85 | 470.82 | 586.00 | 586.00
Main effects   | µk + φk(αi + βj)       | n | m | 43   | 422.54 | 421.50 | 547.67 | 547.67
Row clustering | µk + φkαr              | 2 | 1 |  7   | 415.70 | 416.04 | 442.42 | 442.49
               |                        | 3 | 1 |  9   | 419.42 | 419.97 | 453.77 | 470.37
               |                        | 4 | 1 | 11   | 423.36 | 424.17 | 465.35 | 481.86
               |                        | 5 | 1 | 13   | 427.40 | 428.53 | 477.02 | 496.25
               |                        | 6 | 1 | 15   | 430.96 | 432.46 | 488.22 | 488.24
               | µk + φk(αr + βj)       | 2 | m | 34   | 431.02 | 438.92 | 560.80 | 572.87
               |                        | 3 | n | 20   | 435.91 | 444.82 | 573.33 | 594.32
               |                        | 4 | n | 22   | 439.57 | 449.55 | 584.62 | 593.90
               |                        | 5 | n | 24   | 443.91 | 455.03 | 596.60 | 599.43
               |                        | 6 | n | 26   | 447.69 | 460.02 | 608.01 | 618.21
               | µk + φk(αr + βj + γrj) | 2 | m | 61   | 406.22 | 423.83 | 629.06 | 639.08
               |                        | 3 | n | 42   | 424.71 | 491.57 | 668.25 | 776.26
               |                        | 4 | n | 55   | 426.25 | 558.47 | 680.49 | 681.49
               |                        | 5 | n | 68   | 549.95 | 585.80 | 681.88 | 684.89
               |                        | 6 | n | 81   | 531.77 | 630.58 | 707.40 | 717.40

3. Results Model Fitting - EM algorithm

Table: Level of Depression. Model Fitting (2/3)

Model             | Predictor              | R | C | npar | AIC    | AICc   | BIC    | ICL-BIC
Column clustering | µk + φkβc              | 1 | 2 |  7   | 412.46 | 412.81 | 439.18 | 463.05
                  |                        | 1 | 3 |  9   | 418.12 | 418.67 | 452.47 | 482.00
                  |                        | 1 | 4 | 11   | 421.90 | 422.71 | 463.89 | 515.37
                  |                        | 1 | 5 | 13   | 426.43 | 427.56 | 476.06 | 507.19
                  |                        | 1 | 6 | 15   | 429.96 | 431.46 | 487.22 | 547.28
                  | µk + φk(αi + βc)       | n | 2 | 18   | 410.13 | 415.81 | 520.82 | 526.18
                  |                        | n | 3 | 20   | 397.28 | 409.28 | 561.54 | 565.73
                  |                        | n | 4 | 22   | 401.23 | 413.55 | 607.22 | 609.89
                  |                        | n | 5 | 24   | 412.15 | 447.29 | 671.71 | 675.77
                  |                        | n | 6 | 26   | 460.91 | 513.21 | 770.10 | 772.98
                  | µk + φk(αi + βc + γic) | n | 2 | 29   | 534.06 | 538.66 | 664.21 | 669.38
                  |                        | n | 3 | 42   | 436.57 | 439.24 | 512.92 | 542.04
                  |                        | n | 4 | 55   | 440.43 | 443.66 | 524.41 | 549.82
                  |                        | n | 5 | 68   | 444.03 | 447.89 | 535.64 | 554.73
                  |                        | n | 6 | 81   | 450.14 | 454.68 | 549.38 | 595.48

3. Results Model Fitting - EM algorithm

Table: Level of Depression. Model Fitting (3/3)

Model        | Predictor              | R | C | npar | AIC    | AICc   | BIC    | ICL-BIC
Biclustering | µk + φk(αr + βc)       | 2 | 2 |  9   | 421.76 | 422.31 | 456.11 | 498.31
             |                        | 2 | 3 | 11   | 419.64 | 420.20 | 454.00 | 490.75
             |                        | 2 | 4 | 13   | 425.74 | 426.88 | 475.37 | 549.88
             |                        | 2 | 5 | 15   | 431.31 | 432.81 | 488.56 | 572.19
             |                        | 3 | 2 | 11   | 423.22 | 424.03 | 465.20 | 517.86
             |                        | 3 | 3 | 13   | 476.66 | 477.79 | 501.77 | 526.29
             |                        | 3 | 4 | 15   | 439.87 | 441.37 | 497.13 | 522.80
             |                        | 3 | 5 | 17   | 435.21 | 437.13 | 500.10 | 567.88
             |                        | 4 | 2 | 13   | 482.98 | 484.11 | 492.13 | 532.60
             |                        | 4 | 3 | 15   | 433.70 | 435.20 | 490.96 | 550.30
             |                        | 4 | 4 | 17   | 435.22 | 437.14 | 500.11 | 571.15
             |                        | 4 | 5 | 19   | 464.04 | 466.44 | 536.56 | 568.45
             | µk + φk(αr + βc + γrc) | 2 | 2 | 10   | 427.97 | 429.10 | 477.59 | 527.43
             |                        | 2 | 3 | 13   | 422.00 | 422.68 | 460.17 | 486.88
             |                        | 2 | 4 | 16   | 434.39 | 436.09 | 495.46 | 520.85
             |                        | 2 | 5 | 19   | 438.61 | 441.01 | 511.13 | 538.56
             |                        | 3 | 2 | 13   | 497.76 | 498.89 | 505.27 | 547.38
             |                        | 3 | 3 | 17   | 433.91 | 435.84 | 498.80 | 540.76
             |                        | 3 | 4 | 21   | 441.89 | 444.83 | 522.05 | 559.23
             |                        | 3 | 5 | 25   | 453.08 | 457.27 | 548.50 | 615.81
             |                        | 4 | 2 | 16   | 445.85 | 447.55 | 506.92 | 528.75
             |                        | 4 | 3 | 21   | 448.82 | 451.76 | 528.98 | 538.18
             |                        | 4 | 4 | 26   | 468.71 | 473.25 | 567.95 | 622.25
             |                        | 4 | 5 | 31   | 530.60 | 537.12 | 619.79 | 648.93

3. Results. Model Selection. AIC

- Best AIC model: column clustering model with C = 3 groups of patients.

3. Results. Common Visualisation Tools

Figure: Level of Depression: Column Clustering with C=3 patient groups

3. Results. Common Visualisation Tools

Figure: Level of Depression C=3: Distribution in each group

The proportion of individuals in each cluster who had at least one episode of DSH (deliberate self-harm, i.e. a predictor of suicide; Hawton et al. 2013) within 3 months is: 3.4%, 16%, and 28%.

3. Results. More Visualisation Tools

The fitted score parameters φk determine the spacing among the categories.

Level of depression data: φ1 = 0, φ2 = 0.347, φ3 = 0.852, φ4 = 1.

3. Spaced Mosaic Plot (Fernandez et al, 2015)

- No row (question) or column (individual) groups.
- Overall distribution ⇒ frequency of each ordinal category.
- Level 2 ⇒ most common. Level 4 ⇒ least common.

3. Spaced Mosaic Plot (Fernandez et al, 2015)

Figure: Level of depression data: mosaic plot for the stereotype model including column clustering with C = 3 column (patient) clusters.

3. Spaced Mosaic Plot (Fernandez et al, 2015)

- Column clusters ⇒ 3 horizontal bands.
- Height of each band ⇒ proportional to the number of patients per group (C1 = 8.6 + 21.6 + 7.8 + 4.2 = 42.2%).
- Area of each block ⇒ frequency of the 4 ordinal categories per cluster (e.g. patients of C2 ⇒ strong preference for responses at Level 1).
- Horizontal separation between blocks ⇒ spacing between adjacent ordinal categories (φ1 = 0, φ2 = 0.347, φ3 = 0.852, φ4 = 1).
- Levels 3 and 4 are very similar: φ4 − φ3 = 1 − 0.852 = 0.148.

3. Results. More Visualisation Tools. Fuzziness

Figure: Contour plot depicting the fuzzy clustering structure with C = 3 patient clusters. The left panel is without any sorting; in the right panel both axes are sorted by patient cluster.

The plot shows the probability that two patients are classified in the same cluster.
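One standard way to compute this quantity from the EM output is the following sketch, assuming z is the m × C matrix of posterior membership probabilities for the patients.

```r
## P(patients j and j' in the same cluster) ~= sum_c z[j, c] * z[j', c];
## tcrossprod(z) returns the full m x m matrix of these probabilities.
same_cluster_prob <- function(z) tcrossprod(z)
```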

4. Bayesian Inference Approach

4. Developing RJMCMC. DAG

Figure: Directed acyclic graph: hierarchical stereotype mixture model, one-dimensional clustering. "TrGeometric" refers to a truncated Geometric distribution.

4. Developing RJMCMC. Sweep

Table: RJMCMC Moves

Block                 | Move  | Proposal constants | Pr(Move)         | Move type
1. Hyperparameters    | σ²µ   | νσµ = 3, δσµ = 40  | 1                | M-H
                      | σ²α   | νσα = 3, δσα = 40  | 1                | M-H
                      | σ²β   | νσβ = 3, δσβ = 40  | 1                | M-H
2. General parameters | µk    | σ²µp = 0.3         | 1                | M-H
                      | φk    |                    | 1                | M-H
                      | βj    | σ²βp = 0.3         | 1                | M-H
3. Cluster parameters | αr    | σ²αp = 0.3         | pα = 0.35        | M-H
                      | πr    | σ²πp = 0.3         | pπ = 0.35        | M-H
                      | Split | p = 0.3            | pS = p ρ/(1 + ρ) | RJ
                      | Merge |                    | pM = p 1/(1 + ρ) | RJ

4. Developing RJMCMC. Split Step

- Split and Merge steps involve αR and πR.
- Steps have to be reversible and keep the constraints (∑_{r=1}^{R} αr = 0, ∑_{r=1}^{R} πr = 1).
- Split move:
  1. Draw u1, u2 ~ U(0, 1) and one r ∈ {1, ..., R}.
  2. New parameters:

     αr^(t) = u1 αr^(t−1),   α(r+1)^(t) = (1 − u1) αr^(t−1)
     πr^(t) = u2 πr^(t−1),   π(r+1)^(t) = (1 − u2) πr^(t−1)

  3. Increase R by 1.
  4. Relabel r + 1, ..., R as r + 2, ..., R + 1.


4. Developing RJMCMC. Merge Step

- Split and Merge moves involve αR and πR.
- Moves have to be reversible and keep the constraints (∑_{r=1}^{R} αr = 0, ∑_{r=1}^{R} πr = 1).
- Merge move:
  1. Draw one random component r ∈ {1, ..., R − 1}.
  2. Select the adjacent component r + 1.
  3. New parameters:

     αr^(t) = αr^(t−1) + α(r+1)^(t−1)
     πr^(t) = πr^(t−1) + π(r+1)^(t−1)

  4. Reduce R by 1.
  5. Relabel r + 2, ..., R as r + 1, ..., R − 1.
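A minimal R sketch of these two proposals (the RJMCMC acceptance probability, with its Jacobian term, is omitted); both preserve ∑r αr = 0 and ∑r πr = 1 by construction.

```r
## Split component r into components r and r+1 using u1, u2 ~ U(0, 1).
split_move <- function(alpha, pi, r) {
  u1 <- runif(1); u2 <- runif(1)
  alpha <- append(alpha, (1 - u1) * alpha[r], after = r)  # new component r+1
  pi    <- append(pi,    (1 - u2) * pi[r],    after = r)
  alpha[r] <- u1 * alpha[r]
  pi[r]    <- u2 * pi[r]
  list(alpha = alpha, pi = pi)
}

## Merge components r and r+1 (the reverse move).
merge_move <- function(alpha, pi, r) {
  alpha[r] <- alpha[r] + alpha[r + 1]
  pi[r]    <- pi[r]    + pi[r + 1]
  list(alpha = alpha[-(r + 1)], pi = pi[-(r + 1)])
}
```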

4. Example. Level of depression data set. RJMCMC

Figure: Level of Depression: Dimension (Column) visits

4. Example. Level of depression data set. RJMCMC

Figure: Level of Depression C=3: Distribution in each group

5. Summary. Conclusions

- Clustering rows (columns) for ordinal data allows us to:
  - Describe data with fewer parameters than current methods.
  - Identify similar rows (i.e. questions) and/or similar columns (i.e. subjects).
  - Find an a posteriori classification.
- Likelihood-based stereotype models ⇒ inference and model comparison.
- The fitted score parameters φk give the spacing among ordinal categories, dictated by the data.
- Data visualisation tools for ordinal clustered data: spaced mosaic plots, fuzziness.
- Model fitting ⇒ EM algorithm (AIC), RJMCMC (number of cluster components as a parameter).

References

- Anderson, J. A. (1984). Regression and ordered categorical variables. JRSS Series B, 46(1):1-30.
- Castelloe, J. and Zimmerman, D. (2002). Convergence assessment for RJMCMC samplers. Technical Report 313, SAS Institute, Cary, North Carolina.
- Fernandez, D., Pledger, S. and Arnold, R. (2014). Introducing spaced mosaic plots. Research Report Series, ISSN 1174-2011, 14-3, MSOR, VUW.
- Fernandez, D., Arnold, R. and Pledger, S. (2016). Mixture-based clustering for the ordered stereotype model. CSDA, 93, 46-75.
- Green, P. J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika, 82:711-732.
- McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics.
- Pledger, S. and Arnold, R. (2014). Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. CSDA.
- Stephens, M. (2000). Dealing with label switching in mixture models. JRSS Series B, 62, 795-809.

Thank you

Thank you for listening!


Extra Slides

1. Stereotype Model. Response Probabilities

The stereotype model can also be described in terms of the response probabilities

P[yij = k | x] = exp(µk + φk(β′x)) / ∑_{ℓ=1}^{q} exp(µℓ + φℓ(β′x)),  k = 1, ..., q,

where the probability of the baseline category is

P[yij = 1 | x] = 1 − ∑_{ℓ=2}^{q} P[yij = ℓ | x].

Reparametrization of the score parameters. The constraint

0 = φ1 ≤ φ2 ≤ ... ≤ φq = 1

is transformed to

−∞ ≤ ν2 ≤ ν3 ≤ ... ≤ ν(q−1) ≤ ∞,  where νk = logit(φk).

The previous expression may be redefined as

−∞ ≤ ν2 ≤ ν2 + e^{z3} ≤ ... ≤ ν(q−2) + e^{z(q−1)} ≤ ∞,

i.e., νk = ν(k−1) + e^{zk} for −∞ < zk < ∞, k = 3, ..., q − 1.

The inverse parametrization is

φk = 0                                        for k = 1,
φk = 1 / (1 + e^{−ν2})                        for k = 2,
φk = expit[ logit(φ2) + ∑_{ℓ=3}^{k} e^{zℓ} ]  for k = 3, ..., q − 1,
φk = 1                                        for k = q.        (1)
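A sketch of the inverse map (1) in R: unconstrained (ν2, z3, ..., z(q−1)) in, monotone scores out.

```r
expit <- function(x) 1 / (1 + exp(-x))

## Map nu2 and z = (z_3, ..., z_{q-1}) to 0 = phi_1 <= ... <= phi_q = 1.
z_to_phi <- function(nu2, z = numeric(0)) {
  nu <- nu2 + c(0, cumsum(exp(z)))   # nu_2 <= nu_3 <= ... <= nu_{q-1}
  c(0, expit(nu), 1)                 # fixed endpoints phi_1 = 0, phi_q = 1
}

z_to_phi(nu2 = -0.5, z = 0.2)        # q = 4: four ordered scores
```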

Stereotype model reformulated as adjacent-categories logit:

log( P[yij = k | x] / P[yij = k + 1 | x] ) = (µk − µ(k+1)) + (φk − φ(k+1))δ′x = ηk + ϑk δ′x,  k = 1, ..., q − 1,

where

ηk = µk − µ(k+1),  k = 1, ..., q − 1,

and the relation between φk and ϑk is given by

ϑk = φk − φ(k+1),  k = 1, ..., q − 1,  so that  φk = −∑_{t=1}^{k−1} ϑt.

The adjacent-categories logit model is a particular case of the ordered stereotype model when ϑk is a constant such that ϑk < 1 (i.e., the φk are fixed and equally spaced).
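The φk ↔ ϑk relation is a one-liner in each direction; a quick round-trip check in R with the fitted depression scores:

```r
phi_to_theta <- function(phi)   head(phi, -1) - tail(phi, -1)  # theta_k = phi_k - phi_{k+1}
theta_to_phi <- function(theta) c(0, -cumsum(theta))           # phi_1 = 0

phi <- c(0, 0.347, 0.852, 1)
all.equal(theta_to_phi(phi_to_theta(phi)), phi)                # TRUE
```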

Weighted average of the fitted scores

- Fitted response probabilities with the estimated parameters over the R row groups and the q categories:

  P[yij = k | i ∈ r] = exp(µk + φk(αr + βj)) / ∑_{ℓ=1}^{q} exp(µℓ + φℓ(αr + βj)),
  i = 1, ..., n,  j = 1, ..., m,  k = 1, ..., q,  r = 1, ..., R.

- Weighted average over the q categories for each row cluster:

  ȳʳij = ∑_{k=1}^{q} k × P[yij = k | i ∈ r],  i = 1, ..., n,  j = 1, ..., m,  r = 1, ..., R.

- Weighted average using the fitted conditional probabilities ẑir:

  ȳij = ∑_{r=1}^{R} ẑir × ȳʳij,  i = 1, ..., n,  j = 1, ..., m.

- Mean of ȳij over the m columns:

  ȳi. = (1/m) ∑_{j=1}^{m} ȳij,  i = 1, ..., n.
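These four quantities chain together directly; a base-R sketch, assuming theta is an n × m × q × R array of fitted probabilities P[yij = k | i ∈ r] and z the n × R matrix of fitted memberships ẑir.

```r
fitted_score_means <- function(theta, z) {
  q <- dim(theta)[3]; R <- dim(theta)[4]
  ## ybar_r[i, j, r] = sum_k k * P[y_ij = k | i in r]
  ybar_r <- apply(sweep(theta, 3, seq_len(q), `*`), c(1, 2, 4), sum)
  ## ybar[i, j] = sum_r z[i, r] * ybar_r[i, j, r]
  ybar <- matrix(0, dim(theta)[1], dim(theta)[2])
  for (r in seq_len(R)) ybar <- ybar + z[, r] * ybar_r[, , r]
  list(ybar = ybar, ybar_i = rowMeans(ybar))   # ybar_i = mean over columns
}
```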

1. Finite Mixtures with Stereotype Model. Example: EM - Row clustering

Define the unknown group memberships as latent variables Zir = I[i ∈ r] (i = 1, ..., n, r = 1, ..., R), which satisfy ∑_{r=1}^{R} Zir = 1 and (Zi1, ..., ZiR) ~ Mult(1; π1, ..., πR).

E-Step: the indicator latent variables fulfil the following convenient identity: ∏_{r=1}^{R} ar^{Zir} = ∑_{r=1}^{R} ar Zir for any ar ≠ 0.

ℓc(Ω | {yij}, {Zir}) = ∑_{i=1}^{n} ∑_{r=1}^{R} Zir log(πr) + ∑_{i=1}^{n} ∑_{j=1}^{m} ∑_{k=1}^{q} ∑_{r=1}^{R} Zir I(yij = k) log(θrjk),

where Ω are the parameters, θrjk = P[yij = k | i ∈ r], and Ẑir = E[Zir | {yij}].

Applying Bayes' rule at iteration t:

Ẑir^(t) = E[Zir | {yij}] = P[Zir = 1 | {yij}]
        = P[{yij} | Zir = 1] P[Zir = 1] / ∑_{ℓ=1}^{R} P[{yij} | Ziℓ = 1] P[Ziℓ = 1]
        = πr^(t−1) ∏_{j=1}^{m} ∏_{k=1}^{q} (θrjk^(t−1))^{I(yij = k)} / ∑_{l=1}^{R} πl^(t−1) ∏_{j=1}^{m} ∏_{k=1}^{q} (θljk^(t−1))^{I(yij = k)}.

1. Finite Mixtures with Stereotype Model

M-Step: two separate parts, πr and the remaining parameters.

1. MLE for πr:

   πr^(t) = (1/n) ∑_{i=1}^{n} E[Zir | {yij}, Ω^(t−1)] = (1/n) ∑_{i=1}^{n} Ẑir^(t),  r = 1, ..., R.

2. Remaining parameters Ω: numerically maximize the conditional expectation of the complete-data log-likelihood ℓc:

   Ω̂ = argmax_Ω { ∑_{i=1}^{n} ∑_{j=1}^{m} ∑_{k=1}^{q} ∑_{r=1}^{R} Ẑir I(yij = k) log(θrjk) }.

We repeat the two-step iteration of the EM algorithm until convergence.
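Putting the two steps together, a compact EM sketch for row clustering (an illustration of the scheme above, not the clustord implementation); theta[r, j, k] holds P[yij = k | i ∈ r] at the current parameter values, and the model-specific numerical M-step is delegated to a user-supplied function mstep_fun.

```r
em_row_cluster <- function(Y, R, theta, pi, mstep_fun, n_iter = 100) {
  n <- nrow(Y); m <- ncol(Y)
  for (t in seq_len(n_iter)) {
    ## E-step: posterior memberships z[i, r] by Bayes' rule, on the log scale
    logz <- matrix(log(pi), n, R, byrow = TRUE)
    for (i in seq_len(n)) for (r in seq_len(R))
      logz[i, r] <- logz[i, r] + sum(log(theta[cbind(r, seq_len(m), Y[i, ])]))
    z <- exp(logz - apply(logz, 1, max))   # stabilise before normalising
    z <- z / rowSums(z)
    ## M-step: closed form for pi; numerical maximisation for the rest
    pi    <- colMeans(z)
    theta <- mstep_fun(Y, z)               # returns updated theta[r, j, k]
  }
  list(z = z, pi = pi, theta = theta)
}
```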

4. Developing RJMCMC. Priors

Table: RJMCMC Priors and Hyperparameters

Parameter | Prior distribution        | Hyperparameters
σ²µ       | InverseGamma(νσµ, δσµ)    | νσµ = 3, δσµ = 40
µk        | N(0, σ²µ)                 |
φk        | Dirichlet(λφ)             | λφ = 1
σ²α       | InverseGamma(νσα, δσα)    | νσα = 3, δσα = 40
αr        | DegenNormal(R; 0, σ²α)    |
σ²β       | InverseGamma(νσβ, δσβ)    | νσβ = 3, δσβ = 40
βj        | DegenNormal(m; 0, σ²β)    |
γrj       | DegenNormal(R, m; 0, σ²γ) | σ²γ = 5
πr        | Dirichlet(λπ)             | λπ = 1

1. Model-based clustering. Biclustering

- General formulation of model-based clustering (biclustering):

  log( P[yij = k | i ∈ r, j ∈ c] / P[yij = 1 | i ∈ r, j ∈ c] ) = µk + φk(αr + βc),  k = 2, ..., q

- Probability of the data response yrc being equal to category k:

  P[yrc = k] = exp(µk + φk(αr + βc)) / ∑_{ℓ=1}^{q} exp(µℓ + φℓ(αr + βc)).
