Data Mining through Linear Modeling

Factors associated with the clinical characteristic “Dominance” published in “Psychosocial Treatments for Cocaine Dependence” [Arch Gen Psychiatry 56: June 1999] Tim Hare STA531 Fall 2009

Page 1: Data Mining through Linear Modeling

Factors associated with the clinical characteristic “Dominance” published in “Psychosocial Treatments for Cocaine Dependence” [Arch Gen Psychiatry 56: June 1999]

Tim Hare

STA531 Fall 2009

Page 2: Data Mining through Linear Modeling

Cross-sectional (Month = 6) modeling of DOMI (Dominance) as a primary outcome

• The outcome measure under consideration is the clinical attribute/behavior “dominance”.

• Preliminary hypothesis directed the search somewhat: Is Dominance correlated with depression scores and psychological characteristics scores?

• A focus on scores for personality traits as well as assessed alcohol use yielded a subset of candidate predictors with strong correlation.

– VIND = vindictive
– ALC_SUB = ASI Alcohol composite
– INTR = intrusive
– COLD

• Exploratory analysis of these to develop the hypothesis?

Page 3: Data Mining through Linear Modeling

Correlation data for raw score VIND vs. DOMI

Pearson Correlation Coefficients, N = 456
Prob > |r| under H0: Rho=0

        DOMI      VIND
DOMI    1.00000   0.80637
                  <.0001
VIND    0.80637   1.00000
        <.0001

Binned by score, the trend in the 95% CIs suggests that a categorical variable based on binning may be significant.
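The "Prob > |r|" entry tests H0: Rho=0 with a t statistic on N - 2 degrees of freedom. A minimal Python sketch of that statistic for the reported r and N follows; the original analysis was run in SAS, and the function name `pearson_t` is illustrative only.

```python
import math

def pearson_t(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Values reported on the slide: r = 0.80637, N = 456
t_vind = pearson_t(0.80637, 456)
print(f"t = {t_vind:.2f} on {456 - 2} df")
```

A t statistic this large on 454 df is consistent with the reported p-value of <.0001.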

Page 4: Data Mining through Linear Modeling

Correlation data for raw score INTR vs DOMI

Pearson Correlation Coefficients, N = 456
Prob > |r| under H0: Rho=0

        DOMI      INTR
DOMI    1.00000   0.79210
                  <.0001
INTR    0.79210   1.00000
        <.0001

Again, trend in the 95% CI’s

Page 5: Data Mining through Linear Modeling

Correlation data for raw score COLD vs DOMI

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

        DOMI      COLD
DOMI    1.00000   0.73152
                  <.0001
        456       456
COLD    0.73152   1.00000
        <.0001
        456       457

Again, trend in the 95% CI’s

Page 6: Data Mining through Linear Modeling

Cross-sectional (Month = 6) modeling of DOMI (Dominance) as a primary outcome

• The outcome measure under consideration is the clinical attribute/behavior “dominance”.

• A number of potential factors were evaluated; however, a focus on related psychological scores as well as assessed alcohol use yielded a subset of relatively strong predictors.

– VIND = vindictive
– ALC_SUB = ASI Alcohol composite
– INTR = intrusive
– COLD

• Based upon the preliminary exploration, the above patient scores were transformed to create a new set of 2-LEVEL categorical variables, coded according to whether each score was above or below its mean (e.g. “high” vs. “low”).

– VINDGROUP
– ALC_SUBGROUP
– INTRGROUP
– COLDGROUP

• Additional crossed 2- and 3-way group variables were created based upon the ABOVE GROUP variables (e.g. combinations of “high” and “low” levels).

– VA = vindictive + ASI Alcohol composite
– VI = vindictive + intrusive
– VAI = vindictive + ASI Alcohol composite + intrusive

• Before we get started, what about any MODELING ASSUMPTIONS?
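The mean-split coding and the crossed group variables described on this slide can be sketched as follows. This is a Python illustration under assumptions, not the SAS code used for the analysis: the helper names `mean_split` and `cross` are hypothetical, the scores are made up, and a score exactly equal to the mean is assigned to “low”.

```python
from statistics import mean

def mean_split(scores, low="low", high="high"):
    """Code each score as high or low relative to the sample mean.

    Assumption: a score exactly at the mean is coded as low.
    """
    m = mean(scores)
    return [high if s > m else low for s in scores]

def cross(*groups):
    """Combine per-variable levels into one crossed label per patient."""
    return ["-".join(levels) for levels in zip(*groups)]

# Hypothetical raw scores for five patients (not the study data)
vind = [1.0, 6.0, 2.0, 9.0, 3.0]
intr = [2.0, 8.0, 7.0, 1.0, 3.0]

vindgroup = mean_split(vind)
intrgroup = mean_split(intr)
vi = cross(vindgroup, intrgroup)  # analogous to the VI crossed variable
print(vi)
```

A 3-way variable such as VAI would follow the same pattern by passing three group lists to `cross`.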

Page 7: Data Mining through Linear Modeling

OUTCOME MEASURE = DOMI Raw score NORMAL PROBABILITY PLOT

Pearson Correlation Coefficients, N = 456
Prob > |r| under H0: Rho=0

                                  DOMI      zdomi
DOMI                              1.00000   0.94328
                                            <.0001
zdomi (Rank for Variable DOMI)    0.94328   1.00000
                                  <.0001
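The correlation between DOMI and zdomi summarizes how straight the normal probability plot is: values near 1 indicate an approximately normal outcome. The Python sketch below shows one way such a coefficient can be computed. It assumes zdomi holds rank-based normal scores from Blom's formula; the slide does not say which scoring formula was used, and the data here are hypothetical.

```python
from statistics import NormalDist, mean
import math

def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def blom_scores(x):
    """Normal scores from ranks via Blom's formula (an assumption here)."""
    n = len(x)
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.375) / (n + 0.25)) for r in average_ranks(x)]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical right-skewed scores (not the study data)
domi = [0, 0, 1, 1, 2, 3, 4, 6, 9, 14]
r_pp = pearson_r(domi, blom_scores(domi))
print(f"probability-plot correlation = {r_pp:.3f}")
```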

Page 8: Data Mining through Linear Modeling

Potential 2-way & 3-way CROSS FACTOR GROUP VIEWS

VINDGROUP-INTRGROUP

VINDGROUP-ALC_SUBGROUP

VINDGROUP-ALC_SUBGROUP-INTRGROUP

1) Evidence for non-homogeneous variance.

2) As well, groups are unbalanced (dissimilar counts).

USE PROC MIXED
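Both issues noted here, unequal variances and unequal counts across the crossed groups, can be checked with a simple per-group summary. The Python sketch below uses hypothetical data and a hypothetical helper `group_summary`; it illustrates the check only and is not the procedure used in the original analysis.

```python
from collections import defaultdict
from statistics import variance

def group_summary(labels, scores):
    """Return {group: (n, sample variance)} for groups with at least 2 scores."""
    cells = defaultdict(list)
    for g, s in zip(labels, scores):
        cells[g].append(s)
    return {g: (len(v), variance(v)) for g, v in cells.items() if len(v) > 1}

# Hypothetical crossed-group labels and scores (not the study data)
labels = ["low-low"] * 6 + ["high-high"] * 3
scores = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 4.0, 9.0, 14.0]

summary = group_summary(labels, scores)
variances = [v for _, v in summary.values()]
var_ratio = max(variances) / min(variances)  # large ratio hints at heterogeneity
print(summary, var_ratio)
```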

Page 9: Data Mining through Linear Modeling

Any hypothesis suggest itself?

• Let’s take a look at the new CATEGORICAL VARIABLES we created, graphically…

• Let’s also take a look at the new observational CROSS FACTOR data groups...

• Let’s also look for any confounding 2-way or 3-way INTERACTIONS between the variables…

• Will a story emerge?

Page 10: Data Mining through Linear Modeling

NEW CATEGORICAL VARIABLES (NOT RAW DATA): MEAN DOMI SCORE by LEVEL

ALC_SUBGROUP

INTRGROUP

VINDGROUP

Some good evidence that our 2-LEVEL categorization correlates with our OUTCOME measure.

What about combinations of the above CATEGORIES by LEVEL?

Page 11: Data Mining through Linear Modeling

2-way & 3-way CROSS FACTOR GROUPS: MEAN DOMI SCORES

(VIND*INTR)

(VIND*ALC_SUB)

VIND*ALC_SUB*INTR

Good story to explore…

What about “interaction”???

Page 12: Data Mining through Linear Modeling

Possible interaction between VINDGROUP and ALC_SUBGROUP

Page 13: Data Mining through Linear Modeling

Possible interaction between INTRGROUP and ALC_SUBGROUP

Page 14: Data Mining through Linear Modeling

Possible interaction between INTRGROUP and VINDGROUP

Page 15: Data Mining through Linear Modeling

Graphical analysis to examine potential 3-WAY interaction VINDGROUP*INTRGROUP*ALC_SUBGROUP (LOW/HIGH)

ALC_SUBGROUP = LOW

ALC_SUBGROUP = HIGH
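What the two panels compare can be expressed numerically as a difference of differences. The Python sketch below uses the Month 6 LS-means reported in this deck's LSMEANS table, with level 1_Q12 written as "L" and 2_Q34 as "H". It is descriptive arithmetic on cell means only, not a significance test, and the function name `vind_by_intr` is illustrative.

```python
# Month 6 LS-means from the deck, keyed by (VIND, ALC_SUB, INTR) level
lsmeans = {
    ("L", "L", "L"): 1.2440, ("L", "L", "H"): 2.5681,
    ("L", "H", "L"): 1.9119, ("L", "H", "H"): 5.5909,
    ("H", "L", "L"): 3.5733, ("H", "L", "H"): 11.0133,
    ("H", "H", "L"): 4.4362, ("H", "H", "H"): 9.7837,
}

def vind_by_intr(alc):
    """VIND*INTR interaction (difference of differences) at one ALC_SUB level."""
    intr_effect_high_vind = lsmeans[("H", alc, "H")] - lsmeans[("H", alc, "L")]
    intr_effect_low_vind = lsmeans[("L", alc, "H")] - lsmeans[("L", alc, "L")]
    return intr_effect_high_vind - intr_effect_low_vind

two_way_alc_low = vind_by_intr("L")    # ALC_SUBGROUP = LOW panel
two_way_alc_high = vind_by_intr("H")   # ALC_SUBGROUP = HIGH panel

# The 3-way interaction asks whether the 2-way pattern differs between panels
three_way = two_way_alc_high - two_way_alc_low
print(two_way_alc_low, two_way_alc_high, three_way)
```

A 2-way difference that changes markedly between the LOW and HIGH panels is what a 3-way interaction looks like in the cell means.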

Page 16: Data Mining through Linear Modeling

PREVIEW for longitudinal modeling: the correlation noted in the CROSS SECTIONAL (Month=6) data seems to persist across time…

Therefore we suspect that “repeated measures” (longitudinal modeling) may model well from the same variables.

Page 17: Data Mining through Linear Modeling

Exploration leads to Hypothesis

• Incidence of the clinical characteristic “Dominance” can likely be explained by modeling with the categorical variables (VINDGROUP, INTRGROUP, ALC_SUBGROUP, COLDGROUP) derived from the raw scores.

• Interaction terms (VINDGROUP*ALC_SUBGROUP, VINDGROUP*INTRGROUP, INTRGROUP*ALC_SUBGROUP) will likely play a role in the modeling process given the preliminary 2-/3-way interaction plots (COLDGROUP*<other> not compelling, data not shown).

• Finally, there are some compelling graphs that suggest that interaction can be explained and may be significant in many cases (we’ll explore these further with contrasts later).

Page 18: Data Mining through Linear Modeling

Cross sectional (Month=6) modeling results confirm our suspicions

PROC MIXED (α=0.05)

Type 3 Tests of Fixed Effects

Effect                              Num DF   Den DF   F Value   Pr > F
COLDGROUP                           1        447      13.79     0.0002
VINDGROUP                           1        447      136.66    <.0001
ALC_SUBGROUP                        1        447      6.13      0.0136
INTRGROUP                           1        447      163.16    <.0001
VINDGROUP*INTRGROUP                 1        447      33.72     <.0001
VINDGROUP*ALC_SUBGROUP              1        447      9.11      0.0027
VINDGROUP*ALC_SUBGROUP*INTRGROUP    2        447      5.50      0.0044

2X/3X

Page 19: Data Mining through Linear Modeling

CROSS SECTIONAL(Month=6) DOMI SCORE

LSMEANS for 3-way crossed observational groups

Least Squares Means

Effect                VINDGROUP  ALCSUBGROUP   INTRGROUP  Estimate  Std Error  DF   t Value
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  1_Q12INTR  1.2440    0.2747     447  4.53
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  2_Q34INTR  2.5681    0.6553     447  3.92
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  1_Q12INTR  1.9119    0.3445     447  5.55
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  2_Q34INTR  5.5909    0.6210     447  9.00
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  1_Q12INTR  3.5733    0.4871     447  7.34
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  2_Q34INTR  11.0133   0.3890     447  28.31
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  1_Q12INTR  4.4362    0.5418     447  8.19
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  2_Q34INTR  9.7837    0.4263     447  22.95

Effect                VINDGROUP  ALCSUBGROUP   INTRGROUP  Pr > |t|  Alpha  Lower    Upper
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  1_Q12INTR  <.0001    0.05   0.7042   1.7838
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  2_Q34INTR  0.0001    0.05   1.2802   3.8559
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  1_Q12INTR  <.0001    0.05   1.2349   2.5889
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  2_Q34INTR  <.0001    0.05   4.3705   6.8114
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  1_Q12INTR  <.0001    0.05   2.6160   4.5306
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  2_Q34INTR  <.0001    0.05   10.2489  11.7777
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  1_Q12INTR  <.0001    0.05   3.3714   5.5009
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  2_Q34INTR  <.0001    0.05   8.9460   10.6215
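The columns of this table are tied together by two relations: t Value = Estimate / Standard Error, and the limits are Estimate ± (critical t) × Standard Error. The short Python check below recomputes both from the reported numbers; the variable names are illustrative. With 447 DF and Alpha = 0.05 the critical value should be close to 1.965.

```python
# (estimate, standard error, lower, upper) as reported on the slide, DF = 447
rows = [
    (1.2440, 0.2747, 0.7042, 1.7838),
    (2.5681, 0.6553, 1.2802, 3.8559),
    (1.9119, 0.3445, 1.2349, 2.5889),
    (5.5909, 0.6210, 4.3705, 6.8114),
    (3.5733, 0.4871, 2.6160, 4.5306),
    (11.0133, 0.3890, 10.2489, 11.7777),
    (4.4362, 0.5418, 3.3714, 5.5009),
    (9.7837, 0.4263, 8.9460, 10.6215),
]

# t Value column: estimate divided by its standard error
t_values = [round(est / se, 2) for est, se, _, _ in rows]

# Critical value implied by each interval: half-width divided by standard error
implied_crit = [(hi - lo) / (2 * se) for _, se, lo, hi in rows]

print(t_values)
print([round(c, 3) for c in implied_crit])
```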

Page 20: Data Mining through Linear Modeling

If you’re Vindictive=Low, Intrusive=Low… you probably don’t have to worry about being overly Dominant after drinking…

Estimates

Label   Estimate   Standard Error   DF    t Value   Pr > |t|
LLL     1.2440     0.2747           447   4.53      <.0001
LHL     1.9119     0.3445           447   5.55      <.0001

Contrasts

Label     Num DF   Den DF   F Value   Pr > F
LHL-LLL   1        447      2.82      0.0938

At the 95% confidence level: not significant

Page 21: Data Mining through Linear Modeling

Does intrusiveness trump alcohol in low vindictives?

Estimates

Label          Estimate   Standard Error   DF    t Value   Pr > |t|
LLL            1.24       0.27             447   4.53      <.0001
LHL            1.91       0.34             447   5.55      <.0001
LHH            5.59       0.62             447   9.00      <.0001
LLH            2.57       0.66             447   3.92      0.0001
AVG(LLL,LHL)   1.58       0.24             447   6.58      <.0001
AVG(LHH,LLH)   4.08       0.45             447   9.04      <.0001

Contrasts

Label                       Num DF   Den DF   F Value   Pr > F
LHL-LLL                     1        447      2.82      0.093
LHH-LLH                     1        447      11.21     0.0009
AVG(LHH,LLH)-AVG(LLL,LHL)   1        447      24.87     <.0001

VL AL/H IL

VL AL/H IH
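The AVG rows are simple averages of the cell LS-means, and for a 1-DF contrast F equals t squared, so the reported F implies a standard error for the contrast. The Python sketch below recomputes these quantities from the 4-decimal Month 6 LS-means; the variable names are illustrative, and the implied standard error is a back-calculation, not a value printed on the slide.

```python
import math

# Month 6 LS-means for the low-vindictive cells, labeled (VIND, ALC_SUB, INTR)
lll, lhl, lhh, llh = 1.2440, 1.9119, 5.5909, 2.5681

avg_intr_low = (lll + lhl) / 2    # AVG(LLL,LHL): intrusive = low
avg_intr_high = (lhh + llh) / 2   # AVG(LHH,LLH): intrusive = high
difference = avg_intr_high - avg_intr_low

# For a 1-DF contrast, F = t**2, so |difference| / sqrt(F) is the implied SE
f_reported = 24.87
implied_se = abs(difference) / math.sqrt(f_reported)

print(round(avg_intr_low, 2), round(avg_intr_high, 2))
print(round(difference, 4), round(implied_se, 4))
```

The averages round to the 1.58 and 4.08 shown in the Estimates table.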

Page 22: Data Mining through Linear Modeling

CROSS SECTIONAL MODEL: (Month=6) Good fit? Residuals plots…

Pearson Correlation Coefficients, N = 456
Prob > |r| under H0: Rho=0

                                    Resid     zresid
Resid (Residual)                    1.00000   0.97126
                                              <.0001
zresid (Rank for Variable Resid)    0.97126   1.00000
                                    <.0001

Page 23: Data Mining through Linear Modeling

Repeated Measures Analysis of the entire Month1-Month6 data set

• The original cross sectional model was evaluated under several COV / VAR structures (to adjust for possible correlation in the longitudinal data), which were compared:

– CS, UN, AR(1), TOEP, CSH, ARH(1)

• COV / VAR type “UN” was retained because it gave the smallest -2 Res Log Likelihood, significantly smaller than all the rest after adjusting for DF.
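“Adjusting for DF” matters because the candidate structures differ widely in how many covariance parameters they estimate. The Python sketch below counts those parameters for t repeated measures and shows AIC as one common way to penalize the -2 Res Log Likelihood. The slide does not say which adjustment was actually used, t = 6 assumes one measurement per month, and the function names are illustrative.

```python
def cov_param_count(structure: str, t: int) -> int:
    """Number of covariance parameters for t repeated measures."""
    counts = {
        "CS": 2,                  # common variance + common covariance
        "AR(1)": 2,               # variance + autocorrelation
        "TOEP": t,                # one parameter per lag, including lag 0
        "CSH": t + 1,             # t variances + one correlation
        "ARH(1)": t + 1,          # t variances + one autocorrelation
        "UN": t * (t + 1) // 2,   # every variance and covariance
    }
    return counts[structure]

def aic(neg2_res_loglik: float, n_params: int) -> float:
    """Smaller-is-better AIC: -2 Res Log Likelihood + 2 * parameters."""
    return neg2_res_loglik + 2 * n_params

# Assumption: six monthly measurements (Month 1 to Month 6)
for s in ["CS", "UN", "AR(1)", "TOEP", "CSH", "ARH(1)"]:
    print(s, cov_param_count(s, 6))
```

UN is the most flexible structure and carries the largest parameter count, so a fair comparison has to weigh its smaller -2 Res Log Likelihood against that extra cost.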

Type 3 Tests of Fixed Effects

Effect                  Num DF   Den DF   F Value   Pr > F
COLDGROUP               1        102      22.20     <.0001
VINDGROUP               1        102      132.23    <.0001
ALC_SUBGROUP            1        102      0.06      0.8081
INTRGROUP               1        102      125.09    <.0001
VINDGROUP*INTRGROUP     1        102      39.29     <.0001
VINDGROUP*ALC_SUBGRO    1        102      6.93      0.0098
VINDGR*ALC_SU*INTRGR    2        102      7.16      0.0012

ALC_SUBGROUP was retained because it participates in significant higher-order terms.

Page 24: Data Mining through Linear Modeling

LONGITUDINAL MODEL: Good fit? Residuals plots…

Pearson Correlation Coefficients, N = 456
Prob > |r| under H0: Rho=0

                                    Resid     zresid
Resid (Residual)                    1.00000   0.95069
                                              <.0001
zresid (Rank for Variable Resid)    0.95069   1.00000
                                    <.0001

Page 25: Data Mining through Linear Modeling

Repeated Measures (COV=UN, no terms removed)

Least Squares Means

Effect                VINDGROUP  ALC_SUBGROUP  INTRGROUP  Estimate  Std Error  DF   t Value
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  1_Q12INTR  1.6757    0.2996     102  5.59
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  2_Q34INTR  2.9805    0.5583     102  5.34
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  1_Q12INTR  2.0854    0.3452     102  6.04
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  2_Q34INTR  4.1859    0.5689     102  7.36
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  1_Q12INTR  3.5756    0.4301     102  8.31
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  2_Q34INTR  10.0167   0.3986     102  25.13
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  1_Q12INTR  4.2246    0.5035     102  8.39
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  2_Q34INTR  8.0591    0.4076     102  19.77

Effect                VINDGROUP  ALC_SUBGROUP  INTRGROUP  Pr > |t|  Alpha  Lower   Upper
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  1_Q12INTR  <.0001    0.05   1.0815  2.2699
VINDGR*ALC_SU*INTRGR  1_Q12VIND  1_Q12ALC_SUB  2_Q34INTR  <.0001    0.05   1.8730  4.0879
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  1_Q12INTR  <.0001    0.05   1.4006  2.7702
VINDGR*ALC_SU*INTRGR  1_Q12VIND  2_Q34ALC_SUB  2_Q34INTR  <.0001    0.05   3.0575  5.3143
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  1_Q12INTR  <.0001    0.05   2.7224  4.4288
VINDGR*ALC_SU*INTRGR  2_Q34VIND  1_Q12ALC_SUB  2_Q34INTR  <.0001    0.05   9.2261  10.8073
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  1_Q12INTR  <.0001    0.05   3.2259  5.2234
VINDGR*ALC_SU*INTRGR  2_Q34VIND  2_Q34ALC_SUB  2_Q34INTR  <.0001    0.05   7.2506  8.8676

Estimates

Label   Estimate   Standard Error   DF    t Value   Pr > |t|
LLL     1.6757     0.2996           102   5.59      <.0001
LHL     2.0854     0.3452           102   6.04      <.0001

Contrasts

Label     Num DF   Den DF   F Value   Pr > F
LHL-LLL   1        102      1.57      0.2124

At the 95% confidence level: not significant

Page 26: Data Mining through Linear Modeling

What about our cross sectional contrast of alcohol being trumped by intrusiveness in low vindictives? Does it still hold in the longitudinal analysis?

Estimates

Label          Estimate   Standard Error   DF    t Value   Pr > |t|
LLL            1.6757     0.2996           102   5.59      <.0001
LHL            2.0854     0.3452           102   6.04      <.0001
LHH            4.1859     0.5689           102   7.36      <.0001
LLH            2.9805     0.5583           102   5.34      <.0001
AVG(LLL,LHL)   1.8806     0.2790           102   6.74      <.0001
AVG(LHH,LLH)   3.5832     0.4247           102   8.44      <.0001

Contrasts

Label                       Num DF   Den DF   F Value   Pr > F
LHL-LLL                     1        102      1.57      0.2124
LHH-LLH                     1        102      2.65      0.1069
AVG(LHH,LLH)-AVG(LLL,LHL)   1        102      15.41     0.0002

Still significant at the 95% confidence level

Page 27: Data Mining through Linear Modeling

Conclusions

• The clinical attribute “dominance” is interrelated with the attributes “intrusive”, “vindictive”, and “cold”; together with the ASI Alcohol Composite score, these combine to model the incidence of dominance in the clinical data.

• Both longitudinal modeling (using the entire data set) and cross sectional modeling (Month=6) support the same conclusions.

• We can profitably use CONTRASTS to validate the BAR CHARTS, which show possible differences among the combinations of the 3 key variables (VINDGROUP, INTRGROUP, and ALC_SUBGROUP) that correlate with DOMINANCE but have complex INTERACTIONS.

Page 28: Data Mining through Linear Modeling

Q&A