Missing Data in Research Studies
Joseph A. Olsen
What do I do about missing data?
Introduction
• What is certain in life?
  – Death
  – Taxes
• What is certain in research?
  – Measurement error
  – Missing data
• Missing data can be:
  – Due to preventable errors, mistakes, or lack of foresight by the researcher
  – Due to problems outside the control of the researcher
  – Deliberate, intended, or planned by the researcher to reduce cost or respondent burden
  – Due to differential applicability of some items to subsets of respondents
  – Etc.
Some Characteristics of Missing Data
• Facets of missing data
  – Persons
  – Variables
  – Occasions
• Type of non-response
  – Unit non-response
  – Block non-response
  – Wave non-response
  – Item non-response
• Special non-response problems in longitudinal and clustered data
  – Attrition/drop-out
  – Group (e.g. family) member non-response
Missing Data Mechanisms (1)
• Preliminaries:
  – Yobs: The non-missing or observed data
  – Ymiss: The missing or unobserved data
  – M: Whether the data on a given item for a given case is missing (1) or not (0)
• Missing Completely at Random (MCAR)
  – The probability that an item is missing (M) is unrelated to either the observed (Yobs) or the unobserved (Ymiss) data
• Missing at Random (MAR)
  – The probability that an item is missing (M) may be related to the observed data (Yobs) but is unrelated to the unobserved data (Ymiss)
• Missing Not at Random (MNAR)
  – The probability that an item is missing (M) is related to the (unknown) value of the unobserved data (Ymiss), even after conditioning on the observed data (Yobs)
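The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch, not part of the original slides: the variable names (x, y), the missingness probabilities (0.3, 0.6, 0.1), and the helper functions are all illustrative assumptions. Here x is always observed, y is subject to missingness, and m plays the role of M.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return a list of (x, y, m) rows; m = 1 means y is missing for that case."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)          # always observed
        y = x + rng.gauss(0, 1)      # correlated with x, subject to missingness
        if mechanism == "MCAR":
            p_missing = 0.3                       # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1     # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1     # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        m = 1 if rng.random() < p_missing else 0
        rows.append((x, y, m))
    return rows

def bias_of_observed_mean(rows, keep=lambda x: True):
    """Mean of observed y minus mean of all y, among rows whose x passes `keep`."""
    all_y = [y for x, y, m in rows if keep(x)]
    obs_y = [y for x, y, m in rows if keep(x) and m == 0]
    return statistics.mean(obs_y) - statistics.mean(all_y)

for mech in ("MCAR", "MAR", "MNAR"):
    rows = simulate(mech)
    overall = bias_of_observed_mean(rows)
    within = bias_of_observed_mean(rows, keep=lambda x: x > 0)
    print(f"{mech}: overall bias {overall:+.3f}, bias within x > 0 {within:+.3f}")
```

In this setup the observed-case mean of y should be close to the full-sample mean under MCAR. Under MAR it is biased overall, but the bias should largely vanish once cases are compared within a level of the observed variable (here, x > 0). Under MNAR the bias persists even within that stratum, which is what "even after conditioning on the observed data" means.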
Missing Data Mechanisms (2)
• The appropriateness of different missing data treatments depends (among other things) on the underlying missing data mechanism
• “Real” missing data can seldom be classified into just one of the three (MCAR, MAR, MNAR)
• Because we don’t have access to the missing data (Ymiss), we cannot empirically test whether or not the data is MNAR
• If we know (or can convincingly argue) that the data is not MNAR, a test of whether the data is MCAR is available (e.g. in SPSS Missing Values Analysis)
Missing Data in Research Studies
• Missing data mechanism
  – Missing completely at random (MCAR): ignorable
  – Missing at random (MAR): conditionally ignorable
  – Missing not at random (MNAR): nonignorable
• Amount of missing data
  – Percent of cases with missing data
  – Percent of variables having missing data
  – Percent of data values that are missing
• Pattern of missing data
  – Missing by design
  – Missing data patterns
    – Univariate
    – Monotonic
    – File matching
    – General
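The three "amount of missing data" summaries listed above can be computed directly from a case-by-variable table. This is a minimal sketch under stated assumptions: the function name, the dictionary keys, the toy dataset, and the use of None to mark a missing value are all hypothetical and not taken from the slides.

```python
def missingness_summary(rows):
    """rows: list of dicts (variable name -> value); None marks a missing value."""
    variables = list(rows[0].keys())
    n_cases, n_vars = len(rows), len(variables)
    # Cases (rows) that have at least one missing value
    cases_with_missing = sum(any(r[v] is None for v in variables) for r in rows)
    # Variables (columns) that have at least one missing value
    vars_with_missing = sum(any(r[v] is None for r in rows) for v in variables)
    # Individual cells that are missing
    missing_values = sum(r[v] is None for r in rows for v in variables)
    return {
        "pct_cases_with_missing": 100 * cases_with_missing / n_cases,
        "pct_vars_with_missing": 100 * vars_with_missing / n_vars,
        "pct_values_missing": 100 * missing_values / (n_cases * n_vars),
    }

# Toy data: 4 cases, 3 variables, 3 missing cells
data = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": 41, "income": None, "score": 5},
    {"age": None, "income": None, "score": 6},
    {"age": 29, "income": 48000, "score": 8},
]
summary = missingness_summary(data)
print(summary)
```

The three figures can differ widely for the same dataset: in the toy data only a quarter of the values are missing, yet half of the cases would be lost to listwise deletion.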
Goals of a Missing Data Treatment
• Preserve the essential characteristics of the data
  – Distributions of the variables
  – Relationships among the variables
• Maintain the representativeness of the analyzed data
• Provide valid statistical inference (control Type I error)
• Maximize the statistical power of the study and its statistical analyses (minimize Type II error)
• Avoid bias and instability in the parameter estimates and standard errors for statistical models
Older Missing Data Treatments (1)
• Deletion methods
  – Listwise deletion (complete case analysis)
  – Pairwise deletion (available case analysis)
• Cold deck imputation
  – Deterministic, logical, or rule-based imputation
  – Treat missing data for nominal predictors as an additional category
• Hot deck (donor case) imputation
  – Cluster based methods
  – Distance based (e.g. nearest neighbor) methods
• Mean substitution
  – (Variable) mean substitution
  – Mean substitution with added random error
  – Predictor mean substitution with missing data dichotomy
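To show how three of the treatments above differ in practice, here is a minimal sketch of listwise deletion, pairwise deletion, and (variable) mean substitution. The toy data, the function names, and the use of None for missing values are illustrative assumptions, not taken from the slides.

```python
import statistics

# Toy data: 5 cases on 3 variables; None marks a missing value
data = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 5.0),
    (None, 10.0, 7.0),
]

def listwise(rows):
    """Keep only cases with no missing values (complete case analysis)."""
    return [r for r in rows if all(v is not None for v in r)]

def pairwise(rows, i, j):
    """Keep cases observed on both variables i and j (available case analysis)."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]

def mean_substitute(rows, j):
    """Replace missing values of variable j with that variable's observed mean."""
    observed = [r[j] for r in rows if r[j] is not None]
    fill = statistics.mean(observed)
    return [fill if r[j] is None else r[j] for r in rows]

complete_cases = listwise(data)
pair_01 = pairwise(data, 0, 1)
observed_var1 = [r[1] for r in data if r[1] is not None]
filled_var1 = mean_substitute(data, 1)

print(len(complete_cases), "complete cases;", len(pair_01), "cases for variables 0 and 1")
print("variance before:", statistics.variance(observed_var1),
      "after mean substitution:", statistics.variance(filled_var1))
```

Pairwise deletion retains more cases per statistic than listwise deletion, but each statistic then rests on a different subsample. Mean substitution keeps every case yet shrinks the variable's variance, since the filled-in values sit exactly at the mean.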
Older Missing Data Treatments (2)
• Regression imputation
  – Regression predicted value imputation
  – Regression imputation with added random error
• Special methods for longitudinal studies and randomized controlled trials
  – Endpoint only analysis
  – Last observation carried forward (LOCF)
  – Intent to treat worst (best) case imputation
  – Summary growth parameters
• Special methods for multi-item scales
  – Available item method of scale construction
  – Person mean imputation
  – Two-way imputation
  – Two-way imputation with added random error
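Two of the special-purpose methods above are simple enough to sketch in a few lines: last observation carried forward (LOCF) for repeated measures, and person mean imputation for multi-item scales. The function names and the None-for-missing convention are hypothetical choices for illustration.

```python
def locf(series):
    """Last observation carried forward: fill each gap with the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)   # remains None until a first value has been observed
    return filled

def person_mean_impute(items):
    """Fill a respondent's missing scale items with the mean of that respondent's observed items."""
    observed = [v for v in items if v is not None]
    if not observed:
        return list(items)    # no observed items to base an imputation on
    person_mean = sum(observed) / len(observed)
    return [person_mean if v is None else v for v in items]

# One participant's scores across five waves, with drop-out after wave 3
print(locf([5, None, 7, None, None]))
# One respondent's answers to a four-item scale
print(person_mean_impute([4, None, 2, 3]))
```

LOCF assumes that a participant's score would not have changed after the last observed wave, which is a strong assumption in most longitudinal settings.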
Newer Missing Data Treatments
• Modern state-of-the-art missing data treatments for MAR data
  – Maximum likelihood
  – Multiple imputation
• Cutting edge investigational missing data treatments for MNAR data
  – Pattern mixture models
  – Selection models
  – Shared parameter models
  – Inverse probability weighting
Statistical Analysis with Missing Data
• What do you get when you don’t specify what you want? What choices do you have within a given analysis procedure?
  – Often, listwise deletion is the default (and only) option (SPSS Reliability and GLM)
  – Listwise default with pairwise and mean substitution as options (SPSS Factor and Regression Analysis)
  – Pairwise default with listwise option (SPSS Correlation)
• Modeling approaches that incorporate missing data handling
  – Survival models
  – Mixed effects models
  – Structural equation models
• Missing data treatments carried out prior to analysis
  – Ad hoc methods (listwise, pairwise, single imputation, etc.)
  – Modern methods (maximum likelihood, multiple imputation)
Modern Missing Data Treatments
• Maximum likelihood (ML)
  – Estimates summary statistics or statistical models using all available data
  – Available in modern structural equation modeling software (Amos, EQS, Lisrel, Mplus, Mx, etc.)
  – The ML covariance matrix and mean vector can also be obtained from SPSS MVA, and used for standard Regression, Factor analysis, Reliability, and other procedures
  – There are also freeware and open source programs that can produce the ML covariance matrix and mean vector, usually by using the Expectation Maximization (EM) algorithm (e.g. EMCOV)
• Multiple imputation
  – Imputes individual data values in multiple complete datasets, averaging the results of the statistical analyses across these datasets
  – Available in the current versions of certain SEM software (Amos, Mplus)
  – Also available in SPSS (MVA), SAS (Proc MI and MIANALYZE), Stata (mi impute and mi estimate), and stand-alone missing data packages such as SOLAS
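The slides say only that multiple imputation averages results across the imputed datasets. The usual way to do this is with Rubin's combining rules, sketched below as an assumption about what is meant: the point estimate is the average across datasets, and its variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the example numbers are hypothetical.

```python
import math
import statistics

def pool(estimates, variances):
    """Combine one parameter's results from m imputed datasets using Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)           # pooled point estimate
    within = statistics.mean(variances)          # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between       # total variance
    return q_bar, math.sqrt(total)

# Hypothetical regression slope and squared standard error from m = 3 imputed datasets
estimate, std_error = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(estimate, std_error)
```

The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty about the imputed values into the final standard error.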
Why do social scientists use modern missing data treatments so infrequently?
• Lack of awareness or familiarity
• They are not convinced of the problems with older methods
• The statistical literature on missing data is technically daunting
• The techniques aren’t incorporated into the standard statistical analysis procedures used by social scientists
• Journal reviewers and editors have not required it