Missing Data in Research Studies

Joseph A. Olsen

What do I do about missing data?

Introduction

• What is certain in life?
  – Death
  – Taxes

• What is certain in research?
  – Measurement error
  – Missing data

• Missing data can be:
  – Due to preventable errors, mistakes, or lack of foresight by the researcher
  – Due to problems outside the control of the researcher
  – Deliberate, intended, or planned by the researcher to reduce cost or respondent burden
  – Due to differential applicability of some items to subsets of respondents
  – Etc.

Some Characteristics of Missing Data

• Facets of missing data
  – Persons
  – Variables
  – Occasions

• Type of non-response
  – Unit non-response
  – Block non-response
  – Wave non-response
  – Item non-response

• Special non-response problems in longitudinal and clustered data
  – Attrition/drop-out
  – Group (e.g. family) member non-response

Missing Data Mechanisms (1)

• Preliminaries
  – Yobs: The non-missing or observed data
  – Ymiss: The missing or unobserved data
  – M: Whether the data on a given item for a given case is missing (1) or not (0)

• Missing Completely at Random (MCAR)
  – The probability that an item is missing (M) is unrelated to either the observed (Yobs) or the unobserved (Ymiss) data

• Missing at Random (MAR)
  – The probability that an item is missing (M) may be related to the observed data (Yobs) but is unrelated to the unobserved data (Ymiss)

• Missing Not at Random (MNAR)
  – The probability that an item is missing (M) is related to the (unknown) value of the unobserved data (Ymiss), even after conditioning on the observed data (Yobs)
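The three definitions can be written compactly as statements about the conditional distribution of M. This is the usual textbook formalization, given here as a sketch in the notation of the slide; the subscripted symbols correspond to Yobs and Ymiss.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M \mid Y_{obs}, Y_{miss}) = P(M) \\
\text{MAR:}\quad  & P(M \mid Y_{obs}, Y_{miss}) = P(M \mid Y_{obs}) \\
\text{MNAR:}\quad & P(M \mid Y_{obs}, Y_{miss}) \neq P(M \mid Y_{obs})
\end{align*}
```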

Missing Data Mechanisms (2)

The appropriateness of different missing data treatments depends (among other things) on the underlying missing data mechanism

“Real” missing data can seldom be classified into just one of the three (MCAR, MAR, MNAR)

Because we don’t have access to the missing data (Ymiss), we cannot empirically test whether or not the data is MNAR

If we know (or can convincingly argue) that the data is not MNAR, a test of whether the data is MCAR is available (e.g. Little's MCAR test in SPSS Missing Values Analysis).
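A small simulation can make the dependence on the mechanism concrete. The sketch below is illustrative only: the variable names, the missingness probabilities (0.5, 0.8, 0.2), and the data-generating model are assumptions, not taken from the slides. It compares the true mean of y with the mean of the observed (complete) cases under each mechanism.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=1):
    """Return (true mean of y, complete-case mean of y) under one mechanism."""
    rng = random.Random(seed)
    y_all, y_seen = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)          # always observed (part of Yobs)
        y = x + rng.gauss(0, 1)      # the variable subject to missingness
        if mechanism == "MCAR":
            p_miss = 0.5                      # unrelated to x or y
        elif mechanism == "MAR":
            p_miss = 0.8 if x > 0 else 0.2    # depends only on the observed x
        elif mechanism == "MNAR":
            p_miss = 0.8 if y > 0 else 0.2    # depends on the value of y itself
        else:
            raise ValueError(mechanism)
        y_all.append(y)
        if rng.random() >= p_miss:            # M = 0: the value is observed
            y_seen.append(y)
    return statistics.fmean(y_all), statistics.fmean(y_seen)


for mech in ("MCAR", "MAR", "MNAR"):
    true_mean, cc_mean = simulate(mech)
    print(f"{mech}: true mean {true_mean:.3f}, complete-case mean {cc_mean:.3f}")
```

In this setup the complete-case mean is close to the true mean only under MCAR. Under MAR the distortion can in principle be corrected using the observed x; under MNAR it cannot be corrected from the observed data alone.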

Missing Data in Research Studies

• Missing data mechanism
  – Missing completely at random (MCAR): ignorable
  – Missing at random (MAR): conditionally ignorable
  – Missing not at random (MNAR): nonignorable

• Amount of missing data
  – Percent of cases with missing data
  – Percent of variables having missing data
  – Percent of data values that are missing

• Pattern of missing data
  – Missing by design
  – Missing data patterns: univariate, monotonic, file matching, general
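The three "amount" summaries and a tabulation of missing data patterns are straightforward to compute. The following is a minimal sketch that assumes the data are held as a list of rows with None marking a missing value; the function name and the toy dataset are hypothetical.

```python
from collections import Counter


def missingness_summary(rows):
    """rows: list of equal-length lists; None marks a missing value."""
    n_cases = len(rows)
    n_vars = len(rows[0])
    cases_with_missing = sum(any(v is None for v in row) for row in rows)
    vars_with_missing = sum(
        any(row[j] is None for row in rows) for j in range(n_vars)
    )
    values_missing = sum(v is None for row in rows for v in row)
    return {
        "pct_cases": 100.0 * cases_with_missing / n_cases,
        "pct_variables": 100.0 * vars_with_missing / n_vars,
        "pct_values": 100.0 * values_missing / (n_cases * n_vars),
    }


data = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, None, None],
    [1.5, 2.5, 3.5],
]
summary = missingness_summary(data)

# Each pattern is a tuple of 0/1 flags (1 = missing), one flag per variable.
patterns = Counter(tuple(int(v is None) for v in row) for row in data)
```

The three percentages can differ widely for the same dataset: here half the cases and two of the three variables are affected, although only a quarter of the individual values are missing.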

Goals of a Missing Data Treatment

• Preserve the essential characteristics of the data
  – Distributions of the variables
  – Relationships among the variables

• Maintain the representativeness of the analyzed data

• Provide valid statistical inference (control Type I error)

• Maximize the statistical power of the study and its statistical analyses (minimize Type II error)

• Avoid bias and instability in the parameter estimates and standard errors for statistical models

Older Missing Data Treatments (1)

• Deletion methods
  – Listwise deletion (complete case analysis)
  – Pairwise deletion (available case analysis)

• Cold deck imputation
  – Deterministic, logical, or rule-based imputation
  – Treat missing data for nominal predictors as an additional category

• Hot deck (donor case) imputation
  – Cluster based methods
  – Distance based (e.g. nearest neighbor) methods

• Mean substitution
  – (Variable) mean substitution
  – Mean substitution with added random error
  – Predictor mean substitution with missing data dichotomy
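One way to see why plain mean substitution distorts the distribution of a variable is to compare standard deviations before and after filling. The sketch below uses simulated data; the 40% MCAR missingness rate and the distribution parameters are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(7)
x = [rng.gauss(50, 10) for _ in range(2000)]

# Assumption: about 40% of values go missing completely at random (MCAR).
observed = [v if rng.random() >= 0.4 else None for v in x]

seen = [v for v in observed if v is not None]
mean_seen = statistics.fmean(seen)

# (Variable) mean substitution: every gap receives the observed mean.
filled = [mean_seen if v is None else v for v in observed]

sd_seen = statistics.stdev(seen)
sd_filled = statistics.stdev(filled)
print(f"SD of observed values: {sd_seen:.2f}")
print(f"SD after mean substitution: {sd_filled:.2f}")
```

The mean is unchanged by construction, but the filled variable has a noticeably smaller standard deviation, because every imputed value sits exactly at the center of the distribution. Adding random error to the imputed values is one way to counteract this shrinkage.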

Older Missing Data Treatments (2)

• Regression imputation
  – Regression predicted value imputation
  – Regression imputation with added random error

• Special methods for longitudinal studies and randomized controlled trials
  – Endpoint only analysis
  – Last observation carried forward (LOCF)
  – Intent to treat worst (best) case imputation
  – Summary growth parameters

• Special methods for multi-item scales
  – Available item method of scale construction
  – Person mean imputation
  – Two-way imputation
  – Two-way imputation with added random error
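Two of the methods listed above are simple enough to sketch directly: last observation carried forward for a single participant's repeated measures, and the available item method of scale scoring. Function names are hypothetical, and the min_items threshold is an assumption added for illustration rather than part of the method as listed.

```python
def locf(series):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def available_item_score(items, min_items=1):
    """Scale score as the mean of the answered items.

    Returns None when fewer than min_items items were answered.
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_items:
        return None
    return sum(answered) / len(answered)


waves = [None, 3, None, None, 5, None]
print(locf(waves))                               # one participant, six waves
print(available_item_score([4, None, 2, 3]))     # one respondent, four items
```

LOCF treats the outcome as frozen after the last observed wave, which is a strong assumption in most longitudinal designs. Scoring a scale from the available items is arithmetically the same as replacing each missing item with that person's mean on the answered items.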

Newer Missing Data Treatments

• Modern state-of-the-art missing data treatments for MAR data
  – Maximum likelihood
  – Multiple imputation

• Cutting edge investigational missing data treatments for MNAR data
  – Pattern mixture models
  – Selection models
  – Shared parameter models
  – Inverse probability weighting

Statistical Analysis with Missing Data

• What do you get when you don’t specify what you want? What choices do you have within a given analysis procedure?
  – Often, listwise deletion is the default (and only) option (SPSS Reliability and GLM)
  – Listwise default with pairwise and mean substitution as options (SPSS Factor and Regression Analysis)
  – Pairwise default with listwise option (SPSS Correlation)

• Modeling approaches that incorporate missing data handling
  – Survival models
  – Mixed effects models
  – Structural equation models

• Missing data treatments carried out prior to analysis
  – Ad hoc methods (listwise, pairwise, single imputation, etc.)
  – Modern methods (maximum likelihood, multiple imputation)

Modern Missing Data Treatments

• Maximum likelihood (ML)
  – Estimates summary statistics or statistical models using all available data
  – Available in modern structural equation modeling software (Amos, EQS, LISREL, Mplus, Mx, etc.)
  – The ML covariance matrix and mean vector can also be obtained from SPSS MVA, and used for standard Regression, Factor analysis, Reliability, and other procedures
  – There are also freeware and open source programs that can produce the ML covariance matrix and mean vector, usually by using the Expectation Maximization (EM) algorithm (e.g. EMCOV)
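To illustrate how an EM algorithm arrives at an ML mean vector and covariance matrix, here is a minimal sketch for the simplest case: two variables, x fully observed and y partly missing. It assumes bivariate normality and MAR missingness, and it is not the algorithm as implemented in EMCOV or SPSS MVA; the function name and the simulated data are hypothetical.

```python
import random


def em_bivariate(x, y, n_iter=200):
    """ML means and (co)variances when y contains missing values (None).

    Assumes (x, y) is bivariate normal, x is fully observed, and y is MAR.
    """
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((a - mu_x) ** 2 for a in x) / n
    # Starting values for the y parameters come from the complete cases.
    cc = [(a, b) for a, b in zip(x, y) if b is not None]
    m = len(cc)
    cx = sum(a for a, _ in cc) / m
    mu_y = sum(b for _, b in cc) / m
    s_yy = sum((b - mu_y) ** 2 for _, b in cc) / m
    s_xy = sum((a - cx) * (b - mu_y) for a, b in cc) / m
    for _ in range(n_iter):
        beta = s_xy / s_xx                 # slope of y on x
        resid_var = s_yy - beta * s_xy     # residual variance of y given x
        sum_y = sum_yy = sum_xy = 0.0
        for a, b in zip(x, y):
            if b is None:
                # E-step: expected y and y^2 given x and current parameters.
                ey = mu_y + beta * (a - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = b, b * b
            sum_y += ey
            sum_yy += eyy
            sum_xy += a * ey
        # M-step: update parameters from the expected sufficient statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}


# Simulated MAR data: y is more likely to be missing when x is positive.
rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(5000)]
y_full = [2 + a + rng.gauss(0, 1) for a in x]
y = [None if rng.random() < (0.8 if a > 0 else 0.2) else b
     for a, b in zip(x, y_full)]

est = em_bivariate(x, y)
cc_mean = sum(b for b in y if b is not None) / sum(b is not None for b in y)
print(f"complete-case mean of y: {cc_mean:.3f}, EM estimate: {est['mu_y']:.3f}")
```

In this simulation the true mean of y is 2. The complete-case mean falls well below it, while the EM estimate recovers it by borrowing information from the fully observed x.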

• Multiple imputation
  – Imputes individual data values in multiple complete datasets, averaging the results of the statistical analyses across these datasets
  – Available in the current versions of certain SEM software (Amos, Mplus)
  – Also available in SPSS (MVA), SAS (Proc MI and MIANALYZE), Stata (mi impute and mi estimate), and stand-alone missing data packages such as SOLAS
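The "averaging" step of multiple imputation is usually carried out with Rubin's rules: point estimates are averaged, while the pooled standard error combines the within-imputation and between-imputation variance. The sketch below shows only this pooling step, with a hypothetical function name and made-up inputs; the imputation itself would be done by one of the packages listed above.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # mean within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Made-up results for one regression coefficient from m = 3 imputed datasets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled estimate {estimate:.3f}, pooled SE {std_error:.3f}")
```

The between-imputation term is what separates multiple imputation from single imputation: it carries the uncertainty about the missing values into the standard error instead of treating imputed values as if they had been observed.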

Why do social scientists use modern missing data treatments so infrequently?

• Lack of awareness or familiarity

• They are not convinced of the problems with older methods

• The statistical literature on missing data is technically daunting

• The techniques aren’t incorporated into the standard statistical analysis procedures used by social scientists

• Journal reviewers and editors have not required it
