Post on 19-Dec-2015


Missing Values

Adapting to missing data

Sources of Missing Data

• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of subsample

Assumptions 1

• Missing Completely at Random (MCAR)
– the probability of data missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR)
– the probability of missing data on X is unrelated to the value of X after controlling for other variables in the analysis
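
The difference between the two assumptions can be illustrated with simulated data. This is a minimal sketch in base R; the variables age and income, the 30% missingness rate, and the logistic missingness model are all invented for illustration.

```r
set.seed(1)
n <- 1000
age <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every income value has the same 30% chance of being missing
income_mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance that income is missing depends on age (which is observed),
# but not on income itself once age is taken into account
p_miss <- plogis((age - 40) / 5)
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Compare the mean of the observed values with the mean of the full data
c(full = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar, na.rm = TRUE))
```

Under MCAR the observed mean stays close to the full-data mean; under MAR it drifts downward here because older (higher income) cases are more likely to be missing.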

Assumptions 2

• Ignorable
– MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable
– If not MAR, the missing data mechanism must be modeled to get good estimates of parameters

Methods

1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation

Listwise Deletion 1

• Delete any cases with missing data
– Can be used for any statistical analysis
– No special computational methods
• If data are MCAR, the complete cases are in effect a random sample of the full data set, so estimates based on them are unbiased

Listwise Deletion 2

• If data are MAR, listwise deletion can produce biased estimates when missingness on the independent variables depends on the dependent variable

• Main issue is the loss of observations and the increase in standard errors (meaning a decrease in the power of the test)
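
The loss of observations and the resulting increase in standard errors can be sketched with simulated data. The data, the model y ~ x, and the choice of 80 deleted values are assumptions made for this example only.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
full <- data.frame(x = x, y = y)

# Knock out 80 values of x completely at random (MCAR)
holes <- full
holes$x[sample(n, 80)] <- NA

# Listwise deletion: keep only the complete cases
complete <- na.omit(holes)
nrow(complete)

# Standard error of the slope, full data versus complete cases
se_full <- summary(lm(y ~ x, data = full))$coefficients["x", "Std. Error"]
se_cc   <- summary(lm(y ~ x, data = complete))$coefficients["x", "Std. Error"]
c(full = se_full, listwise = se_cc)
```

The slope estimate remains unbiased under MCAR, but it is estimated from 120 cases instead of 200, so its standard error is larger.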

Listwise Deletion 3

• In anthropology listwise deletion often includes removal of variables (columns) as well as cases (rows)

• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values
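
One way to sketch this two-step removal in base R is shown below. The helper name drop_sparse, its max_na_frac threshold, and the toy data frame are hypothetical; the threshold that counts as "many" missing values is a judgment call.

```r
# Hypothetical helper: drop columns whose share of NAs exceeds a threshold,
# then drop any rows that still contain an NA.
drop_sparse <- function(df, max_na_frac = 0.5) {
  na_frac <- colMeans(is.na(df))
  kept <- df[, na_frac <= max_na_frac, drop = FALSE]
  kept[complete.cases(kept), , drop = FALSE]
}

toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(NA, NA, NA, NA, 5, 6),   # mostly missing
  c = c(1, NA, 3, 4, 5, 6),
  d = c(1, 2, 3, 4, 5, NA)
)

dim(na.omit(toy))      # plain listwise deletion
dim(drop_sparse(toy))  # drop the sparse column first, then incomplete rows
```

Plain listwise deletion keeps a single row of the toy data, while dropping column b first keeps four rows at the cost of one variable.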

Pairwise Deletion 1

• Compute means using available data and covariances using cases with observations for the pair being computed
• Uses more of the data
• If MCAR, reasonably unbiased estimates, but if MAR, estimates may be seriously biased

Pairwise Deletion 2

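• Covariance/Correlation matrix may not be positive definite (it can be singular or have negative eigenvalues) because each entry is computed from a different set of cases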

• Less of an issue with distance matrices
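
The problem with pairwise matrices can be shown with a deliberately extreme, made-up data set in which each pair of variables is observed on a different block of cases.

```r
# Each pair of variables overlaps on a different block of four cases
d <- data.frame(
  x = c(1, 2, 3, 4, NA, NA, NA, NA, 1, 2, 3, 4),
  y = c(1, 2, 3, 4, 1, 2, 3, 4, NA, NA, NA, NA),
  z = c(NA, NA, NA, NA, 1, 2, 3, 4, 4, 3, 2, 1)
)

# Pairwise deletion: every correlation uses whatever cases that pair shares
R <- cor(d, use = "pairwise.complete.obs")
R

# A valid correlation matrix has no negative eigenvalues; this one does
eigen(R, symmetric = TRUE)$values
```

Here x and y correlate perfectly, y and z correlate perfectly, yet x and z correlate negatively, a combination no single complete data set could produce.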

Dummy Variable

• Create variable to flag observations missing on a particular variable

• Used in regression analysis but provides biased estimators
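
The mechanics of dummy variable adjustment can be sketched as follows. The simulated data and the variable names x_missing and x_filled are assumptions for this example; the sketch shows only how the flag is built, not a recommendation of the method.

```r
set.seed(3)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

x_obs <- x
x_obs[sample(n, 90)] <- NA

# Flag the cases that are missing on x ...
x_missing <- as.integer(is.na(x_obs))
# ... and fill the holes with a constant (here the observed mean)
x_filled <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)

# All 300 cases now enter the regression
fit <- lm(y ~ x_filled + x_missing)
coef(fit)
```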

Imputation

• Replace missing values with an estimate:
1. Mean for that variable – biased estimates of variances and covariances
2. Multiple regression to predict value – complicated when several variables contain missing values, and can still lead to underestimated standard errors
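
The bias from mean imputation is easy to demonstrate: filling every hole with the same value leaves the mean unchanged but shrinks the variance. The simulated variable below is an assumption for illustration.

```r
set.seed(4)
x <- rnorm(500, mean = 50, sd = 10)
x_na <- x
x_na[sample(500, 150)] <- NA

# Mean imputation: replace every NA with the mean of the observed values
x_mean <- ifelse(is.na(x_na), mean(x_na, na.rm = TRUE), x_na)

# The mean is preserved, but the variance is understated
c(observed = mean(x_na, na.rm = TRUE), imputed = mean(x_mean))
c(observed = var(x_na, na.rm = TRUE),  imputed = var(x_mean))
```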

Maximum Likelihood

• Try to reconstruct the complete data set by selecting values that would maximize the probability of observing the actually observed data

• Categorical and continuous data
• Expectation-maximization algorithm gives estimates of means and covariances

Expectation Maximization

• Iterative steps of expectation and maximization to produce estimates that converge on the ML estimates

• These estimates will generally underestimate the standard errors in regression and other statistical models
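
The iteration can be sketched for the simplest case: two roughly normal variables, with x fully observed and y missing for some cases. Everything here (the simulated data, the starting values, the tolerance) is an assumption for illustration, and the sketch estimates only the mean of y and the covariances, not standard errors.

```r
set.seed(5)
n <- 500
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
y_obs <- y
y_obs[runif(n) < plogis(x)] <- NA   # MAR: missingness depends on x only
miss <- is.na(y_obs)

# x is complete, so its mean and variance (ML form, divisor n) are fixed
mu_x <- mean(x)
s_xx <- mean((x - mu_x)^2)

# Starting values from the available cases
mu_y <- mean(y_obs, na.rm = TRUE)
s_yy <- var(y_obs, na.rm = TRUE)
s_xy <- cov(x[!miss], y_obs[!miss])

for (iter in 1:200) {
  # E-step: expected y and y^2 for the missing cases, given x
  beta <- s_xy / s_xx
  resid_var <- s_yy - beta * s_xy
  ey  <- ifelse(miss, mu_y + beta * (x - mu_x), y_obs)
  ey2 <- ifelse(miss, ey^2 + resid_var, y_obs^2)

  # M-step: update the parameters from the expected sufficient statistics
  new_mu_y <- mean(ey)
  new_s_xy <- mean(x * ey) - mu_x * new_mu_y
  new_s_yy <- mean(ey2) - new_mu_y^2

  change <- max(abs(c(new_mu_y - mu_y, new_s_xy - s_xy, new_s_yy - s_yy)))
  mu_y <- new_mu_y; s_xy <- new_s_xy; s_yy <- new_s_yy
  if (change < 1e-8) break
}

c(available_case_mean = mean(y_obs, na.rm = TRUE), em_mean = mu_y)
```

The available-case mean of y is pulled downward because high-x cases are more often missing; the EM estimate corrects for this by using the relationship between x and y.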

Multiple Imputation 1

• Has the same optimal properties as ML, plus several advantages

• Can be used with any kind of data and any kind of statistical model

• But produces multiple estimates which must be combined

• Random component used to give unbiased estimates

Multiple Imputation 2

• Multivariate normal model (relatively resistant to deviations)

• Each variable represented as a linear function of the other variables

• Methods
– Data Augmentation, package norm
– Sampling Importance/Resampling, package Amelia

Multiple Imputation 3

• Categorical data, multinomial model, package cat

• Categorical and interval/ratio data, package mix

• Also can use multivariate normal models with dummy variables

Multiple Imputation 4

• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values, package Hmisc, function aregImpute
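
The idea behind predictive mean matching can be sketched in base R. This is not the aregImpute implementation; the simulated data, the single predictor, and the choice of k = 5 donors are assumptions made for this example.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
y_obs <- y
y_obs[sample(n, 50)] <- NA
miss <- is.na(y_obs)

# Regression fitted on the complete cases (lm drops NA rows by default)
fit <- lm(y_obs ~ x)
pred <- coef(fit)[1] + coef(fit)[2] * x   # predicted y for every case

k <- 5                    # number of candidate donors per missing case
donors <- which(!miss)
y_imp <- y_obs
for (i in which(miss)) {
  # the k complete cases whose predictions are closest to this case's prediction
  nearest <- donors[order(abs(pred[donors] - pred[i]))[1:k]]
  # borrow the actual observed value of one randomly chosen donor
  y_imp[i] <- y_obs[nearest[sample.int(k, 1)]]
}

summary(y_imp)
```

Because every imputed value is an actually observed value, the imputations always stay within the range of the real data.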

Analysis

• The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined

• Packages such as Zelig provide ways of combining the results from the imputed data sets for generalized linear models
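
The run-then-combine step can be sketched in base R using the standard combining rules (Rubin's rules): average the estimates, then add the within-imputation and between-imputation variances. The simulated data, the bootstrap-based imputation model, and m = 20 are assumptions for illustration, not the method of any particular package.

```r
set.seed(6)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)
dat$x[runif(n) < 0.3] <- NA
miss <- is.na(dat$x)

m <- 20
est <- numeric(m)   # slope estimate from each imputed data set
se2 <- numeric(m)   # its squared standard error

for (j in seq_len(m)) {
  # Bootstrap the complete cases so the imputation model's own
  # uncertainty carries through to the imputations
  cc <- dat[!miss, ]
  boot <- cc[sample(nrow(cc), replace = TRUE), ]
  imp_model <- lm(x ~ y, data = boot)

  # Impute with a random component: prediction plus residual noise
  filled <- dat
  filled$x[miss] <- predict(imp_model, newdata = dat[miss, "y", drop = FALSE]) +
    rnorm(sum(miss), sd = summary(imp_model)$sigma)

  fit <- lm(y ~ x, data = filled)
  est[j] <- coef(fit)["x"]
  se2[j] <- vcov(fit)["x", "x"]
}

# Combine the m results
pooled_est <- mean(est)
within     <- mean(se2)
between    <- var(est)
pooled_se  <- sqrt(within + (1 + 1 / m) * between)
c(estimate = pooled_est, se = pooled_se)
```

The between-imputation term is what a single imputation leaves out, which is why single imputation understates standard errors.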

Missing Data with R 1

• NA is used to identify a missing value

• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))

• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active Data set | Remove cases with missing values)
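
A short illustration of these functions follows; the small data frame is made up for the example.

```r
v <- c(1:4, NA, 6:10)
is.na(v)          # TRUE only in the fifth position
which(is.na(v))   # position of the missing value
sum(is.na(v))     # number of missing values

people <- data.frame(
  id     = 1:4,
  height = c(1.7, NA, 1.6, 1.8),
  weight = c(65, 70, NA, 80)
)
complete.cases(people)    # which rows have no missing values
clean <- na.omit(people)  # keeps only those rows
clean
```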

Missing Data with R 2

• Some functions (e.g. mean, sum, sd) have an na.rm= option. TRUE means missing values are removed before computing; FALSE means they are not removed, so the function returns NA if there are any missing values.
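
For example, with a small made-up vector:

```r
v <- c(2, 4, NA, 8)

mean(v)                  # NA, because na.rm defaults to FALSE
mean(v, na.rm = TRUE)    # mean of 2, 4 and 8
sum(v, na.rm = TRUE)     # 14
sd(v, na.rm = TRUE)
```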

Missing Data in R 3

• Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to na.fail (the analysis fails if there are missing values) or to na.omit or na.exclude (cases with missing values are removed; na.exclude also pads residuals and fitted values with NA so they line up with the original cases)
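
The three settings can be compared on a small made-up data frame:

```r
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.1, NA, 2.9, 4.2, NA, 6.1)
)

fit_omit <- lm(y ~ x, data = df, na.action = na.omit)
fit_excl <- lm(y ~ x, data = df, na.action = na.exclude)

# Same coefficients, but the residuals differ in length
length(residuals(fit_omit))   # 4: only the cases that were used
length(residuals(fit_excl))   # 6: NA in the rows that were dropped

# na.fail stops with an error as soon as a missing value is found
attempt <- try(lm(y ~ x, data = df, na.action = na.fail), silent = TRUE)
failed <- inherits(attempt, "try-error")
failed
```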

Missing Data in R 4

• Other functions (e.g. cor, cov, var) have a use= option:
– everything (NA’s propagate)
– all.obs (NA causes error)
– complete.obs (delete cases with NA’s; error if no complete cases remain)
– na.or.complete (delete cases with NA’s; returns NA if no complete cases remain)
– pairwise.complete.obs (complete pairs of observations)
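
The effect of the use= settings can be seen on a small made-up data frame in which columns a and b each have one missing value:

```r
d <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 1, 4, 3, NA, 6),
  c = c(1, 3, 2, 5, 4, 6)
)

r_all  <- cor(d)                                 # default: NA's propagate
r_cc   <- cor(d, use = "complete.obs")           # rows 1 to 4 only
r_pair <- cor(d, use = "pairwise.complete.obs")  # each pair uses its own rows

r_all["a", "c"]    # NA
r_cc["a", "c"]     # based on 4 cases
r_pair["a", "c"]   # based on 5 cases

# all.obs refuses to run when any value is missing
attempt <- try(cor(d, use = "all.obs"), silent = TRUE)
inherits(attempt, "try-error")
```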

Example 1

• ErnestWitte data set has missing values among the 242 cases and 38 variables

• Using R to remove all cases with missing values reduces the number of cases to 52!

• If we don’t need all of the variables we can retain more cases

Example 2

• Total NA’s in ErnestWitte (815)
• sum(is.na(ErnestWitte))
• Check missing values by variable:
• sort(apply(ErnestWitte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, SkullPos 126, Depos 112
• Removing these three variables leaves 112 complete cases
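
The same steps can be tried on airquality, a data set that ships with R, as a stand-in for ErnestWitte (which is not assumed to be available here). The choice to drop only the single worst variable is for illustration.

```r
data(airquality)

# Total number of missing values
sum(is.na(airquality))

# Missing values by variable, worst first
na_by_var <- sort(colSums(is.na(airquality)), decreasing = TRUE)
na_by_var

# Complete cases with all variables kept
nrow(na.omit(airquality))

# Complete cases after dropping the variable with the most missing values
worst <- names(na_by_var)[1]
reduced <- airquality[, setdiff(names(airquality), worst)]
nrow(na.omit(reduced))
```

colSums(is.na(...)) gives the same per-variable counts as the apply() call on the slide.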

Multiple Imputation with R

• A wide variety of options:
– Packages norm, cat, mix
– Package Amelia
– Package mi (relatively new, but flexible)