Missing Values: Adapting to Missing Data
Post on 19-Dec-2015
Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of subsample
Assumptions 1
• Missing Completely at Random (MCAR)
– the probability of data missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR)
– the probability of missing data on X is unrelated to the value of X after controlling for other variables in the analysis
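The difference between the two assumptions is easier to see with simulated data. The following is a minimal sketch, not part of the original slides: the variables age and income, the 30% missing rate, and the logistic missingness model are all made up for illustration.

```r
set.seed(1)
n <- 5000
age    <- rnorm(n, mean = 40, sd = 10)            # hypothetical fully observed variable
income <- 20 + 0.5 * age + rnorm(n, sd = 5)       # hypothetical variable that will get NAs

# MCAR: every income value has the same 30% chance of being missing
income_mcar <- ifelse(runif(n) < 0.3, NA, income)

# MAR: the chance that income is missing depends only on age (which is observed),
# not on income itself once age is taken into account
p_miss <- plogis((age - 40) / 5)
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Compare the mean of the observed values with the mean of the full data
means <- c(full = mean(income),
           mcar = mean(income_mcar, na.rm = TRUE),
           mar  = mean(income_mar,  na.rm = TRUE))
print(round(means, 2))
```

Under MCAR the observed mean stays close to the full-data mean. Under MAR it is pulled downward here, because older (higher income) cases are missing more often.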
Assumptions 2
• Ignorable
– MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable
– If not MAR, the missing data mechanism must be modeled to get good estimates of the parameters
Listwise Deletion 1
• Delete any samples with missing data
– Can be used for any statistical analysis
– No special computational methods
• If data are MCAR, the complete cases are in effect a random sample of the full data set, so estimates from them are unbiased
Listwise Deletion 2
• If data are MAR, listwise deletion can produce biased estimates if missingness in the independent variables depends on the dependent variable
• Main issue is the loss of observations and the increase in standard errors (meaning a decrease in the power of the test)
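The loss of observations can be checked directly. This sketch uses the airquality data set that ships with R as a stand-in, since it has missing values in two of its six variables; it is not the data set used in these slides.

```r
data(airquality)                       # built-in data set with NAs in Ozone and Solar.R

complete   <- na.omit(airquality)      # listwise deletion: drop every row with any NA
n_full     <- nrow(airquality)
n_complete <- nrow(complete)
cat("cases before:", n_full, " after listwise deletion:", n_complete, "\n")

# lm() applies listwise deletion by default (na.action = na.omit),
# so the model is fitted to the complete cases only
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
cat("cases used by lm():", nobs(fit), "\n")
```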
Listwise Deletion 3
• In anthropology listwise deletion often includes removal of variables (columns) as well as cases (rows)
• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values
Pairwise Deletion 1
• Compute means using available data and covariances using cases with observations for the pair being computed
• Uses more of the data
• If MCAR, reasonably unbiased estimates, but if MAR, estimates may be seriously biased
Pairwise Deletion 2
• Covariance/Correlation matrix may be singular or not positive definite, because each entry is computed from a different subset of cases
• Less of an issue with distance matrices
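A short sketch of pairwise versus listwise correlations, again using the built-in airquality data as a stand-in. The eigenvalue check at the end is one simple way to see whether a pairwise matrix is still positive definite; for this particular data set it may well be fine.

```r
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

r_listwise <- cor(vars, use = "complete.obs")           # same cases for every entry
r_pairwise <- cor(vars, use = "pairwise.complete.obs")  # different cases per entry

# Number of cases behind each pairwise entry: observed-by-observed cross product
observed <- 1 - is.na(vars)          # numeric 0/1 matrix, 1 = value present
n_pairs  <- crossprod(observed)
print(n_pairs)

# A negative or zero smallest eigenvalue would signal a matrix that is
# not positive definite
min_eigen <- min(eigen(r_pairwise, symmetric = TRUE, only.values = TRUE)$values)
print(min_eigen)
```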
Dummy Variable
• Create variable to flag observations missing on a particular variable
• Used in regression analysis but provides biased estimators
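A minimal sketch of the dummy variable approach. The slide only mentions the flag; filling the missing values with a constant (here the observed mean) so that the cases stay in the regression is an assumption about how the method is usually applied, and the column names Solar_missing and Solar_filled are made up.

```r
aq <- airquality

# Flag: 1 where Solar.R is missing, 0 otherwise
aq$Solar_missing <- as.integer(is.na(aq$Solar.R))

# Fill the gaps with a constant so the rows are not dropped (assumed step)
aq$Solar_filled <- ifelse(is.na(aq$Solar.R),
                          mean(aq$Solar.R, na.rm = TRUE),
                          aq$Solar.R)

# Regression using both the filled variable and the flag
fit <- lm(Temp ~ Solar_filled + Solar_missing, data = aq)
print(coef(fit))
```

All cases are retained, but as the slide notes the resulting coefficients are generally biased.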
Imputation
• Replace missing values with an estimate:
1. Mean for that variable – biased estimates of variances and covariances
2. Multiple regression to predict value – complicated when multiple variables contain missing values, and can still lead to underestimated standard errors
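Both single-imputation approaches can be sketched in a few lines of base R. The example below uses the built-in airquality data as a stand-in, and the choice of Wind and Temp as predictors is arbitrary.

```r
oz   <- airquality$Ozone
miss <- is.na(oz)

# 1. Mean imputation: every missing value is replaced by the observed mean
oz_mean <- ifelse(miss, mean(oz, na.rm = TRUE), oz)
var_observed     <- var(oz, na.rm = TRUE)
var_mean_imputed <- var(oz_mean)        # smaller: the filled-in values add no spread
cat("variance, observed only:", var_observed,
    " after mean imputation:", var_mean_imputed, "\n")

# 2. Regression imputation: predict the missing values from other variables
fit    <- lm(Ozone ~ Wind + Temp, data = airquality)
oz_reg <- oz
oz_reg[miss] <- predict(fit, newdata = airquality[miss, ])
```

The regression-imputed values fall exactly on the fitted surface, which is why analyses that treat them as real observations tend to understate the standard errors.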
Maximum Likelihood
• Select the parameter values that would maximize the probability of observing the actually observed data
• Categorical and continuous data
• Expectation-maximization algorithm gives estimates of means and covariances
Expectation Maximization
• Iterative steps of expectation and maximization to produce estimates that converge on the ML estimates
• These estimates will generally underestimate the standard errors in regression and other statistical models
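To make the iteration concrete, here is a hand-written sketch of EM for the simplest case: two variables assumed bivariate normal, with missing values only in y. The simulated data and all object names are hypothetical, and real analyses would normally rely on a package rather than a loop like this.

```r
set.seed(42)
n <- 500
x <- rnorm(n, mean = 10, sd = 2)
y <- 5 + 1.5 * x + rnorm(n, sd = 2)
y[runif(n) < plogis(x - 10)] <- NA     # MAR: missingness in y depends on observed x
miss <- is.na(y)

# Starting values from the available data
mu <- c(mean(x), mean(y, na.rm = TRUE))
S  <- matrix(c(var(x), 0, 0, var(y, na.rm = TRUE)), 2, 2)

for (iter in 1:500) {
  # E-step: expected y and y^2 for the missing cases, given x and current estimates
  beta      <- S[1, 2] / S[1, 1]
  resid_var <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  ey  <- ifelse(miss, mu[2] + beta * (x - mu[1]), y)
  ey2 <- ifelse(miss, ey^2 + resid_var, y^2)

  # M-step: update means and covariances from the expected sufficient statistics
  mu_new <- c(mean(x), mean(ey))
  sxy    <- mean(x * ey) - mu_new[1] * mu_new[2]
  S_new  <- matrix(c(mean(x^2) - mu_new[1]^2, sxy,
                     sxy, mean(ey2) - mu_new[2]^2), 2, 2)

  converged <- max(abs(mu_new - mu), abs(S_new - S)) < 1e-8
  mu <- mu_new
  S  <- S_new
  if (converged) break
}

cat("iterations:", iter, "\n")
cat("mean of observed y:", mean(y, na.rm = TRUE), " EM estimate of mean y:", mu[2], "\n")
```

In this setup the mean of the observed y values is too low, because cases with large x are missing more often, and the EM estimate corrects for that by using the relationship between x and y.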
Multiple Imputation 1
• Has the same optimal properties as ML, plus several advantages
• Can be used with any kind of data and any kind of statistical model
• But produces multiple estimates which must be combined
• Random component used to give unbiased estimates
Multiple Imputation 2
• Multivariate normal model (relatively resistant to deviations)
• Each variable represented as a linear function of the other variables
• Methods
– Data Augmentation, package norm
– Sampling Importance/Resampling, package Amelia
Multiple Imputation 3
• Categorical data, multinomial model, package cat
• Categorical and interval/ratio data, package mix
• Also can use multivariate normal models with dummy variables
Multiple Imputation 4
• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values, package Hmisc, function aregImpute
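The matching idea can be sketched in base R for a single variable. This is a simplified illustration and not what aregImpute does internally: the number of donors k, the predictors, and the use of the built-in airquality data are all assumptions, and the sketch ignores the uncertainty in the regression coefficients.

```r
set.seed(7)
aq   <- airquality
miss <- is.na(aq$Ozone)

# Step 1: regression predictions for every case, missing or not
fit  <- lm(Ozone ~ Wind + Temp, data = aq)
pred <- predict(fit, newdata = aq)

# Step 2: for each missing case, find the k observed cases with the closest
# predictions and copy the actual value of one of them, chosen at random
k           <- 5                          # hypothetical number of candidate donors
donor_pred  <- pred[!miss]
donor_value <- aq$Ozone[!miss]

imputed <- vapply(which(miss), function(i) {
  nearest <- order(abs(donor_pred - pred[i]))[1:k]
  as.numeric(donor_value[sample(nearest, 1)])
}, numeric(1))

aq$Ozone[miss] <- imputed
summary(aq$Ozone)
```

Because every imputed value is copied from a real case, the imputations stay within the range of values that were actually observed.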
Analysis
• The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined
• Packages such as Zelig provide ways of combining the estimates for generalized linear models
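The slides do not say how the estimates are combined; the usual approach is Rubin's rules, sketched below in base R. The imputation step here is a deliberately crude stand-in (regression predictions plus random noise on the built-in airquality data) so that the example runs without extra packages; m = 5 and all object names are assumptions.

```r
set.seed(123)
m    <- 5                                   # number of imputed data sets (assumed)
miss <- is.na(airquality$Ozone)

# Crude stochastic regression imputation, used only to get m different data sets
imp_model <- lm(Ozone ~ Wind + Temp, data = airquality)
sigma_hat <- summary(imp_model)$sigma

fits <- lapply(seq_len(m), function(j) {
  d <- airquality
  d$Ozone[miss] <- predict(imp_model, newdata = d[miss, ]) +
    rnorm(sum(miss), sd = sigma_hat)
  lm(Ozone ~ Wind + Temp, data = d)         # the analysis, run on each data set
})

# Rubin's rules
est     <- sapply(fits, coef)                          # coefficients, one column per data set
se2     <- sapply(fits, function(f) diag(vcov(f)))     # squared standard errors
q_bar   <- rowMeans(est)                               # pooled estimate
within  <- rowMeans(se2)                               # average within-imputation variance
between <- apply(est, 1, var)                          # between-imputation variance
total   <- within + (1 + 1 / m) * between              # total variance

pooled <- cbind(estimate = q_bar, se = sqrt(total))
print(round(pooled, 3))
```

The pooled standard error is never smaller than the average within-imputation standard error, which is how multiple imputation carries the uncertainty about the missing values into the final result.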
Missing Data with R 1
• NA is used to identify a missing value
• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active Data set | Remove cases with missing values)
Missing Data with R 2
• Some functions have an na.rm= option. TRUE means remove missing values before computing; FALSE means do not remove them, so that the function returns NA if there are missing values.
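A short illustration of is.na() and the na.rm= option, using the vector from the earlier slide:

```r
x <- c(1:4, NA, 6:10)

is.na(x)                  # logical vector, TRUE only at the missing position
n_missing <- sum(is.na(x))

m_default <- mean(x)                 # na.rm = FALSE is the default, so the result is NA
m_removed <- mean(x, na.rm = TRUE)   # mean of the nine observed values

print(c(n_missing = n_missing, m_default = m_default, m_removed = m_removed))
```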
Missing Data in R 3
• Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to one of the following options: na.fail, na.omit, na.exclude to remove cases (omit, exclude) or have the analysis fail
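The practical difference between the three settings shows up in what the fitted model returns. A small sketch with the built-in airquality data, used here as a stand-in:

```r
fit_omit <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)
fit_excl <- lm(Ozone ~ Wind, data = airquality, na.action = na.exclude)

# Both fits use the same cases, but na.exclude pads residuals and fitted
# values with NA so they line up with the rows of the original data
n_resid_omit <- length(residuals(fit_omit))
n_resid_excl <- length(residuals(fit_excl))
cat("residuals with na.omit:", n_resid_omit,
    " with na.exclude:", n_resid_excl, "\n")

# na.fail stops with an error as soon as a missing value is found
attempt <- try(lm(Ozone ~ Wind, data = airquality, na.action = na.fail),
               silent = TRUE)
failed <- inherits(attempt, "try-error")
cat("na.fail raised an error:", failed, "\n")
```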
Missing Data in R 4
• Other functions (e.g. cor, cov, var) have a use= option:
– everything (NA’s propagate)
– all.obs (NA causes error)
– complete.obs (delete cases with NA’s; error if no complete cases remain)
– na.or.complete (delete cases with NA’s; returns NA if no complete cases remain)
– pairwise.complete.obs (complete pairs of observations)
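The behavior of the first four settings can be checked on any data frame with missing values; the built-in airquality data is used below as a stand-in.

```r
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# everything: any correlation involving a variable with NAs is itself NA
r_everything <- cor(vars, use = "everything")
print(round(r_everything, 2))

# all.obs: missing values are treated as an error
attempt <- try(cor(vars, use = "all.obs"), silent = TRUE)
all_obs_error <- inherits(attempt, "try-error")

# complete.obs and na.or.complete: both drop incomplete cases and agree
# whenever at least one complete case exists
r_complete    <- cor(vars, use = "complete.obs")
r_or_complete <- cor(vars, use = "na.or.complete")
same_result   <- isTRUE(all.equal(r_complete, r_or_complete))
cat("all.obs error:", all_obs_error, " complete.obs == na.or.complete:", same_result, "\n")
```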
Example 1
• ErnestWitte data set has missing values among the 242 cases and 38 variables
• Using R to remove all cases with missing values reduces the number of cases to 52!
• If we don’t need all of the variables we can retain more cases
Example 2
• Total NA’s in ErnestWitte (815)
– sum(is.na(ErnestWitte))
• Check missing values by variable:
– sort(apply(ErnestWitte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, SkullPos 126, Depos 112
• Removing these three variables and then the cases that still have missing values gives 112 cases
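The same workflow (count missing values by variable, drop the worst variables, then drop incomplete cases) can be written as a short script. The ErnestWitte data are not included here, so the sketch runs on the built-in airquality data instead; the cutoff of 20 missing values is an arbitrary, hypothetical threshold.

```r
dat <- airquality                      # stand-in for ErnestWitte

# Total NAs and NAs per variable, largest first
total_na  <- sum(is.na(dat))
na_by_var <- sort(colSums(is.na(dat)), decreasing = TRUE)
print(na_by_var)

# Drop variables with many missing values (hypothetical cutoff of 20)
drop_vars <- names(na_by_var)[na_by_var > 20]
reduced   <- dat[, !(names(dat) %in% drop_vars), drop = FALSE]

# Then drop the rows that still have missing values
n_all_vars <- nrow(na.omit(dat))       # complete cases keeping every variable
n_reduced  <- nrow(na.omit(reduced))   # complete cases after dropping variables
cat("dropped:", drop_vars, " complete cases:", n_all_vars, "->", n_reduced, "\n")
```

Here colSums(is.na(dat)) gives the same per-variable counts as the apply() call on the slide.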